[SCore-users-jp] Re: [SCore-users] Two problems with SCore

Shinji Sumimoto s-sumi @ flab.fujitsu.co.jp
Sat May 21 14:00:58 JST 2005


Hi.

Sorry for the late response.

From: Jure Jerman <jure.jerman @ rzs-hm.si>
Subject: [SCore-users] Two problems with SCore
Date: Wed, 18 May 2005 12:46:03 +0200
Message-ID: <428B1CEB.1090608 @ rzs-hm.si>

jure.jerman> Hello,
jure.jerman> 
jure.jerman> After a long period of inactivity we have started to deal with the
jure.jerman> ULT:PANIC problem again (see the SCore mailing list:
jure.jerman> http://www.pccluster.org/pipermail/score-users/2004-December/002305.html)
jure.jerman> 
jure.jerman> Just to recall the situation:
jure.jerman> - We are running SCore 5.8.2 on a 14-node dual-Xeon cluster.
jure.jerman> - We had sudden (and unrepeatable) resets of SCore, where SCore died
jure.jerman>   with a ULT:PANIC error.

jure.jerman> At that time the conclusion was that we could avoid ULT:PANIC errors
jure.jerman> by running the application on several nodes rather than on just one.
jure.jerman> 
jure.jerman> Now we have the same type of problem (ULT:PANIC) with code compiled
jure.jerman> with the Intel Fortran compiler, even when running on several nodes.
jure.jerman> The very same code compiled with the Lahey/Fujitsu compiler runs fine.
jure.jerman> The SCore crashes are not reproducible and quite frequent.

jure.jerman> 
jure.jerman> To rule out a fault in the network hardware, we ran tests with two
jure.jerman> different sets of network cards and switches, but the result was the
jure.jerman> same, so we can say with a fairly high degree of confidence that it
jure.jerman> is not a network hardware problem.

Usually, ULT:PANIC errors come from communication errors and do not
depend on the compiler. You are using PM/Ethernet, right?  If so, please
add the checksum option to your pm-ethernet.conf and try running the
program again.

jure.jerman> We even separated the SCore network from the I/O network (every node
jure.jerman> has two NICs: one is dedicated to SCore, the other to TCP/IP, NFS, ...).
jure.jerman> 
jure.jerman> The dump of a crash is attached. What makes the whole story really
jure.jerman> confusing is the fact that the Fujitsu-compiled binary runs fine and
jure.jerman> that the problems are not reproducible.

jure.jerman> We have another problem (which is not that important): what could be
jure.jerman> the reason that we cannot checkpoint a statically linked application
jure.jerman> compiled with the Intel compiler? If checkpointing is triggered via
jure.jerman> sc_console, the code just runs on.

Which version of the Intel compiler are you using? If you are using a
version after 8.0, the checkpoint function cannot be used because
those versions of the Intel compilers use the pthread library. SCore
does not currently support the checkpoint function on pthread binaries.
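
One rough way to see whether a given binary pulls in pthread is to look
for pthread symbol names inside the file. The following is only a
minimal sketch, not an SCore tool: the marker strings and both function
names are assumptions chosen for illustration. A stripped, statically
linked binary may contain none of these strings even if it uses
threads, so a negative result proves nothing.

```python
from pathlib import Path
import sys

# Assumed marker strings: names that commonly appear in a binary that
# links against POSIX threads. This list is illustrative, not exhaustive.
PTHREAD_MARKERS = (b"pthread_create", b"libpthread")


def references_pthread(data: bytes) -> bool:
    """Return True if any marker string occurs in the raw bytes."""
    return any(marker in data for marker in PTHREAD_MARKERS)


def binary_references_pthread(path: str) -> bool:
    """Read the file at `path` and scan its bytes for pthread markers."""
    return references_pthread(Path(path).read_bytes())


if __name__ == "__main__":
    # Usage: python check_pthread.py ./my_app [more binaries...]
    for name in sys.argv[1:]:
        if binary_references_pthread(name):
            print(f"{name}: pthread strings found")
        else:
            print(f"{name}: no pthread strings found")
```

If the Intel-compiled binary reports pthread strings and the
Lahey/Fujitsu one does not, that would be consistent with the
checkpoint restriction described above.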

jure.jerman> I would be very grateful for any clue about these problems. We are
jure.jerman> especially interested in solving the first one.
jure.jerman> 
jure.jerman> 
jure.jerman> Thank you,
jure.jerman> 
jure.jerman> Jure Jerman
jure.jerman> Environmental Agency of Slovenia
jure.jerman> _______________________________________________
jure.jerman> SCore-users mailing list
jure.jerman> SCore-users @ pccluster.org
jure.jerman> http://www.pccluster.org/mailman/listinfo/score-users
------
Shinji Sumimoto, Fujitsu Labs