[SCore-users-jp] Re: [SCore-users] Two problems with SCore
Shinji Sumimoto
s-sumi @ flab.fujitsu.co.jp
Sat May 21 14:00:58 JST 2005
Hi.
Sorry for the late response.
From: Jure Jerman <jure.jerman @ rzs-hm.si>
Subject: [SCore-users] Two problems with SCore
Date: Wed, 18 May 2005 12:46:03 +0200
Message-ID: <428B1CEB.1090608 @ rzs-hm.si>
jure.jerman> Hello,
jure.jerman>
jure.jerman> after long time of inactivity we started to deal with ULT:Panic problem
jure.jerman> again (see score-mailing-list
jure.jerman> http://www.pccluster.org/pipermail/score-users/2004-December/002305.html)
jure.jerman>
jure.jerman> Just to recall the situation:
jure.jerman> -we are running SCore 5.8.2 on 14 node dual Xeon cluster
jure.jerman> -we had sudden (and unrepeatable) resets of score where score died with
jure.jerman> ULT:PANIC error.
jure.jerman> At that time the conclusion was that we can avoid ULT:PANIC type of
jure.jerman> errors with running
jure.jerman> application not just on one node but on several.
jure.jerman>
jure.jerman> Now we have the same type of problems (ULT:PANIC) with the code compiled
jure.jerman> with Intel
jure.jerman> fortran compiler even running on several nodes. The very same code
jure.jerman> compiled with
jure.jerman> Lahey/Fujitsu compiler runs fine. The score crashes are unrepeatable and
jure.jerman> quite often.
jure.jerman>
jure.jerman> Just to avoid the option that there is something wrong with network
jure.jerman> hardware we did
jure.jerman> tests with two different sets of network cards and switches but result
jure.jerman> was the same so we
jure.jerman> can say with pretty high degree of accuracy that it is not network
jure.jerman> hardware problem.
Usually, ULT:PANIC errors come from communication errors and do not
depend on the compiler. You are using PM/Ethernet, right? If so, please
add the checksum option to your pm-ethernet.conf and try running the
program again.
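
As a rough illustration of why a checksum helps with this kind of diagnosis: when each message carries a checksum, a corrupted message can be detected and rejected on receipt instead of being processed and causing a failure later. The sketch below uses CRC32 from Python's zlib purely as a stand-in. The checksum algorithm that PM/Ethernet actually uses is not stated in this thread, and the function names and payload are hypothetical.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC32 of the payload (stand-in for a link-level checksum)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def frame_is_intact(frame: bytes) -> bool:
    """Recompute the CRC32 over the payload and compare it with the stored value."""
    payload, stored = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


def flip_bit(frame: bytes, index: int) -> bytes:
    """Simulate a transmission error by flipping the lowest bit of one byte."""
    damaged = bytearray(frame)
    damaged[index] ^= 0x01
    return bytes(damaged)


if __name__ == "__main__":
    frame = make_frame(b"hypothetical message payload")
    print(frame_is_intact(frame))               # undamaged frame passes the check
    print(frame_is_intact(flip_bit(frame, 3)))  # single-bit damage is detected
```

If crashes stop or turn into reported checksum failures once checking is enabled, that would point toward data corruption on the communication path rather than toward the compiler.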
jure.jerman> We even separated score network from I/O network (every node has two
jure.jerman> NICs, one is dedicated
jure.jerman> for SCore), the other for tcp-ip, nfs, ...
jure.jerman>
jure.jerman> The dump of a crash is attached. What makes the whole story really
jure.jerman> confusing is the fact
jure.jerman> that Fujitsu compiled binary runs fine and that problems are unrepeatable.
jure.jerman> We have another problem (which is not that important): what could be
jure.jerman> the reason, that we can not checkpoint statically
jure.jerman> linked application compiled with intel compiler. If the checkpointing is
jure.jerman> triggered via sc_console
jure.jerman> the code just runs on.
Which version of the Intel compiler are you using? If you are using a
version after 8.0, the checkpoint function cannot be used, because
those versions of the Intel compilers use the pthread library. SCore
does not currently support the checkpoint function for pthread binaries.
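
One quick way to check whether a given binary was built against pthread is to look for pthread-related names inside the file. The sketch below is only a heuristic under stated assumptions: the marker strings are a guess at what such a binary might contain, a stripped static binary may contain none of them, and a match does not prove that threads are created at run time. The function name and marker list are hypothetical.

```python
from pathlib import Path

# Byte strings whose presence suggests the binary references pthread (assumed markers).
PTHREAD_MARKERS = (b"pthread_create", b"libpthread")


def mentions_pthread(binary_path: str) -> bool:
    """Return True if any pthread marker string occurs in the file's raw bytes."""
    data = Path(binary_path).read_bytes()
    return any(marker in data for marker in PTHREAD_MARKERS)
```

Running it on the same program built with each compiler, for example `mentions_pthread("./a.out")`, gives a first hint about whether the pthread dependency differs between the two builds.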
jure.jerman> I would be very grateful for any clue about problems. We are specially
jure.jerman> interested in
jure.jerman> solving the first one.
jure.jerman>
jure.jerman>
jure.jerman> Thank you,
jure.jerman>
jure.jerman> Jure Jerman
jure.jerman> Environmental Agency of Slovenia
jure.jerman> _______________________________________________
jure.jerman> SCore-users mailing list
jure.jerman> SCore-users @ pccluster.org
jure.jerman> http://www.pccluster.org/mailman/listinfo/score-users
------
Shinji Sumimoto, Fujitsu Labs