[SCore-users-jp] [SCore-users] Two problems with SCore
Jure Jerman
jure.jerman @ rzs-hm.si
Wed May 18 19:46:03 JST 2005
Hello,
after a long period of inactivity we have started to deal with the
ULT:PANIC problem again (see the score-users mailing list:
http://www.pccluster.org/pipermail/score-users/2004-December/002305.html).
Just to recall the situation:
- we are running SCore 5.8.2 on a 14-node dual-Xeon cluster
- we had sudden (and unrepeatable) resets of SCore, where SCore died
  with a ULT:PANIC error.
At that time the conclusion was that we could avoid ULT:PANIC errors by
running the application not on just one node but on several.
Now we have the same type of problem (ULT:PANIC) with code compiled with
the Intel Fortran compiler, even when running on several nodes. The very
same code compiled with the Lahey/Fujitsu compiler runs fine. The SCore
crashes are unrepeatable and quite frequent.
To rule out a fault in the network hardware, we ran tests with two
different sets of network cards and switches, but the result was the
same, so we can say with a fairly high degree of confidence that it is
not a network hardware problem.
We even separated the SCore network from the I/O network (every node has
two NICs: one is dedicated to SCore, the other is used for TCP/IP, NFS, ...).
The dump of a crash is attached. What makes the whole story really
confusing is the fact that the Fujitsu-compiled binary runs fine and
that the problems are unrepeatable.
We have another problem (which is not that important): what could be
the reason that we cannot checkpoint a statically linked application
compiled with the Intel compiler? If checkpointing is triggered via
sc_console, the code just runs on.
I would be very grateful for any clue about these problems. We are
especially interested in solving the first one.
Thank you,
Jure Jerman
Environmental Agency of Slovenia
_______________________________________________
SCore-users mailing list
SCore-users @ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users