[SCore-users-jp] [SCore-users] Two problems with SCore

Jure Jerman jure.jerman @ rzs-hm.si
Wed May 18 19:46:03 JST 2005


Hello,

After a long period of inactivity, we have returned to the ULT:PANIC problem
(see the score-users mailing list:
http://www.pccluster.org/pipermail/score-users/2004-December/002305.html)

To recall the situation:
- We are running SCore 5.8.2 on a 14-node dual-Xeon cluster.
- We had sudden (and unrepeatable) resets of SCore, where SCore died with a
  ULT:PANIC error.

At that time, the conclusion was that we could avoid ULT:PANIC errors by
running the application on several nodes rather than on just one.

Now we have the same type of problem (ULT:PANIC) with code compiled with the
Intel Fortran compiler, even when running on several nodes. The very same code
compiled with the Lahey/Fujitsu compiler runs fine. The SCore crashes are
unrepeatable but quite frequent.

To rule out a fault in the network hardware, we ran tests with two different
sets of network cards and switches, but the result was the same, so we can say
with a fairly high degree of confidence that this is not a network hardware
problem.

We even separated the SCore network from the I/O network (every node has two
NICs: one is dedicated to SCore, the other to TCP/IP, NFS, ...).

The dump of a crash is attached. What makes the whole story really confusing
is the fact that the Fujitsu-compiled binary runs fine and that the problems
are unrepeatable.


We have another problem (which is not as important): what could be the reason
that we cannot checkpoint a statically linked application compiled with the
Intel compiler? If checkpointing is triggered via sc_console, the code just
keeps running.

I would be very grateful for any clues about these problems. We are especially
interested in solving the first one.


Thank you,

Jure Jerman
Environmental Agency of Slovenia
_______________________________________________
SCore-users mailing list
SCore-users @ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users


