[SCore-users-jp] [SCore-users] Myrinet deadlock

Bogdan Costescu bogdan.costescu @ iwr.uni-heidelberg.de
Wed Mar 5 03:45:35 JST 2003


Dear SCore developers,

When trying to test SCore 5.4, I get what looks like a deadlock when 
using Myrinet; combining it with Shmem makes the deadlock appear sooner.

Setup:
The cluster is composed of 8 nodes, each with dual Athlon, 512 MB RAM and 
older Myrinet (LANai 4) cards. I kept the configuration files from an 
older (4.2.1) SCore installation which worked flawlessly for more than a 
year, so I believe that there are no errors in this part. I installed the 
kernel RPM provided in the distribution, but compiled all the 
user-level software locally.

The problem:
When trying to run a job that uses Myrinet, with or without Shmem 
(-nodes=8x2 or -nodes=8x1), the job locks up at random places. When running 
a job that uses Ethernet (either -nodes=8x1 or -nodes=8x2), the lockup does 
not occur, even if I put more load on the nodes, for example by starting 
several jobs at the same time on the same nodes.
When the job is in this state, it can sometimes (but not always) be 
interrupted with Ctrl-C (if it's still connected to the terminal). 
Sometimes not even pskill is able to get rid of it: the message indicating 
that the job is killed appears every time pskill is executed, but the job 
is still there. At some point SCoreD dies and is restarted by sc_watch.

Attaching gdb to the job in this state gives something like:

#0  0x082c2702 in shmemReceive ()
#1  0x082b2f4d in composite_attach_context ()
#2  0x0829a485 in MPID_SCORE_Recv_Message ()
#3  0x082999f6 in MPID_SCORE_PIwrecv ()
#4  0x08299754 in MPID_SCORE_PIbrecv ()
#5  0x0829e2b1 in MPID_CH_Check_incoming ()
#6  0x082948d7 in MPID_RecvComplete ()
#7  0x0828a1ff in PMPI_Waitall ()

or

#0  0x082c6080 in myriReceive ()
#1  0x0829a485 in MPID_SCORE_Recv_Message ()
#2  0x082999f6 in MPID_SCORE_PIwrecv ()
#3  0x08299754 in MPID_SCORE_PIbrecv ()
#4  0x0829e2b1 in MPID_CH_Check_incoming ()
#5  0x082948d7 in MPID_RecvComplete ()
#6  0x0828a1ff in PMPI_Waitall ()

from which I assume that this is a deadlock. However, the application that 
produced this (CHARMM) is very stable and worked flawlessly with older 
versions of SCore, so a deadlock caused by a programming error in the 
application can be ruled out.
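
To compare where the individual processes are stuck, a small helper along 
these lines could tally the innermost frame of saved gdb backtraces. This is 
only a sketch: the names parse_frames and summarize are made up for 
illustration, and it assumes frames in the form shown above 
("#N  0xADDR in function ()").

```python
import re
from collections import Counter

# Matches gdb frame lines such as "#0  0x082c2702 in shmemReceive ()".
# Assumption: every frame carries an address, as in the traces above;
# frames printed without an address are not matched by this pattern.
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+\s+in\s+(\w+)\s*\(")


def parse_frames(backtrace_text):
    """Return the function names of one backtrace, innermost frame first."""
    frames = []
    for line in backtrace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.group(2))
    return frames


def summarize(backtraces):
    """Count how many backtraces have each function as innermost frame."""
    counts = Counter()
    for text in backtraces:
        frames = parse_frames(text)
        if frames:
            counts[frames[0]] += 1
    return counts


if __name__ == "__main__":
    # Shortened copies of the two backtraces quoted in this message.
    shmem_bt = (
        "#0  0x082c2702 in shmemReceive ()\n"
        "#1  0x082b2f4d in composite_attach_context ()\n"
    )
    myri_bt = (
        "#0  0x082c6080 in myriReceive ()\n"
        "#1  0x0829a485 in MPID_SCORE_Recv_Message ()\n"
    )
    for name, count in summarize([shmem_bt, myri_bt]).items():
        print(name, count)
```

If every process turns out to be waiting in shmemReceive or myriReceive 
under PMPI_Waitall, that would be consistent with a message that was never 
delivered rather than with one process having crashed.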

The /proc/pm/myrinet/0/info file on all nodes indicates 0; we never had 
any problems with these cards with older SCore versions.

Do you have any idea about what is going on?

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu @ IWR.Uni-Heidelberg.De


_______________________________________________
SCore-users mailing list
SCore-users @ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users


