[SCore-users] Races inside SCore(?)

Shinji Sumimoto s-sumi at bd6.so-net.ne.jp
Thu Dec 5 00:03:25 JST 2002


Hi.

From: Richard Guenther <rguenth at tat.physik.uni-tuebingen.de>
Subject: [SCore-users] Races inside SCore(?)
Date: Tue, 3 Dec 2002 12:23:18 +0100 (CET)
Message-ID: <Pine.LNX.4.33.0212031214090.19216-100000 at bellatrix.tat.physik.uni-tuebingen.de>

rguenth> I experience problems using SCore (version 4.2.1 with 100MBit
rguenth> and 3.3.1 with Myrinet) in
rguenth> conjunction with the cheetah (v1.1.4) library used by POOMA.
rguenth> The problem appears if I use a nx2 processor setup and does
rguenth> not appear in nx1 mode. The problem is all processes spinning
rguenth> in kernel space (>90% system time) and no progress achieved
rguenth> anymore. Does this sound familiar to anyone?

rguenth> Now to elaborate some more. Cheetah presents sort of one-sided
rguenth> communication interface to the user and at certain points polls
rguenth> for messages with a construct like (very simplified)
rguenth> 
rguenth>  do {
rguenth>     MPI_Iprobe(MPI_ANY_SOURCE, tag, comm, &flag, &status);
rguenth>  } while (!flag);
rguenth> 
rguenth> now, if I insert a sched_yield() or a usleep(100) after the
rguenth> MPI_Iprobe(), the problem goes away (well, not completely, but
rguenth> it is a lot harder to reproduce). SCore usually does not
rguenth> detect any sort of deadlock, but occasionally it does.
rguenth> 
rguenth> Now the question, may this be a race condition somewhere in the
rguenth> SCore code that handles multiple processors on one node? Where
rguenth> should I start to look at to fix the problem?

We have not seen such a situation before. A race condition or a
scheduling problem may be occurring.  MPI_Iprobe only checks the
message queue. When a user program is in an infinite loop inside MPI,
SCore detects the loop as a deadlock. However, in your case the
program seems to be working normally.
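
For reference, the workaround Richard describes (yielding the CPU after
each unsuccessful probe) can be sketched roughly as below. This is only
a minimal sketch under stated assumptions, not Cheetah or SCore code:
poll_until_ready, probe_fn, countdown_probe and the max_polls limit are
hypothetical names introduced for illustration. In the real program the
callback would wrap MPI_Iprobe(MPI_ANY_SOURCE, tag, comm, &flag, &status)
and return flag, and the original loop has no upper bound.

```c
#include <assert.h>
#include <sched.h>   /* sched_yield (POSIX) */
#include <stddef.h>

/* Hypothetical callback type: returns nonzero once a message is pending.
 * In an MPI program it would wrap MPI_Iprobe and return its flag. */
typedef int (*probe_fn)(void *ctx);

/* Poll until probe() reports a message or max_polls probes were made.
 * Yields the CPU after every unsuccessful probe so that another process
 * on the same node gets a chance to run.
 * Returns the number of probes performed, or -1 if the limit was hit. */
long poll_until_ready(probe_fn probe, void *ctx, long max_polls)
{
    assert(probe != NULL);
    for (long n = 1; n <= max_polls; n++) {
        if (probe(ctx))
            return n;
        sched_yield();   /* assumption: usleep(100) could be used instead */
    }
    return -1;
}

/* Stand-in probe for demonstration only: fails while the counter is
 * positive (decrementing it each time), then succeeds. */
int countdown_probe(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return 0;
    }
    return 1;
}
```

As Richard reports, yielding only makes the hang harder to reproduce, so
a loop like this is a way to narrow the problem down, not a fix.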

When the deadlock occurs, are the other processes also running, or are
they sleeping?  Could you attach gdb to the program, using its process
number (PID), and test it?

For example:

% gdb your-program-binary process-number

Then examine the program using backtrace and single-step execution.

Shinji.

rguenth> Thanks for any hints,
rguenth>    Richard.
rguenth> 
rguenth> PS: please CC me, I'm not on the list.
rguenth> 
rguenth> --
rguenth> Richard Guenther <richard.guenther at uni-tuebingen.de>
rguenth> WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/
rguenth> 
rguenth> 
rguenth> _______________________________________________
rguenth> SCore-users mailing list
rguenth> SCore-users at pccluster.org
rguenth> http://www.pccluster.org/mailman/listinfo/score-users
rguenth> 
-----
Shinji Sumimoto    E-Mail: s-sumi at bd6.so-net.ne.jp



