[SCore-users] Races inside SCore(?)

Shinji Sumimoto s-sumi at bd6.so-net.ne.jp
Sun Dec 8 21:00:36 JST 2002


Hi.

Have you tried to execute the program with the mpi_zerocopy=on option?
If it works, we will have to find some workaround for MPI_Iprobe.
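
For reference, below is a minimal sketch of the polling workaround discussed
in the quoted mail, i.e. yielding the CPU between MPI_Iprobe calls. The
function name and tag handling are only illustrative and are not taken from
Cheetah or POOMA:

  #include <mpi.h>
  #include <sched.h>

  /* Poll for an incoming message from any source, yielding the CPU
   * between probes so that two processes sharing one node do not
   * busy-spin against each other in kernel space. */
  static void wait_for_message(int tag, MPI_Comm comm, MPI_Status *status)
  {
      int flag = 0;
      do {
          MPI_Iprobe(MPI_ANY_SOURCE, tag, comm, &flag, status);
          if (!flag)
              sched_yield();   /* or usleep(100) for a longer backoff */
      } while (!flag);
  }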

Shinji.

From: Shinji Sumimoto <s-sumi at bd6.so-net.ne.jp>
Subject: Re: [SCore-users] Races inside SCore(?)
Date: Thu, 05 Dec 2002 00:03:25 +0900 (JST)
Message-ID: <20021205.000325.74756171.s-sumi at bd6.so-net.ne.jp>

s-sumi> Hi.
s-sumi> 
s-sumi> From: Richard Guenther <rguenth at tat.physik.uni-tuebingen.de>
s-sumi> Subject: [SCore-users] Races inside SCore(?)
s-sumi> Date: Tue, 3 Dec 2002 12:23:18 +0100 (CET)
s-sumi> Message-ID: <Pine.LNX.4.33.0212031214090.19216-100000 at bellatrix.tat.physik.uni-tuebingen.de>
s-sumi> 
s-sumi> rguenth> I experience problems using SCore (version 4.2.1 with 100MBit
s-sumi> rguenth> and 3.3.1 with Myrinet) in
s-sumi> rguenth> conjunction with the cheetah (v1.1.4) library used by POOMA.
s-sumi> rguenth> The problem appears if I use an nx2 processor setup and does
s-sumi> rguenth> not appear in nx1 mode. The symptom is all processes spinning
s-sumi> rguenth> in kernel space (>90% system time) with no progress achieved
s-sumi> rguenth> anymore. Does this sound familiar to anyone?
s-sumi> 
s-sumi> rguenth> Now to elaborate some more. Cheetah presents a sort of one-sided
s-sumi> rguenth> communication interface to the user and at certain points polls
s-sumi> rguenth> for messages with a construct like (very simplified)
s-sumi> rguenth> 
s-sumi> rguenth>  do {
s-sumi> rguenth>     MPI_Iprobe(MPI_ANY_SOURCE, tag, comm, &flag, &status);
s-sumi> rguenth>  } while (!flag);
s-sumi> rguenth> 
s-sumi> rguenth> Now, if I insert a sched_yield() or a usleep(100) after the
s-sumi> rguenth> MPI_Iprobe(), the problem goes away (well, not completely, but
s-sumi> rguenth> it is a lot harder to reproduce). SCore usually does not
s-sumi> rguenth> detect any sort of deadlock, but occasionally it does.
s-sumi> rguenth> 
s-sumi> rguenth> Now the question: could this be a race condition somewhere in the
s-sumi> rguenth> SCore code that handles multiple processors on one node? Where
s-sumi> rguenth> should I start looking in order to fix the problem?
s-sumi> 
s-sumi> We have not seen such a situation. Some race condition or
s-sumi> scheduling problem may be occurring. MPI_Iprobe only checks the
s-sumi> message queue. When a user program is in an infinite MPI loop, SCore
s-sumi> detects the loop as a deadlock. However, in your case, the program
s-sumi> seems to be working well.
s-sumi> 
s-sumi> When the deadlock has occurred, is the other process also running, or
s-sumi> is it sleeping? Could you attach gdb to the program using its process
s-sumi> number and test it?
s-sumi> 
s-sumi> such as
s-sumi> 
s-sumi> % gdb your-program-binary process-number
s-sumi> 
s-sumi> and examine the program using backtrace and step execution.
s-sumi> 
s-sumi> Shinji.
s-sumi> 
s-sumi> rguenth> Thanks for any hints,
s-sumi> rguenth>    Richard.
s-sumi> rguenth> 
s-sumi> rguenth> PS: please CC me, I'm not on the list.
s-sumi> rguenth> 
s-sumi> rguenth> --
s-sumi> rguenth> Richard Guenther <richard.guenther at uni-tuebingen.de>
s-sumi> rguenth> WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/
s-sumi> rguenth> 
s-sumi> rguenth> 
s-sumi> rguenth> _______________________________________________
s-sumi> rguenth> SCore-users mailing list
s-sumi> rguenth> SCore-users at pccluster.org
s-sumi> rguenth> http://www.pccluster.org/mailman/listinfo/score-users
s-sumi> rguenth> 
s-sumi> -----
s-sumi> Shinji Sumimoto    E-Mail: s-sumi at bd6.so-net.ne.jp
s-sumi> 
s-sumi> _______________________________________________
s-sumi> SCore-users mailing list
s-sumi> SCore-users at pccluster.org
s-sumi> http://www.pccluster.org/mailman/listinfo/score-users
s-sumi> 
-----
Shinji Sumimoto    E-Mail: s-sumi at bd6.so-net.ne.jp



