[SCore-users-jp] Re: [SCore-users] Races inside SCore(?)

Shinji Sumimoto s-sumi at bd6.so-net.ne.jp
Tue Dec 10 01:36:53 JST 2002


Hi.

Sorry, now I understand your environment.

Both of your clusters, the Myrinet one and the Ethernet one, have problems.

I will also investigate your problems.

Shinji.

From: Richard Guenther <rguenth at tat.physik.uni-tuebingen.de>
Subject: [SCore-users-jp] Re: [SCore-users] Races inside SCore(?)
Date: Mon, 9 Dec 2002 17:28:04 +0100 (CET)
Message-ID: <Pine.LNX.4.33.0212091721050.10722-100000 at bellatrix.tat.physik.uni-tuebingen.de>

rguenth> On Tue, 10 Dec 2002, Shinji Sumimoto wrote:
rguenth> 
rguenth> > rguenth> Maybe related, I get (sometimes, cannot easily check if it's correlated)
rguenth> > rguenth> the following messages from the kernel:
rguenth> > rguenth> pmm_mem_read copy failed 0x1 (tsk=f2674000, addr=0x261adc30,
rguenth> > rguenth> ctx=0xf76c0000) err 0x0
rguenth> > rguenth> pmm_mem_read copy failed 0x0 (tsk=f3340000, addr=0x261e5dd0,
rguenth> > rguenth> ctx=0xf7700000) err 0x0
rguenth> >
rguenth> > Which network are you using now?
rguenth> 
rguenth> This is with two EEPRO 100 cards. Note that this is a different cluster
rguenth> that doesn't have Myrinet.
rguenth> 
rguenth> > This message
rguenth> >
rguenth> > pmm_mem_read copy failed 0x1 (tsk=f2674000, addr=0x261adc30,
rguenth> > ctx=0xf76c0000) err 0x0
rguenth> >
rguenth> > seems to be output from PM/Ethernet with mpi_zerocopy=on. Is this true?
rguenth> 
rguenth> It's from PM/Ethernet with mpi_zerocopy (which should be equivalent to
rguenth> mpi_zerocopy=on?).
rguenth> 
rguenth> > The message says PM/Ethernet failed to read data from user memory.
rguenth> 
rguenth> Can this cause problems?
rguenth> 
rguenth> > In the PM/Ethernet case, SCore 4.2 has a problem with mpi_zerocopy=on,
rguenth> > and it has sometimes occurred on newer versions of SCore as well. I am
rguenth> > now re-writing that feature.
rguenth> >
rguenth> > How about PM/Myrinet case?
rguenth> 
rguenth> I don't see the above kernel messages in the Myrinet case. But the main
rguenth> problem (the lockup with the x2 setup) is there with Myrinet, too.
rguenth> 
rguenth> > scrun -network=myrinet,,,,,
rguenth> >
rguenth> > rguenth> To answer your question, we use mpi_zerocopy as an argument to scrun;
rguenth> > rguenth> exchanging that for mpi_zerocopy=on doesn't change things, but specifying
rguenth> > rguenth> mpi_zerocopy=off leads to immediate failure (but again, the failure occurs
rguenth> > rguenth> only with Nx2 setups, not with Nx1, and 1x2 also seems to be OK).
rguenth> > rguenth>
rguenth> > rguenth> Is there an option in SCore to allow tracing/logging of the MPI functions
rguenth> > rguenth> called? Maybe we can see a pattern in the problem.
rguenth> >
rguenth> > MPICH/SCore also has a -mpi_log option, like MPICH. But if the program
rguenth> > runs for a couple of minutes, the logs become huge and difficult to analyze.
rguenth> 
rguenth> I can reproduce the failure within seconds, so this is probably not a
rguenth> problem. I will experiment with this later this week.
rguenth> 
rguenth> > rguenth> If I change the computation/communication order in my program I cannot
rguenth> > rguenth> reproduce the failures. But in that mode requests never accumulate, so
rguenth> > rguenth> it's just a lot less hammering on the MPI backend.
rguenth> >
rguenth> > Could you tell us more about your runtime environment?
rguenth> > PM/Myrinet problems or PM/Ethernet problems?
rguenth> 
rguenth> It's both Myrinet and Ethernet problems (the Myrinet one is on a dual PIII
rguenth> cluster with SCore V3.3.1 running kernel 2.2.16, the Ethernet one on a dual
rguenth> Athlon cluster with SCore V4.2.1 running kernel 2.4.18-rc4).
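
If the -mpi_log output turns out to be too large to analyze, one alternative is
the standard MPI profiling interface (PMPI): a small wrapper file linked into
the application can log just the calls of interest. The sketch below is only an
illustration of that idea; the choice of wrapped calls and the output format
are my own assumptions, not part of MPICH/SCore itself.

  /* mpi_trace.c: a minimal, hypothetical PMPI wrapper.  Compile and link it
   * together with the application; the MPI library's real MPI_Isend and
   * MPI_Waitall are still reached through the PMPI_ entry points. */
  #include <mpi.h>
  #include <stdio.h>

  int MPI_Isend(void *buf, int count, MPI_Datatype type, int dest,
                int tag, MPI_Comm comm, MPI_Request *req)
  {
      int rank;
      PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
      fprintf(stderr, "[%d] MPI_Isend count=%d dest=%d tag=%d\n",
              rank, count, dest, tag);
      return PMPI_Isend(buf, count, type, dest, tag, comm, req);
  }

  int MPI_Waitall(int n, MPI_Request reqs[], MPI_Status stats[])
  {
      int rank;
      PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
      fprintf(stderr, "[%d] MPI_Waitall n=%d\n", rank, n);
      return PMPI_Waitall(n, reqs, stats);
  }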

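Also, since the lockup seems to depend on many outstanding requests, a small
stand-alone test that accumulates non-blocking requests before a single
MPI_Waitall might reproduce it independently of your application. The following
is only a rough sketch of that pattern; the message count, message size, and
the ring exchange are arbitrary assumptions, not taken from your program.

  /* nx2_test.c: hypothetical reproducer sketch -- post many MPI_Irecv/MPI_Isend
   * requests to ring neighbours and complete them with one MPI_Waitall, i.e.
   * the "requests accumulate" mode described above. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define NMSG 64      /* outstanding messages per direction (assumption) */
  #define LEN  16384   /* ints per message (assumption) */

  int main(int argc, char **argv)
  {
      int rank, size, i, iter, left, right;
      int *sbuf, *rbuf;
      MPI_Request req[2 * NMSG];
      MPI_Status  stat[2 * NMSG];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      right = (rank + 1) % size;
      left  = (rank - 1 + size) % size;

      sbuf = (int *) malloc(NMSG * LEN * sizeof(int));
      rbuf = (int *) malloc(NMSG * LEN * sizeof(int));
      for (i = 0; i < NMSG * LEN; i++)
          sbuf[i] = i;

      for (iter = 0; iter < 1000; iter++) {
          /* post everything first so the requests pile up ... */
          for (i = 0; i < NMSG; i++)
              MPI_Irecv(rbuf + i * LEN, LEN, MPI_INT, left, i,
                        MPI_COMM_WORLD, &req[i]);
          for (i = 0; i < NMSG; i++)
              MPI_Isend(sbuf + i * LEN, LEN, MPI_INT, right, i,
                        MPI_COMM_WORLD, &req[NMSG + i]);
          /* ... then complete them all in one go */
          MPI_Waitall(2 * NMSG, req, stat);
      }

      if (rank == 0)
          printf("done\n");
      free(sbuf);
      free(rbuf);
      MPI_Finalize();
      return 0;
  }
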
-----
Shinji Sumimoto    E-Mail: s-sumi at bd6.so-net.ne.jp


