[SCore-users] Races inside SCore(?)

Richard Guenther rguenth at tat.physik.uni-tuebingen.de
Tue Dec 10 01:28:04 JST 2002


On Tue, 10 Dec 2002, Shinji Sumimoto wrote:

> rguenth> Maybe related, I get (sometimes, cannot easily check if it's correlated)
> rguenth> the following messages from kernel:
> rguenth> pmm_mem_read copy failed 0x1 (tsk=f2674000, addr=0x261adc30,
> rguenth> ctx=0xf76c0000) err 0x0
> rguenth> pmm_mem_read copy failed 0x0 (tsk=f3340000, addr=0x261e5dd0,
> rguenth> ctx=0xf7700000) err 0x0
>
> Which network are you using now?

This is with two EEPRO 100 cards. Note that this is a different cluster
that doesn't have Myrinet.

> This message
>
> pmm_mem_read copy failed 0x1 (tsk=f2674000, addr=0x261adc30,
> ctx=0xf76c0000) err 0x0
>
> seems to be output from PM/Ethernet with mpi_zerocopy=on. Is this true?

It's from PM/Ethernet with mpi_zerocopy (which I assume is equivalent to
mpi_zerocopy=on?).
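
For reference, the invocation is roughly the following (node layout and
program name are placeholders, and the exact option syntax is written
from memory, so treat it as approximate):

  scrun -nodes=4x2,mpi_zerocopy ./mpi_program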

> The message says PM/Ethernet failed to read data from user memory.

Can this cause problems?

> In the PM/Ethernet case, SCore 4.2 has a problem with mpi_zerocopy=on,
> and it sometimes occurs on newer versions of SCore as well. Now, I am
> rewriting the feature.
>
> How about PM/Myrinet case?

I don't see the above kernel messages in the Myrinet case. But the main
problem (the lockup with Nx2 setups) is there with Myrinet, too.

> scrun -network=myrinet,,,,,
>
> rguenth> To answer your question, we use mpi_zerocopy as argument to scrun,
> rguenth> exchanging that for mpi_zerocopy=on doesn't change things, but specifying
> rguenth> mpi_zerocopy=off leads to immediate failure (but again, the failure is
> rguenth> only with Nx2 setups, not with Nx1, also 1x2 seems to be ok).
> rguenth>
> rguenth> Is there an option in SCore to allow tracing/logging of the MPI functions
> rguenth> called? Maybe we can see a pattern to the problem.
>
> MPICH/SCore also has a -mpi_log option like MPICH.  But if the program
> runs for a couple of minutes, the logs become huge and difficult to
> analyze.

I can reproduce the failure within seconds, so this is probably not a
problem. I will experiment with this later this week.
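
(I am not sure yet whether -mpi_log goes to scrun or to the program
itself, so I will try both variants, along the lines of:

  scrun -nodes=4x2,mpi_log ./mpi_program
  scrun -nodes=4x2 ./mpi_program -mpi_log

where the node layout and ./mpi_program are placeholders for our setup.)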

> rguenth> If I change the computation/communication order in my program, I cannot
> rguenth> reproduce the failures. But in this mode requests never accumulate, so
> rguenth> it's just a lot less hammering on the MPI backend.
>
> Could you let us know your runtime environment more?
> PM/Myrinet problems or PM/Ethernet problems?

It's both: Myrinet and Ethernet problems (the Myrinet is on a dual PIII
cluster with SCore V3.3.1 running kernel 2.2.16, the Ethernet on a dual
Athlon cluster with SCore V4.2.1 running kernel 2.4.18-rc4).
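
To make concrete what I meant above by changing the computation/
communication order, here is a minimal sketch (my illustration, not our
actual code; message counts and sizes are made up).  Pattern A lets
outstanding requests accumulate and is where we see the Nx2 lockup;
pattern B completes each exchange immediately and does not fail for us:

  #include <mpi.h>

  #define NREQ   64
  #define NBYTES 4096

  static char sbuf[NREQ][NBYTES], rbuf[NREQ][NBYTES];

  /* Pattern A: post all nonblocking operations up front and wait for
     them at the end; outstanding requests accumulate on the backend. */
  static void exchange_batched(int peer)
  {
      MPI_Request req[2 * NREQ];
      MPI_Status  st[2 * NREQ];
      int i;

      for (i = 0; i < NREQ; i++) {
          MPI_Irecv(rbuf[i], NBYTES, MPI_BYTE, peer, i, MPI_COMM_WORLD,
                    &req[2 * i]);
          MPI_Isend(sbuf[i], NBYTES, MPI_BYTE, peer, i, MPI_COMM_WORLD,
                    &req[2 * i + 1]);
      }
      MPI_Waitall(2 * NREQ, req, st);
  }

  /* Pattern B: complete each exchange before posting the next, with
     computation in between; requests never accumulate. */
  static void exchange_interleaved(int peer)
  {
      MPI_Request req[2];
      MPI_Status  st[2];
      int i;

      for (i = 0; i < NREQ; i++) {
          MPI_Irecv(rbuf[i], NBYTES, MPI_BYTE, peer, i, MPI_COMM_WORLD,
                    &req[0]);
          MPI_Isend(sbuf[i], NBYTES, MPI_BYTE, peer, i, MPI_COMM_WORLD,
                    &req[1]);
          MPI_Waitall(2, req, st);
          /* ... computation on rbuf[i] would go here ... */
      }
  }

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      if (size % 2 == 0) {            /* pair rank 2k with rank 2k+1 */
          exchange_batched(rank ^ 1);
          exchange_interleaved(rank ^ 1);
      }
      MPI_Finalize();
      return 0;
  }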

Richard.

--
Richard Guenther <richard.guenther at uni-tuebingen.de>
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/
