[SCore-users-jp] [SCore-users] Benchmarking 256 processor problem

Michael Rudgyard Streamline michael @ streamline-computing.com
Wed, 19 Mar 2003 06:30:59 JST


We have a customer with an interesting problem with MPI_Bcast, where
SCore seems to hang on large numbers of processors, and in particular
when there are 2 processes running per node. The code segment is
provided below.

My understanding is that the MPICH (and hence the SCore) implementation of
MPI_Bcast is globally asynchronous, and is built using MPI_Send. It is
therefore possible (in the example below, and in the worst case) that the
256th processor may have yet to receive messages from all of the other
255 processors. I suspect that this may be problematic because there is a
maximum number of message buffers that may be outstanding at a given time.
I know this was the case on SGI and Cray systems, and I think it is also
the case with MPICH, but I can't find the corresponding environment
variables on the MPICH website.
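
To make the buffering concern concrete, here is a minimal sketch of a
broadcast built purely on point-to-point calls. This is an illustration
only, not the actual SCore/MPICH code (which uses a tree of sends), and
the subroutine FLAT_BCAST is hypothetical; the point is that for ~1 KB
payloads the root's sends are normally buffered eagerly and return at
once, so the root can leave the broadcast long before the slowest
receiver has posted its MPI_Recv:

      SUBROUTINE FLAT_BCAST(BUF, N, ROOT, COMM)
C     Illustration only: a "flat" broadcast built from MPI_Send and
C     MPI_Recv.  Small sends normally complete as soon as the data
C     is buffered (eager protocol), so the root returns without
C     waiting for the receivers to catch up.
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER N, ROOT, COMM
      DOUBLE PRECISION BUF(N)
      INTEGER MYRANK, NPROCS, P, ERROR
      INTEGER STATUS(MPI_STATUS_SIZE)

      CALL MPI_Comm_rank(COMM, MYRANK, ERROR)
      CALL MPI_Comm_size(COMM, NPROCS, ERROR)

      IF (MYRANK .EQ. ROOT) THEN
         DO 100 P = 0, NPROCS-1
            IF (P .NE. ROOT) THEN
               CALL MPI_Send(BUF, N, MPI_DOUBLE_PRECISION, P, 0,
     :                       COMM, ERROR)
            END IF
  100    CONTINUE
      ELSE
         CALL MPI_Recv(BUF, N, MPI_DOUBLE_PRECISION, ROOT, 0,
     :                 COMM, STATUS, ERROR)
      END IF
      END

With 256 ranks each acting as root once, an implementation along these
lines can leave a large number of eager messages queued at ranks that are
still several iterations behind.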

As far as I am aware, MPI_Send will block if the send cannot be buffered
(so I assume the same is true for MPI_Bcast), and given that MPI_Bcast is
called in the same order on every processor (avoiding the well-known
deadlock situations), I can't see why this should necessarily cause the
code to hang, other than there being potentially a lot of messages
floating around... This leads me to believe that it must just be the
number of outstanding messages that is the problem, although in that case
shouldn't the corresponding MPI_Bcast block on the sender's side? Could
there be an issue specifically with messages sent via shared memory
(i.e. a performance vs. correctness issue)?
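
For reference, the "well-known deadlock situations" are of the kind
sketched below. This is a hypothetical two-rank example, NOT the
customer's code (in their loop every rank uses the same root order):
calling the broadcasts in a different order on different ranks matches
them with inconsistent roots, which is erroneous MPI and can hang or
corrupt data.

      PROGRAM MISMATCH
C     Hypothetical sketch of mismatched broadcast ordering.  Rank 0
C     enters the broadcast rooted at 0 first, rank 1 enters the one
C     rooted at 1 first, so the collectives are matched with
C     inconsistent roots.
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER N, MYRANK, ERROR
      PARAMETER (N = 128)
      DOUBLE PRECISION A(N), B(N)

      CALL MPI_Init(ERROR)
      CALL MPI_Comm_rank(MPI_COMM_WORLD, MYRANK, ERROR)

      IF (MYRANK .EQ. 0) THEN
         CALL MPI_Bcast(A, N, MPI_DOUBLE_PRECISION, 0,
     :                  MPI_COMM_WORLD, ERROR)
         CALL MPI_Bcast(B, N, MPI_DOUBLE_PRECISION, 1,
     :                  MPI_COMM_WORLD, ERROR)
      ELSE
         CALL MPI_Bcast(B, N, MPI_DOUBLE_PRECISION, 1,
     :                  MPI_COMM_WORLD, ERROR)
         CALL MPI_Bcast(A, N, MPI_DOUBLE_PRECISION, 0,
     :                  MPI_COMM_WORLD, ERROR)
      END IF

      CALL MPI_Finalize(ERROR)
      END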

For info, each send is about a kilobyte of data.

Note that by making the broadcast synchronous, i.e. by adding an
MPI_Barrier, we solve the problem.

The machine is running SCore 5.0.1 with MPI 1.2.4 over Myrinet 2000
(M3F-PCI64-B, 2 MB).

Thanks in advance,

Michael 

----------------

The code ran fine on up to 128 processors when tested with one process per
node.  It also ran fine with 2 processes per node on up to 32 nodes (i.e.
64 processes).  However, when run on 64x2, the code would "stop" at
differing points, normally within a minute of execution of an hour-long
job.  By "stop" I mean the processes would remain at 100% CPU but no work
was being done, as though a process was waiting for a message.

Reason
------

Our investigations this afternoon have led us to believe that it comes
down to a loop of MPI_Bcasts:
            DO 300 p = 0, noprocs-1
               JSTART = p*JMAX/noprocs+1
               JFINISH = (p+1)*JMAX/noprocs
               npts = IMAX*(JFINISH-JSTART+1)
               CALL MPI_Bcast(U(1,JSTART), npts,
     :                     MPI_DOUBLE_PRECISION, p, MPI_COMM_WORLD,
     :                     error)
  300       CONTINUE
This broadcast simply sends the next processor's chunk of the array to all
the other processors.  An all-to-all would be similar; however, this loop
was used to give better control over the number of messages being sent at
any time.
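
For comparison only (this is not part of the original code), the same
exchange could be written as a single collective.  A sketch using
MPI_Allgatherv is given below; the subroutine and argument names are
hypothetical.  The copy of the local block into TMP is needed because
MPI-1, as implemented by MPI 1.2.4 here, has no MPI_IN_PLACE and the send
and receive buffers must not overlap; Fortran 90 automatic arrays are
used for brevity.

      SUBROUTINE EXCHANGE_BLOCKS(U, IMAX, JMAX, NOPROCS, MYRANK)
C     Sketch: the block exchange expressed as one collective call.
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER IMAX, JMAX, NOPROCS, MYRANK
      DOUBLE PRECISION U(IMAX, JMAX)
      INTEGER RCOUNTS(NOPROCS), DISPLS(NOPROCS)
      DOUBLE PRECISION TMP(IMAX*(JMAX/NOPROCS+1))
      INTEGER P, I, JSTART, JFINISH, MYCOUNT, ERROR

C     Each rank owns columns JSTART..JFINISH of U, as above.
      DO 100 P = 0, NOPROCS-1
         JSTART  = P*JMAX/NOPROCS + 1
         JFINISH = (P+1)*JMAX/NOPROCS
         RCOUNTS(P+1) = IMAX*(JFINISH-JSTART+1)
         DISPLS(P+1)  = IMAX*(JSTART-1)
  100 CONTINUE

C     Copy this rank's own (contiguous) block into a send buffer.
      MYCOUNT = RCOUNTS(MYRANK+1)
      JSTART  = MYRANK*JMAX/NOPROCS + 1
      DO 200 I = 1, MYCOUNT
         TMP(I) = U(MOD(I-1,IMAX)+1, JSTART+(I-1)/IMAX)
  200 CONTINUE

      CALL MPI_Allgatherv(TMP, MYCOUNT, MPI_DOUBLE_PRECISION,
     :                    U, RCOUNTS, DISPLS, MPI_DOUBLE_PRECISION,
     :                    MPI_COMM_WORLD, ERROR)
      END

Functionally this performs the same exchange as the broadcast loop;
whether it behaves any better under SCore/Myrinet would of course need
testing.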

However, it appears that this control isn't achieved in practice.  By
adding an MPI_Barrier call after the MPI_Bcast, the "stopping" problem was
not repeated in our tests.
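
For reference, the workaround presumably amounts to something like the
following (the barrier after each broadcast keeps the ranks in lock-step,
so only one broadcast's messages are in flight at a time, at some cost in
synchronisation):

            DO 300 p = 0, noprocs-1
               JSTART = p*JMAX/noprocs+1
               JFINISH = (p+1)*JMAX/noprocs
               npts = IMAX*(JFINISH-JSTART+1)
               CALL MPI_Bcast(U(1,JSTART), npts,
     :                     MPI_DOUBLE_PRECISION, p, MPI_COMM_WORLD,
     :                     error)
C              Added barrier: wait for all ranks each iteration.
               CALL MPI_Barrier(MPI_COMM_WORLD, error)
  300       CONTINUE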


_______________________________________________
SCore-users mailing list
SCore-users @ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users


