[SCore-users-jp] Re: [SCore-users] Benchmarking 256 processor problem

Shinji Sumimoto s-sumi @ bd6.so-net.ne.jp
Fri Mar 21 12:51:09 JST 2003


Hi.

Thank you for the information.

Does the hang also occur with the mpi_zerocopy=on option?

If it does not, something is wrong in the message transfer at the
PM or MPI implementation level.

If the same problem occurs on a small cluster (e.g. 16 nodes),
please let us know, so that we can reproduce the situation and fix it.
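
For reference, here is a minimal, self-contained test program built around
the broadcast loop quoted below, which could be used to try to reproduce the
hang. This is only a sketch: the array sizes (chosen so that each broadcast
is roughly one kilobyte on 16 processes), the iteration count, and the
initialization are our assumptions, not the customer's actual code.

      PROGRAM BCASTLOOP
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER IMAX, JMAX
      PARAMETER (IMAX = 16, JMAX = 128)
      DOUBLE PRECISION U(IMAX, JMAX)
      INTEGER myrank, noprocs, error
      INTEGER iter, p, i, j
      INTEGER JSTART, JFINISH, npts

      CALL MPI_INIT(error)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, error)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, noprocs, error)

C     Fill the array so that every rank broadcasts defined data.
      DO 200 j = 1, JMAX
         DO 100 i = 1, IMAX
            U(i, j) = DBLE(myrank)
  100    CONTINUE
  200 CONTINUE

C     Repeat the broadcast loop many times to raise the chance of
C     reproducing the hang reported below.
      DO 400 iter = 1, 1000
         DO 300 p = 0, noprocs-1
            JSTART = p*JMAX/noprocs+1
            JFINISH = (p+1)*JMAX/noprocs
            npts = IMAX*(JFINISH-JSTART+1)
            CALL MPI_BCAST(U(1,JSTART), npts,
     :                  MPI_DOUBLE_PRECISION, p, MPI_COMM_WORLD,
     :                  error)
  300    CONTINUE
  400 CONTINUE

      IF (myrank .EQ. 0) PRINT *, 'bcast loop finished'
      CALL MPI_FINALIZE(error)
      END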

PS: The new version of MPICH, 1.2.5, includes a new implementation of
    mpi_bcast, so your problem may be solved. We will try to port it to
    SCore.

Shinji. 

From: Michael Rudgyard Streamline <michael @ streamline-computing.com>
Subject: [SCore-users] Benchmarking 256 processor problem
Date: Tue, 18 Mar 2003 21:30:59 +0000 (GMT)
Message-ID: <Pine.LNX.4.21.0303182035170.19595-100000 @ scgj.streamline>

michael> 
michael> We have a customer who has an interesting problem with MPI_Bcast,
michael> where SCore seems to hang on large numbers of processors, and in
michael> particular when there are 2 processes per node running. The code
michael> segment is provided below.
michael> 
michael> My understanding is that the MPICH (and hence the SCore) implementation of
michael> MPI_Bcast is globally asynchronous, and is built using MPI_Send. It is
michael> therefore possible that (in the example below, and in the worst case) the
michael> 256th processor may have yet to receive messages from all of the other
michael> (255) processors. I suspect that this may be problematic because there is a
michael> maximum number of message buffers that may be sent at a given time. I know
michael> this was the case on SGI and Cray systems, and I think this is the case
michael> with MPICH but can't find the corresponding environment variables on the
michael> MPICH web-site.
michael> 
michael> As far as I am aware, MPI_Send will block if the send cannot be buffered
michael> (so I assume this is the case for MPI_BCast), and given that MPI_BCast is
michael> called in the correct order for each processor (avoiding the well-known
michael> deadlock situations), I can't see why this code should necessarily cause
michael> the code to hang (???) other than there being potentially a lot of
michael> messages floating around... This leads me to believe that it must just be
michael> the number of outstanding messages that is the problem, although in that
michael> case shouldn't the corresponding MPI_BCast block at the sender's
michael> side? Could there be an issue in particular due to messages sent via
michael> shared memory (i.e. a performance vs. correctness issue)?
michael> 
michael> For info, each send is about a Kilobyte of information.
michael> 
michael> Note that by making the broadcast synchronous, i.e. by adding an MPI_Barrier,
michael> we solve the problem.
michael> 
michael> The machine is running SCore 5.0.1 with MPI 1.2.4 over Myrinet 2000
michael> (M3F-PCI64-B 2MB).
michael> 
michael> Thanks in advance,
michael> 
michael> Michael 
michael> 
michael> ----------------
michael> 
michael> The code ran fine on up to 128 processors when tested on one process per
michael> node.  It also ran fine on 2 processes per node on up to 32 nodes (i.e. 64
michael> processes).  However, when run on 64x2, the code would "stop" at
michael> differing points, normally within a minute of execution of an hour long
michael> job.  By "stop"  I mean the processes would remain at 100% CPU but no work
michael> was being done, as though a process was waiting for a message.
michael> 
michael> Reason
michael> ------
michael> 
michael> Our investigations this afternoon have led us to believe that it comes down
michael> to a loop of MPI_Bcasts:
michael>             DO 300 p = 0, noprocs-1
michael>                JSTART = p*JMAX/noprocs+1
michael>                JFINISH = (p+1)*JMAX/noprocs
michael>                npts = IMAX*(JFINISH-JSTART+1)
michael>                CALL MPI_Bcast(U(1,JSTART), npts,
michael>      :                     MPI_DOUBLE_PRECISION, p, MPI_COMM_WORLD,
michael>      :                     error)
michael>   300       CONTINUE
michael> This broadcast simply sends the next processor's chunk of the array to all
michael> the other processors.  An AllToAll would be similar; however, this was used
michael> to give better control over the number of messages being sent at any time.
michael> 
michael> However, it appears that this control is not sufficient on its own.  By adding
michael> an MPI_Barrier call after the MPI_Bcast, the "stopping" problem was not
michael> repeated in our tests.
michael> 
michael> 
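For reference, the MPI_Barrier workaround described above would look roughly
like the following. This is only a sketch of the pattern (reusing the variable
names from the quoted code segment), not the customer's actual code:

            DO 300 p = 0, noprocs-1
               JSTART = p*JMAX/noprocs+1
               JFINISH = (p+1)*JMAX/noprocs
               npts = IMAX*(JFINISH-JSTART+1)
               CALL MPI_BCAST(U(1,JSTART), npts,
     :                     MPI_DOUBLE_PRECISION, p, MPI_COMM_WORLD,
     :                     error)
C              Synchronize after each broadcast so that no rank runs
C              far ahead and floods the others with pending messages.
               CALL MPI_BARRIER(MPI_COMM_WORLD, error)
  300       CONTINUE
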
------
Shinji Sumimoto, Fujitsu Labs
_______________________________________________
SCore-users mailing list
SCore-users @ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users


