[SCore-users] Bug found!?

Shinji Sumimoto s-sumi at flab.fujitsu.co.jp
Fri Sep 27 11:27:15 JST 2002


Hi.

Thank you for giving us your test results.

From: Amik St-Cyr CFD Lab <amik at cfdlab.mcgill.ca>
Subject: Re: [SCore-users] Bug found!?
Date: 26 Sep 2002 11:15:28 -0400
Message-ID: <1033053328.1230.28.camel at stan.cfdlab.mcgill.ca>

amik> 
amik> On Thu, 2002-09-26 at 10:48, Shinji Sumimoto wrote:
amik> > Hi.
amik> > 
amik> > Thank you very much for sending the test results.
amik> > Could you test following sequences?
amik> > 
amik> > 1) communication tests:
amik> > 
amik> > recv$ rpmtest cn1 myrinet2k -sink
amik> > recv$ rpmtest cn2 myrinet2k -dest 0 -burst -len 8192
amik> 
amik> Shell A:
amik> | amik at stokes 11:14:01 sbin> ./rpmtest cn1 myrinet2k -sink
amik> 
amik> Shell B:
amik> | amik at stokes 11:15:03 sbin> ./rpmtest cn2 myrinet2k -dest 0 -burst -len
amik> 8192
amik> 8192	7.40536e+07

This result is too slow for myrinet2k.
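
As a rough check of that figure, here is a minimal sketch in C. It assumes
that the second column printed by rpmtest -burst is bytes per second (the
thread does not state the unit) and takes 2 Gbit/s as the nominal link rate
of Myrinet-2000. The function name is made up for this illustration.

```c
#include <assert.h>

/* Percentage of a link's nominal bit rate that a measured byte rate uses.
 * Assumption: bytes_per_sec is the second column of the rpmtest output. */
static double percent_of_link(double bytes_per_sec, double link_bits_per_sec)
{
    assert(link_bits_per_sec > 0.0);
    return 100.0 * (bytes_per_sec * 8.0) / link_bits_per_sec;
}
```

Under those assumptions, percent_of_link(7.40536e+07, 2.0e9) is about 29.6,
that is, less than a third of the nominal link rate.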

amik> 
amik> 

I understand your problem now.

It occurs only with the Athlon 760MP-X chipset, not with the 760MP.

Could you apply the following patches to the SCore source and rebuild
all of the SCore packages? If you cannot rebuild, please wait for the
next release (maybe November).

The following URL shows how to rebuild SCore from source:
http://www.pccluster.org/score/dist/score/html/en/installation/index.html

Shinji.

# cd /opt/score/score-src/SCore/pm2/arch/myrinet2k/lib/

and apply the following patch:
=====================================================================
Index: pm_myri2k_rma.c
===================================================================
RCS file: /develop/cvsroot/score-src/SCore/pm2/arch/myrinet2k/lib/pm_myri2k_rma.c,v
retrieving revision 1.7
retrieving revision 1.8
diff -u -r1.7 -r1.8
--- pm_myri2k_rma.c	8 Aug 2002 11:24:35 -0000	1.7
+++ pm_myri2k_rma.c	27 Sep 2002 02:11:36 -0000	1.8
@@ -430,6 +430,15 @@
 	mp->loc_addr = htonl(PM_HANDLE_COMPRESS(PM_HANDLE_ADDR(loc_hndl)));
 	mp->loc_proc = htonl(PM_HANDLE_PROC(loc_hndl));
 	mp->rmt_proc = htonl(PM_HANDLE_PROC(rmt_hndl));
+	mb();
+	{
+	  u_int loc_addr1 = htonl(PM_HANDLE_COMPRESS(PM_HANDLE_ADDR(loc_hndl)));
+
+	  while (mp->loc_addr != loc_addr1) {
+	    mp->loc_addr = loc_addr1;
+	    mb();
+	  }
+	}
 	mp->length = htonl(MSG_LENGTH_ENCODE(MSG_VREAD_REQUEST, length));
 #ifdef __alpha__
 	mb();
Index: pm_myri2k_sys.c
===================================================================
RCS file: /develop/cvsroot/score-src/SCore/pm2/arch/myrinet2k/lib/pm_myri2k_sys.c,v
retrieving revision 1.8
retrieving revision 1.9
@@ -638,6 +638,9 @@
 	memset((caddr_t)sc, 0, CTX_SHARED_INT_SIZE);
 	memset((caddr_t)sc->sc_send_buffer, 0, SEND_BUF_LENGTH);
 	memset((caddr_t)sc->sc_recv_buffer, 0, RECV_BUF_LENGTH);
+
+	/* ad-hoc fix for Athlon MP */
+	system("ps >/dev/null");
 
 	if ((error = myriInitSharedContextPhys(md, mc)) != PM_SUCCESS) {
 		pmDebug(PM_ERROR,

=====================================================================
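
The loop added to pm_myri2k_rma.c stores loc_addr, issues mb(), reads the
field back, and stores it again until the value read matches. The same
pattern can be tried outside SCore with the minimal sketch below. It is
only an illustration: store_until_visible() and full_barrier() are
hypothetical names, and the C11 atomic_thread_fence() call stands in for
the mb() macro used in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Full memory barrier; a stand-in for mb() in the patch (assumption). */
static void full_barrier(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

/* Store value into *slot, then read it back and store again until the
 * read matches. Returns the number of stores that were issued. */
static unsigned store_until_visible(volatile uint32_t *slot, uint32_t value)
{
    unsigned stores = 1;

    assert(slot != NULL);
    *slot = value;
    full_barrier();
    while (*slot != value) {   /* read back; retry if the store was lost */
        *slot = value;
        full_barrier();
        stores++;
    }
    return stores;
}
```

On ordinary memory the first store is always read back correctly, so the
function returns 1. A larger return value would mean that a store had to be
repeated, which is the case the patch works around.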


amik> > 
amik> > recv$ rpmtest cn1 myrinet2k -vreply
amik> > recv$ rpmtest cn2 myrinet2k -dest 0 -vwrite -len 524288 -iter 10000
amik> > recv$ rpmtest cn2 myrinet2k -dest 0 -vread -len 524288 -iter 10000
amik> > 
amik> 
amik> Shell A:
amik> | amik at stokes 11:16:33 sbin> ./rpmtest cn1 myrinet2k -vreply
amik> 
amik> Shell B:
amik> 
amik> (VWRITE)
amik> | amik at stokes 11:17:01 sbin> ./rpmtest cn2 myrinet2k -vwrite -len 524288
amik> -iter 10000
amik> [1] Send Error: myriWrite: Invalid argument(22)(30016): 00000001
amik> 00000000 00000000 000011cb 083f9000 00080000 00000000 00000000 
amik> [1] chan=0, crc=0, unknown=0, nres=0, arep=0
amik> [1] recv: recv=0, ack=0, nack=0, write=0, wack=0, read=0, rreply=0,
amik> discard=0
amik> [1] recv: put_addr=32, get_addr=32
amik> [1] recv: error=0(0), data=0:0(0) 1:0(0) 2:0(0) 3:0(0) 4:0(0) 5:0(0)
amik> 6:0(0) 7:0(0) 
amik> [1] send: send=0, ack=0, nack=0, write=0, wack=0, read=0, rreply=0,
amik> resend=0,0
amik> [1] send: request=0, disable=ffffffff, deactivated=ffffffff,
amik> last_write=(0, 0), last_read=(0, 0)
amik> [1] send: putp=5, getp=1, relp=1
amik> [1] send: error=0(0), data=0:1(1) 1:0(0) 2:0(0) 3:4555(11cb)
amik> 4:138383360(83f9000) 5:524288(80000) 6:0(0) 7:0(0) 
amik> [1] retry: count=0, putp=0, getp=0, request=0
amik> 
amik> [1] reply: putp=0, getp=0, request=0
amik> 
amik> waiting ack
amik> 
amik> waiting to send
amik> [1]   1: type=VWRITE, dest=1, seq=0x0, rmt_proc=4555, rmt_addr=83f9000,
amik> loc_proc=0, loc_addr=0, length=80000, retry=0
amik> [1]   2: type=VWRITE, dest=1, seq=0x0, rmt_proc=4555, rmt_addr=83f9000,
amik> loc_proc=0, loc_addr=0, length=80000, retry=0
amik> [1]   3: type=VWRITE, dest=1, seq=0x0, rmt_proc=4555, rmt_addr=83f9000,
amik> loc_proc=0, loc_addr=0, length=80000, retry=0
amik> [1]   4: type=VWRITE, dest=1, seq=0x0, rmt_proc=4555, rmt_addr=83f9000,
amik> loc_proc=0, loc_addr=0, length=80000, retry=0
amik> 
amik> message ack info
amik> [1] REL-MSG:(1) sack=0, rack=1, stat=8 
amik> rma ack info
amik> pmWrite: Invalid argument(22)
amik> 
amik> (VREAD)
amik> 
amik> | amik at stokes 11:17:27 sbin> ./rpmtest cn2 myrinet2k -vread -len 524288
amik> -iter 10000
amik> [1] Send Error: myriRead: Invalid argument(22)(40016): 00000001 00000000
amik> 00000000 00001261 083f9000 00080000 00000000 00000000 
amik> [1] chan=0, crc=0, unknown=0, nres=0, arep=0
amik> [1] recv: recv=0, ack=0, nack=0, write=0, wack=0, read=0, rreply=0,
amik> discard=0
amik> [1] recv: put_addr=32, get_addr=32
amik> [1] recv: error=0(0), data=0:0(0) 1:0(0) 2:0(0) 3:0(0) 4:0(0) 5:0(0)
amik> 6:0(0) 7:0(0) 
amik> [1] send: send=0, ack=0, nack=0, write=0, wack=0, read=0, rreply=0,
amik> resend=0,0
amik> [1] send: request=0, disable=ffffffff, deactivated=ffffffff,
amik> last_write=(0, 0), last_read=(0, 0)
amik> [1] send: putp=5, getp=1, relp=1
amik> [1] send: error=0(0), data=0:1(1) 1:0(0) 2:0(0) 3:4705(1261)
amik> 4:138383360(83f9000) 5:524288(80000) 6:0(0) 7:0(0) 
amik> [1] retry: count=0, putp=0, getp=0, request=0
amik> 
amik> [1] reply: putp=0, getp=0, request=0
amik> 
amik> waiting ack
amik> 
amik> waiting to send
amik> [1]   1: type=VREAD_REQUEST, dest=1, seq=0x0, rmt_proc=4705,
amik> rmt_addr=83f9000, loc_proc=0, loc_addr=0, length=80000, retry=0
amik> [1]   2: type=VREAD_REQUEST, dest=1, seq=0x0, rmt_proc=4705,
amik> rmt_addr=83f9000, loc_proc=0, loc_addr=0, length=80000, retry=0
amik> [1]   3: type=VREAD_REQUEST, dest=1, seq=0x0, rmt_proc=4705,
amik> rmt_addr=83f9000, loc_proc=0, loc_addr=0, length=80000, retry=0
amik> [1]   4: type=VREAD_REQUEST, dest=1, seq=0x0, rmt_proc=4705,
amik> rmt_addr=83f9000, loc_proc=0, loc_addr=0, length=80000, retry=0
amik> 
amik> message ack info
amik> [1] REL-MSG:(1) sack=0, rack=1, stat=8 
amik> rma ack info
amik> pmRead: Invalid argument(22)
amik> 
amik> 
amik> > 2) HPL test with setting BCASTs to 0 and 1 and N=100000
amik> >   1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
amik> > 
amik> > Shinji.
amik> 
amik> Ok I will run the tests back to you in 1-2 hours with the results.
amik> 
amik> 
amik> Thank you very much for your help,
amik> 
amik> Amik St-Cyr
amik> 
amik> 
amik> > 
amik> > From: Amik St-Cyr CFD Lab <amik at cfdlab.mcgill.ca>
amik> > Subject: Re: [SCore-users] Bug found!?
amik> > Date: 26 Sep 2002 10:17:35 -0400
amik> > Message-ID: <1033049855.3499.166.camel at stan.cfdlab.mcgill.ca>
amik> > 
amik> > amik> On Thu, 2002-09-26 at 01:51, Shinji Sumimoto wrote:
amik> > amik> > Hi.
amik> > amik> > 
amik> > amik> > From: Amik St-Cyr CFD Lab <amik at cfdlab.mcgill.ca>
amik> > amik> > Subject: Re: [SCore-users] Bug found!?
amik> > amik> > Date: 25 Sep 2002 09:22:17 -0400
amik> > amik> > Message-ID: <1032960138.3499.62.camel at stan.cfdlab.mcgill.ca>
amik> > amik> > 
amik> > amik> > amik> On Tue, 2002-09-24 at 20:37, Shinji Sumimoto wrote:
amik> > amik> > amik> > Hi.
amik> > amik> > amik> > 
amik> > amik> > amik> > This situation is very curious.
amik> > amik> > amik> > 
amik> > amik> > amik> > The message says some messages are lost without a CRC error.  If this
amik> > amik> > amik> > is true, something is wrong with SCore MPI or PM/Myrinet.
amik> > amik> > amik> > 
amik> > amik> > amik> 
amik> > amik> > amik> For 80% of 384Gigs the maximal matrix size is about N=203000.
amik> > amik> > amik> I have tried N=200000 with a memory fill per node of 2.4 Gigs/3 Gigs.
amik> > amik> > amik> 
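
The N=203000 figure can be reproduced with a short calculation. The sketch
below assumes 8-byte double precision matrix elements, reads "384Gigs" as
384 GiB, and takes 128 nodes from the "128x2" shown in the scrun output.
The function names are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Integer square root: largest r with r*r <= x (binary search). */
static uint64_t isqrt_u64(uint64_t x)
{
    uint64_t lo = 0, hi = 0xFFFFFFFFu;   /* hi*hi still fits in 64 bits */

    while (lo < hi) {
        uint64_t mid = lo + (hi - lo + 1) / 2;
        if (mid * mid <= x)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

/* Largest N such that an N x N matrix of 8-byte doubles fits in
 * `percent` percent of `total_bytes`. */
static uint64_t hpl_max_n(uint64_t total_bytes, unsigned percent)
{
    assert(percent <= 100);
    return isqrt_u64(total_bytes * percent / 100 / 8);
}

/* Matrix bytes held by each node when the matrix is spread evenly. */
static uint64_t matrix_bytes_per_node(uint64_t n, unsigned nodes)
{
    assert(nodes > 0);
    return n * n * 8 / nodes;
}
```

With these assumptions, hpl_max_n(384ULL << 30, 80) gives 203055, which
agrees with the quoted N=203000, and matrix_bytes_per_node(200000, 128)
gives 2500000000 bytes per node, close to the quoted 2.4 Gigs.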
amik> > amik> > amik> > Your Ns is 120000. How much memory does your cluster have?
amik> > amik> > amik> > No swapping out memory?
amik> > amik> > amik> > 
amik> > amik> > amik> > Could you use /opt/score/share/lanai/lanaiM2k-safe.mcp instead of 
amik> > amik> > amik> > /opt/score/share/lanai/lanaiM2k.mcp(default).
amik> > amik> > amik> > 
amik> > amik> > amik> 
amik> > amik> > amik> Yes for N=120000 It takes forever with lanaiM2K-safe. Once on a 
amik> > amik> > amik> good day the system managed to do N=100000 with a 347GF score. 
amik> > amik> > amik> What we dont have is stability of the overall system.
amik> > amik> > amik> 
amik> > amik> > amik> I have tried N=120000 last night and the system had not finished
amik> > amik> > amik> this morning.
amik> > amik> > amik> 
amik> > amik> > amik> I will retry N=100000 with lanaiM2k-safe (it took 33min when we were
amik> > amik> > amik> using lanaiM2k).
amik> > amik> > 
amik> > amik> > Same situation...
amik> > amik> > 
amik> > amik> > Have you restarted scoreboard? 
amik> > amik> > # /etc/rc.d/init.d/scoreboard reload
amik> > amik> > 
amik> > amik> > I would like to separate whether this problem depends on MPICH or PM/Myrinet.
amik> > amik> > So, could you test HPL on following patterns? 
amik> > amik> > 
amik> > amik> > 1) set BCASTs to 0 and 1
amik> > amik> >   1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
amik> > amik> > 
amik> > amik> 
amik> > amik> ---------------------
amik> > amik> rg
amik> > amik> ---------------------
amik> > amik> 
amik> > amik> 
amik> > amik> | amik at stokes 09:01:14 Linux_ATHLON_CBLAS> scrun ./xhpl 
amik> > amik> SCore-D 5.0.1 connected.
amik> > amik> <0:0> SCORE: 256 nodes (128x2) ready.
amik> > amik> ============================================================================
amik> > amik> HPLinpack 1.0  --  High-Performance Linpack benchmark  --  September 27,
amik> > amik> 2000
amik> > amik> Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs., 
amik> > amik> UTK
amik> > amik> ============================================================================
amik> > amik> 
amik> > amik> An explanation of the input/output parameters follows:
amik> > amik> T/V    : Wall time / encoded variant.
amik> > amik> N      : The order of the coefficient matrix A.
amik> > amik> NB     : The partitioning blocking factor.
amik> > amik> P      : The number of process rows.
amik> > amik> Q      : The number of process columns.
amik> > amik> Time   : Time in seconds to solve the linear system.
amik> > amik> Gflops : Rate of execution for solving the linear system.
amik> > amik> 
amik> > amik> The following parameter values will be used:
amik> > amik> 
amik> > amik> N      :   10000 
amik> > amik> NB     :     100 
amik> > amik> P      :      16 
amik> > amik> Q      :      16 
amik> > amik> PFACT  :   Crout 
amik> > amik> NBMIN  :       1 
amik> > amik> NDIV   :      16 
amik> > amik> RFACT  :   Right 
amik> > amik> BCAST  :   1ring 
amik> > amik> DEPTH  :       1 
amik> > amik> SWAP   : Mix (threshold = 16)
amik> > amik> L1     : transposed form
amik> > amik> U      : transposed form
amik> > amik> EQUIL  : yes
amik> > amik> ALIGN  : 8 double precision words
amik> > amik> 
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> 
amik> > amik> - The matrix A is randomly generated for each test.
amik> > amik> - The following scaled residual checks will be computed:
amik> > amik>    1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
amik> > amik>    2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
amik> > amik>    3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
amik> > amik> - The relative machine precision (eps) is taken to be         
amik> > amik> 2.220446e-16
amik> > amik> - Computational tests pass if scaled residuals are less than          
amik> > amik> 16.0
amik> > amik> 
amik> > amik> ============================================================================
amik> > amik> T/V                N    NB     P     Q               Time            
amik> > amik> Gflops
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> W10R16C1        10000   100    16    16               8.10         8.234e+01
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> ||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0285008 ......
amik> > amik> PASSED
amik> > amik> ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0067441 ......
amik> > amik> PASSED
amik> > amik> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0015075 ......
amik> > amik> PASSED
amik> > amik> ============================================================================
amik> > amik> 
amik> > amik> Finished      1 tests with the following results:
amik> > amik>               1 tests completed and passed residual checks,
amik> > amik>               0 tests completed and failed residual checks,
amik> > amik>               0 tests skipped because of illegal input values.
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> 
amik> > amik> End of Tests.
amik> > amik> ============================================================================
amik> > amik> 
amik> > amik> ----------------------
amik> > amik> rM
amik> > amik> ----------------------
amik> > amik> 
amik> > amik> 
amik> > amik> | amik at stokes 09:02:12 Linux_ATHLON_CBLAS> scrun ./xhpl 
amik> > amik> SCore-D 5.0.1 connected.
amik> > amik> <0:0> SCORE: 256 nodes (128x2) ready.
amik> > amik> ============================================================================
amik> > amik> HPLinpack 1.0  --  High-Performance Linpack benchmark  --  September 27,
amik> > amik> 2000
amik> > amik> Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs., 
amik> > amik> UTK
amik> > amik> ============================================================================
amik> > amik> 
amik> > amik> An explanation of the input/output parameters follows:
amik> > amik> T/V    : Wall time / encoded variant.
amik> > amik> N      : The order of the coefficient matrix A.
amik> > amik> NB     : The partitioning blocking factor.
amik> > amik> P      : The number of process rows.
amik> > amik> Q      : The number of process columns.
amik> > amik> Time   : Time in seconds to solve the linear system.
amik> > amik> Gflops : Rate of execution for solving the linear system.
amik> > amik> 
amik> > amik> The following parameter values will be used:
amik> > amik> 
amik> > amik> N      :   10000 
amik> > amik> NB     :     100 
amik> > amik> P      :      16 
amik> > amik> Q      :      16 
amik> > amik> PFACT  :   Crout 
amik> > amik> NBMIN  :       1 
amik> > amik> NDIV   :      16 
amik> > amik> RFACT  :   Right 
amik> > amik> BCAST  :  1ringM 
amik> > amik> DEPTH  :       1 
amik> > amik> SWAP   : Mix (threshold = 16)
amik> > amik> L1     : transposed form
amik> > amik> U      : transposed form
amik> > amik> EQUIL  : yes
amik> > amik> ALIGN  : 8 double precision words
amik> > amik> 
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> 
amik> > amik> - The matrix A is randomly generated for each test.
amik> > amik> - The following scaled residual checks will be computed:
amik> > amik>    1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
amik> > amik>    2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
amik> > amik>    3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
amik> > amik> - The relative machine precision (eps) is taken to be         
amik> > amik> 2.220446e-16
amik> > amik> - Computational tests pass if scaled residuals are less than          
amik> > amik> 16.0
amik> > amik> 
amik> > amik> ============================================================================
amik> > amik> T/V                N    NB     P     Q               Time            
amik> > amik> Gflops
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> W11R16C1        10000   100    16    16               9.24         
amik> > amik> 7.220e+01
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> ||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0256112 ......
amik> > amik> PASSED
amik> > amik> ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0060604 ......
amik> > amik> PASSED
amik> > amik> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0013546 ......
amik> > amik> PASSED
amik> > amik> ============================================================================
amik> > amik> 
amik> > amik> Finished      1 tests with the following results:
amik> > amik>               1 tests completed and passed residual checks,
amik> > amik>               0 tests completed and failed residual checks,
amik> > amik>               0 tests skipped because of illegal input values.
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> 
amik> > amik> End of Tests.
amik> > amik> 
amik> > amik> ---------------------
amik> > amik> 2rM
amik> > amik> ---------------------
amik> > amik> | amik at stokes 09:04:14 Linux_ATHLON_CBLAS> scrun ./xhpl 
amik> > amik> SCore-D 5.0.1 connected.
amik> > amik> <0:0> SCORE: 256 nodes (128x2) ready.
amik> > amik> ============================================================================
amik> > amik> HPLinpack 1.0  --  High-Performance Linpack benchmark  --  September 27,
amik> > amik> 2000
amik> > amik> Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs., 
amik> > amik> UTK
amik> > amik> ============================================================================
amik> > amik> 
amik> > amik> An explanation of the input/output parameters follows:
amik> > amik> T/V    : Wall time / encoded variant.
amik> > amik> N      : The order of the coefficient matrix A.
amik> > amik> NB     : The partitioning blocking factor.
amik> > amik> P      : The number of process rows.
amik> > amik> Q      : The number of process columns.
amik> > amik> Time   : Time in seconds to solve the linear system.
amik> > amik> Gflops : Rate of execution for solving the linear system.
amik> > amik> 
amik> > amik> The following parameter values will be used:
amik> > amik> 
amik> > amik> N      :   10000 
amik> > amik> NB     :     100 
amik> > amik> P      :      16 
amik> > amik> Q      :      16 
amik> > amik> PFACT  :   Crout 
amik> > amik> NBMIN  :       1 
amik> > amik> NDIV   :      16 
amik> > amik> RFACT  :   Right 
amik> > amik> BCAST  :  2ringM 
amik> > amik> DEPTH  :       1 
amik> > amik> SWAP   : Mix (threshold = 16)
amik> > amik> L1     : transposed form
amik> > amik> U      : transposed form
amik> > amik> EQUIL  : yes
amik> > amik> ALIGN  : 8 double precision words
amik> > amik> 
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> 
amik> > amik> - The matrix A is randomly generated for each test.
amik> > amik> - The following scaled residual checks will be computed:
amik> > amik>    1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
amik> > amik>    2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
amik> > amik>    3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
amik> > amik> - The relative machine precision (eps) is taken to be         
amik> > amik> 2.220446e-16
amik> > amik> - Computational tests pass if scaled residuals are less than          
amik> > amik> 16.0
amik> > amik> 
amik> > amik> ============================================================================
amik> > amik> T/V                N    NB     P     Q               Time            
amik> > amik> Gflops
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> W13R16C1        10000   100    16    16               9.60         
amik> > amik> 6.947e+01
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> ||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0287260 ......
amik> > amik> PASSED
amik> > amik> ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0067974 ......
amik> > amik> PASSED
amik> > amik> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0015194 ......
amik> > amik> PASSED
amik> > amik> ============================================================================
amik> > amik> 
amik> > amik> Finished      1 tests with the following results:
amik> > amik>               1 tests completed and passed residual checks,
amik> > amik>               0 tests completed and failed residual checks,
amik> > amik>               0 tests skipped because of illegal input values.
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> 
amik> > amik> End of Tests.
amik> > amik> ============================================================================
amik> > amik> 
amik> > amik> 
amik> > amik> > 2) with mpi_zerocopy without mpi_eager option
amik> > amik> > 
amik> > amik> > scrun -nodes=128x2,mpi_zerocopy=on ./xhpl 
amik> > amik> > 
amik> > amik> 
amik> > amik> ----------------------
amik> > amik> rM
amik> > amik> ----------------------
amik> > amik> 
amik> > amik> | amik at stokes 09:05:08 Linux_ATHLON_CBLAS> scrun
amik> > amik> -nodes=128x2,mpi_zerocopy=on ./xhpl 
amik> > amik> SCore-D 5.0.1 connected.
amik> > amik> <0:0> SCORE: 256 nodes (128x2) ready.
amik> > amik> ============================================================================
amik> > amik> HPLinpack 1.0  --  High-Performance Linpack benchmark  --  September 27,
amik> > amik> 2000
amik> > amik> Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs., 
amik> > amik> UTK
amik> > amik> ============================================================================
amik> > amik> 
amik> > amik> An explanation of the input/output parameters follows:
amik> > amik> T/V    : Wall time / encoded variant.
amik> > amik> N      : The order of the coefficient matrix A.
amik> > amik> NB     : The partitioning blocking factor.
amik> > amik> P      : The number of process rows.
amik> > amik> Q      : The number of process columns.
amik> > amik> Time   : Time in seconds to solve the linear system.
amik> > amik> Gflops : Rate of execution for solving the linear system.
amik> > amik> 
amik> > amik> The following parameter values will be used:
amik> > amik> 
amik> > amik> N      :   10000 
amik> > amik> NB     :     100 
amik> > amik> P      :      16 
amik> > amik> Q      :      16 
amik> > amik> PFACT  :   Crout 
amik> > amik> NBMIN  :       1 
amik> > amik> NDIV   :      16 
amik> > amik> RFACT  :   Right 
amik> > amik> BCAST  :   1ring 
amik> > amik> DEPTH  :       1 
amik> > amik> SWAP   : Mix (threshold = 16)
amik> > amik> L1     : transposed form
amik> > amik> U      : transposed form
amik> > amik> EQUIL  : yes
amik> > amik> ALIGN  : 8 double precision words
amik> > amik> 
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> 
amik> > amik> - The matrix A is randomly generated for each test.
amik> > amik> - The following scaled residual checks will be computed:
amik> > amik>    1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
amik> > amik>    2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
amik> > amik>    3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
amik> > amik> - The relative machine precision (eps) is taken to be         
amik> > amik> 2.220446e-16
amik> > amik> - Computational tests pass if scaled residuals are less than          
amik> > amik> 16.0
amik> > amik> 
amik> > amik> [117] Receive Error: myriIsReadDone: Bad address(14)(1000e): 000011d0
amik> > amik> 00000041 0000c468 00001718 00007080 00001180 000052a8 08978000 
amik> > amik> [117] Receive Error: myriIsReadDone: Bad address(14)(1000e): 000011d0
amik> > amik> 00000041 0000c468 00007080 00007080 000011d0 00005968 0000c000 
amik> > amik> comp_is_read_done(0x85bc7a0): pmIsReadDone(0x85bec48): Bad address(14)
amik> > amik> <117:1> SCORE:WARNING MPICH/SCore: pmIsReadDone(pmc=0x85bc7a0) failed,
amik> > amik> errno=14
amik> > amik> <117:1> SCORE:PANIC MPICH/SCore: critical error on message transfer
amik> > amik> <117:1> Trying to attach GDB (DISPLAY=localhost:10.0): PANIC
amik> > amik> comp_is_read_done(0x85bc7a0): pmIsReadDone(0x85bec48): Bad address(14)
amik> > amik> <117:0> SCORE:WARNING MPICH/SCore: pmIsReadDone(pmc=0x85bc7a0) failed,
amik> > amik> errno=14
amik> > amik> <117:0> SCORE:PANIC MPICH/SCore: critical error on message transfer
amik> > amik> <117:0> Trying to attach GDB (DISPLAY=localhost:10.0): PANIC
amik> > amik> SCORE: Program aborted.
amik> > 
amik> > This error shows that the PM/Myrinet firmware on the Myrinet NIC
amik> > failed to read the page table of the destination user space.
amik> > 
amik> > If this error occurred on node #117 (cn118?), it may be a hardware
amik> > problem on that node. 
amik> > 
amik> > amik>cn118
amik> > amik>8	759960
amik> > 
amik> > amik> ----------------------
amik> > amik> rM
amik> > amik> ----------------------
amik> > amik> 
amik> > amik> 
amik> > amik> | amik at stokes 09:06:31 Linux_ATHLON_CBLAS> scrun
amik> > amik> -nodes=128x2,mpi_zerocopy=on ./xhpl 
amik> > amik> SCore-D 5.0.1 connected.
amik> > amik> <0:0> SCORE: 256 nodes (128x2) ready.
amik> > amik> ============================================================================
amik> > amik> HPLinpack 1.0  --  High-Performance Linpack benchmark  --  September 27,
amik> > amik> 2000
amik> > amik> Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs., 
amik> > amik> UTK
amik> > amik> ============================================================================
amik> > amik> 
amik> > amik> An explanation of the input/output parameters follows:
amik> > amik> T/V    : Wall time / encoded variant.
amik> > amik> N      : The order of the coefficient matrix A.
amik> > amik> NB     : The partitioning blocking factor.
amik> > amik> P      : The number of process rows.
amik> > amik> Q      : The number of process columns.
amik> > amik> Time   : Time in seconds to solve the linear system.
amik> > amik> Gflops : Rate of execution for solving the linear system.
amik> > amik> 
amik> > amik> The following parameter values will be used:
amik> > amik> 
amik> > amik> N      :   10000 
amik> > amik> NB     :     100 
amik> > amik> P      :      16 
amik> > amik> Q      :      16 
amik> > amik> PFACT  :   Crout 
amik> > amik> NBMIN  :       1 
amik> > amik> NDIV   :      16 
amik> > amik> RFACT  :   Right 
amik> > amik> BCAST  :  1ringM 
amik> > amik> DEPTH  :       1 
amik> > amik> SWAP   : Mix (threshold = 16)
amik> > amik> L1     : transposed form
amik> > amik> U      : transposed form
amik> > amik> EQUIL  : yes
amik> > amik> ALIGN  : 8 double precision words
amik> > amik> 
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> 
amik> > amik> - The matrix A is randomly generated for each test.
amik> > amik> - The following scaled residual checks will be computed:
amik> > amik>    1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
amik> > amik>    2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
amik> > amik>    3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
amik> > amik> - The relative machine precision (eps) is taken to be         
amik> > amik> 2.220446e-16
amik> > amik> - Computational tests pass if scaled residuals are less than          
amik> > amik> 16.0
amik> > amik> 
amik> > amik> ============================================================================
amik> > amik> T/V                N    NB     P     Q               Time            
amik> > amik> Gflops
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> W11R16C1        10000   100    16    16              10.18         
amik> > amik> 6.550e+01
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> ||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0300470 ......
amik> > amik> PASSED
amik> > amik> ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0071100 ......
amik> > amik> PASSED
amik> > amik> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0015892 ......
amik> > amik> PASSED
amik> > amik> ============================================================================
amik> > amik> 
amik> > amik> Finished      1 tests with the following results:
amik> > amik>               1 tests completed and passed residual checks,
amik> > amik>               0 tests completed and failed residual checks,
amik> > amik>               0 tests skipped because of illegal input values.
amik> > amik> ----------------------------------------------------------------------------
amik> > amik> 
amik> > amik> End of Tests.
amik> > amik> ============================================================================
amik> > amik> 
amik> > amik> 
amik> > amik> > 
amik> > amik> > When these tests do not help with analysis of the situation, 
amik> > amik> > I will send special Myrinet control programs.
amik> > amik> > 
amik> > amik> > By the way, you are using Athlon processor, so
amik> > amik> > 
amik> > amik> > What chipset does the MB have? Early 760MP-X chipsets show very curious
amik> > amik> > behavior on zero-copy communication.
amik> > amik> > 
amik> > amik> 
amik> > amik> Yes Indeed, we have those.
amik> > amik> 
amik> > amik> > If you have a time, I recommend bustest (PCI DMA bandwidth performance
amik> > amik> > measurement) and communication test of PM message and zero-copy
amik> > amik> > communication .
amik> > amik> > 
amik> > amik> 
amik> > amik> We have time, we want this project to work. We will do, I will ask our
amik> > amik> technician to do the testing.
amik> > amik> 
amik> > amik> PM/Myrinet Test Procedure:
amik> > amik> 1)
amik> > amik> myrinet
amik> > amik> 
amik> > amik> | amik at stokes 09:24:26 sbin> ./rpminit cn1 myrinet
amik> > amik> | amik at stokes 09:27:13 sbin> ./rpmtest cn1 myrinet -dest 0 -ping
amik> > amik> 8	5.8716e-06
amik> > amik> 
amik> > amik> myrinet2k
amik> > amik> | amik at stokes 10:14:19 sbin> ./rpminit cn1 myrinet2k
amik> > amik> | amik at stokes 10:14:24 sbin> ./rpmtest cn1 myrinet2k -dest 0 -ping
amik> > amik> 8	7.49937e-06
amik> > amik> 
amik> > amik> (see 3 (slow node 61))
amik> > amik> | amik at stokes 10:15:46 sbin> ./rpminit cn61 myrinet2k
amik> > amik> | amik at stokes 10:15:53 sbin> ./rpmtest cn61 myrinet2k -dest 60 -ping
amik> > amik> 8	1.03231e-05
amik> > amik> 
amik> > amik> 2)
amik> > amik> 
amik> > amik> Myrinet:
amik> > amik> 
amik> > amik> Shell A:
amik> > amik> | amik at stokes 09:29:47 sbin> ./rpmtest cn2 myrinet -reply    
amik> > amik> 
amik> > amik> Shell B:
amik> > amik> | amik at stokes 09:30:57 sbin> ./rpmtest cn1 myrinet -dest 1 -ping 
amik> > amik> 8	1.00363e-05
amik> > amik> 
amik> > amik> Myrinet2k:
amik> > amik> Shell A:
amik> > amik> ./rpmtest cn2 myrinet2k -reply
amik> > amik> 
amik> > amik> Shell B:
amik> > amik> | amik at stokes 10:13:14 sbin> ./rpmtest cn1 myrinet2k -dest 1 -ping
amik> > amik> 8	1.20059e-05
amik> > amik> 
amik> > amik> (see 3 (slow node 61))
amik> > amik> Shell A:
amik> > amik> | amik at stokes 10:18:09 sbin> ./rpmtest cn61 myrinet2k -reply 
amik> > amik> 
amik> > amik> Shell B:
amik> > amik> | amik at stokes 10:18:27 sbin> ./rpmtest cn1 myrinet2k -dest 60 -ping
amik> > amik> 8	1.4752e-05
amik> > amik> 
amik> > amik> 
amik> > amik> 3)
amik> > amik> Shell A:
amik> > amik> | amik at stokes 09:56:39 sbin> ./rpmtest cn3 myrinet2k -vreply
amik> > amik> 
amik> > amik> Shell B:
amik> > amik> | amik at stokes 10:18:33 sbin> cat ~/bin/do_all 
amik> > amik> #!/bin/bash
amik> > amik> 
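amik> > amik> # Collect node names: third column of /etc/hosts lines containing "cn".
amik> > amik> liste=$(awk '/cn/ {print $3}' /etc/hosts)
amik> > amik> 
amik> > amik> # Run the vwrite test from each node to the destination given as $1.
amik> > amik> for i in $liste; do
amik> > amik> echo "$i"
amik> > amik> /opt/score/sbin/rpmtest "$i" myrinet2k -dest "$1" -vwrite
amik> > amik> done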
amik> > amik> 
amik> > amik> 
amik> > amik> | amik at stokes 09:53:47 sbin> do_all 2
amik> > amik> cn1
amik> > amik> 8	1.21528e+06
amik> > amik> cn2
amik> > amik> 8	1.21555e+06
amik> > amik> cn3
amik> > amik> myriMapLANai(0, 0xbffff794, 0): pm_open("/dev/pmmyri/0", O_RDWR, 0): Device or resource busy(16)
amik> > amik> myriOpenDevice("/var/scored/scoreboard/stokes.0000Z000C-WE", "/var/scored/scoreboard/stokes.0000Z0006MGS", 0xbffff7d8): myriMapLANai(0, 0xbffff794): Device or resource busy(16)
amik> > amik> myri_open_device(0, 0xbffff9c0, 0x80adc80): myriOpenDevice("/var/scored/scoreboard/stokes.0000Z000C-WE", "/var/scored/scoreboard/stokes.0000Z0006MGS", 0xbffff7d8): Device or resource busy(16)
amik> > amik> pmOpenDevice: Device or resource busy(16)
amik> > amik> cn4
amik> > amik> 8	1.22084e+06
amik> > amik> cn5
amik> > amik> 8	1.22031e+06
amik> > amik> cn6
amik> > amik> 8	1.21936e+06
amik> > amik> cn7
amik> > amik> 8	1.21954e+06
amik> > amik> cn8
amik> > amik> 8	1.21908e+06
amik> > amik> cn9
amik> > amik> 8	1.17938e+06
amik> > amik> cn10
amik> > amik> 8	1.18443e+06
amik> > amik> cn11
amik> > amik> 8	1.18103e+06
amik> > amik> cn12
amik> > amik> 8	1.17963e+06
amik> > amik> cn13
amik> > amik> 8	1.18098e+06
amik> > amik> cn14
amik> > amik> 8	1.20557e+06
amik> > amik> cn15
amik> > amik> 8	1.17934e+06
amik> > amik> cn16
amik> > amik> 8	1.17928e+06
amik> > amik> cn17
amik> > amik> 8	1.17943e+06
amik> > amik> cn18
amik> > amik> 8	1.17936e+06
amik> > amik> cn19
amik> > amik> 8	1.17979e+06
amik> > amik> cn20
amik> > amik> 8	1.18086e+06
amik> > amik> cn21
amik> > amik> 8	1.17987e+06
amik> > amik> cn22
amik> > amik> 8	1.17724e+06
amik> > amik> cn23
amik> > amik> 8	1.18379e+06
amik> > amik> cn24
amik> > amik> 8	1.18124e+06
amik> > amik> cn25
amik> > amik> 8	1.17841e+06
amik> > amik> cn26
amik> > amik> 8	1.19677e+06
amik> > amik> cn27
amik> > amik> 8	1.17885e+06
amik> > amik> cn28
amik> > amik> 8	1.18305e+06
amik> > amik> cn29
amik> > amik> 8	1.18064e+06
amik> > amik> cn30
amik> > amik> 8	1.17977e+06
amik> > amik> cn31
amik> > amik> 8	1.17963e+06
amik> > amik> cn32
amik> > amik> 8	1.17994e+06
amik> > amik> cn33
amik> > amik> 8	1.17912e+06
amik> > amik> cn34
amik> > amik> 8	1.19678e+06
amik> > amik> cn35
amik> > amik> 8	1.20869e+06
amik> > amik> cn36
amik> > amik> 8	1.18085e+06
amik> > amik> cn37
amik> > amik> 8	1.18299e+06
amik> > amik> cn38
amik> > amik> 8	1.17977e+06
amik> > amik> cn39
amik> > amik> 8	1.18088e+06
amik> > amik> cn40
amik> > amik> 8	1.18038e+06
amik> > amik> cn41
amik> > amik> 8	1.17923e+06
amik> > amik> cn42
amik> > amik> 8	1.17943e+06
amik> > amik> cn43
amik> > amik> 8	1.18067e+06
amik> > amik> cn44
amik> > amik> 8	1.18168e+06
amik> > amik> cn45
amik> > amik> 8	1.18139e+06
amik> > amik> cn46
amik> > amik> 8	1.18099e+06
amik> > amik> cn47
amik> > amik> 8	1.18203e+06
amik> > amik> cn48
amik> > amik> 8	1.18082e+06
amik> > amik> cn49
amik> > amik> 8	1.17946e+06
amik> > amik> cn50
amik> > amik> 8	1.1803e+06
amik> > amik> cn51
amik> > amik> 8	1.1804e+06
amik> > amik> cn52
amik> > amik> 8	1.18191e+06
amik> > amik> cn53
amik> > amik> 8	1.18401e+06
amik> > amik> cn54
amik> > amik> 8	1.18331e+06
amik> > amik> cn55
amik> > amik> 8	1.18326e+06
amik> > amik> cn56
amik> > amik> 8	1.18173e+06
amik> > amik> cn57
amik> > amik> 8	1.18018e+06
amik> > amik> cn58
amik> > amik> 8	1.18091e+06
amik> > amik> cn59
amik> > amik> 8	1.18022e+06
amik> > amik> cn60
amik> > amik> 8	1.1818e+06
amik> > amik> cn61
amik> > amik> 8	755405
amik> > amik> cn62
amik> > amik> 8	1.18278e+06
amik> > amik> cn63
amik> > amik> 8	1.20865e+06
amik> > amik> cn64
amik> > amik> 8	1.18166e+06
amik> > amik> cn65
amik> > amik> 8	1.18006e+06
amik> > amik> cn66
amik> > amik> 8	1.17926e+06
amik> > amik> cn67
amik> > amik> 8	1.17955e+06
amik> > amik> cn68
amik> > amik> 8	1.20513e+06
amik> > amik> cn69
amik> > amik> 8	1.17968e+06
amik> > amik> cn70
amik> > amik> 8	1.17986e+06
amik> > amik> cn71
amik> > amik> 8	1.18094e+06
amik> > amik> cn72
amik> > amik> 8	1.18195e+06
amik> > amik> cn73
amik> > amik> 8	1.19813e+06
amik> > amik> cn74
amik> > amik> 8	1.18053e+06
amik> > amik> cn75
amik> > amik> 8	1.17967e+06
amik> > amik> cn76
amik> > amik> 8	1.17978e+06
amik> > amik> cn77
amik> > amik> 8	1.17905e+06
amik> > amik> cn78
amik> > amik> 8	1.18084e+06
amik> > amik> cn79
amik> > amik> 8	1.18611e+06
amik> > amik> cn80
amik> > amik> 8	755487
amik> > amik> cn81
amik> > amik> 8	1.17952e+06
amik> > amik> cn82
amik> > amik> 8	1.17863e+06
amik> > amik> cn83
amik> > amik> 8	1.17988e+06
amik> > amik> cn84
amik> > amik> 8	1.20254e+06
amik> > amik> cn85
amik> > amik> 8	1.20848e+06
amik> > amik> cn86
amik> > amik> 8	1.18039e+06
amik> > amik> cn87
amik> > amik> 8	1.17919e+06
amik> > amik> cn88
amik> > amik> 8	1.17975e+06
amik> > amik> cn89
amik> > amik> 8	755277
amik> > amik> cn90
amik> > amik> 8	1.1792e+06
amik> > amik> cn91
amik> > amik> 8	1.18081e+06
amik> > amik> cn92
amik> > amik> 8	1.17974e+06
amik> > amik> cn93
amik> > amik> 8	1.17997e+06
amik> > amik> cn94
amik> > amik> 8	1.17914e+06
amik> > amik> cn95
amik> > amik> 8	1.18022e+06
amik> > amik> cn96
amik> > amik> 8	1.17897e+06
amik> > amik> cn97
amik> > amik> 8	1.18451e+06
amik> > amik> cn98
amik> > amik> 8	1.18447e+06
amik> > amik> cn99
amik> > amik> 8	1.20977e+06
amik> > amik> cn100
amik> > amik> 8	1.18888e+06
amik> > amik> cn101
amik> > amik> 8	1.18841e+06
amik> > amik> cn102
amik> > amik> 8	1.18935e+06
amik> > amik> cn103
amik> > amik> 8	1.18774e+06
amik> > amik> cn104
amik> > amik> 8	1.18739e+06
amik> > amik> cn105
amik> > amik> 8	1.20806e+06
amik> > amik> cn106
amik> > amik> 8	1.18413e+06
amik> > amik> cn107
amik> > amik> 8	1.18674e+06
amik> > amik> cn108
amik> > amik> 8	1.1868e+06
amik> > amik> cn109
amik> > amik> 8	1.21215e+06
amik> > amik> cn110
amik> > amik> 8	755543
amik> > amik> cn111
amik> > amik> 8	1.18941e+06
amik> > amik> cn112
amik> > amik> 8	1.18727e+06
amik> > amik> cn113
amik> > amik> 8	1.18388e+06
amik> > amik> cn114
amik> > amik> 8	1.18409e+06
amik> > amik> cn115
amik> > amik> 8	1.18515e+06
amik> > amik> cn116
amik> > amik> 8	1.18756e+06
amik> > amik> cn117
amik> > amik> 8	1.18788e+06
amik> > amik> cn118
amik> > amik> 8	759960
amik> > amik> cn119
amik> > amik> 8	1.2122e+06
amik> > amik> cn120
amik> > amik> 8	1.18718e+06
amik> > amik> cn121
amik> > amik> 8	1.18379e+06
amik> > amik> cn122
amik> > amik> 8	1.18409e+06
amik> > amik> cn123
amik> > amik> 8	1.18601e+06
amik> > amik> cn124
amik> > amik> 8	1.18686e+06
amik> > amik> cn125
amik> > amik> 8	1.18783e+06
amik> > amik> cn126
amik> > amik> 8	1.18639e+06
amik> > amik> cn127
amik> > amik> 8	1.18707e+06
amik> > amik> cn128
amik> > amik> 8	1.18881e+06
amik> > amik> 
amik> > amik> Nodes 61, 80, 89, 110, and 118 are slow.
amik> > amik> I repeated the test with ./rpmtest cn2 myrinet2k -vreply, and
amik> > amik> the same nodes are slow: 61, 80, 89, 110, and 118 (!)
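
For a listing this long, the slow nodes can be picked out mechanically rather than by eye. The following is a minimal Python sketch, not part of SCore. It assumes the do_all output alternates a node name line (cnN) with a "length<TAB>value" line, that a lower value means a slower node (as the remark above suggests), and that "slow" means below 80% of the median. The function names and the 0.8 threshold are my own choices for illustration.

```python
import re
import statistics


def parse_do_all(text):
    """Return {node: value} from alternating 'cnN' / 'len<TAB>value' lines.

    Nodes that produced only error text (e.g. "Device or resource busy")
    are skipped.
    """
    values = {}
    node = None
    for line in text.splitlines():
        line = line.strip()
        if re.fullmatch(r"cn\d+", line):
            node = line
            continue
        fields = line.split()
        # A result line has exactly two numeric fields: message length, value.
        if node is not None and len(fields) == 2:
            try:
                int(fields[0])
                values[node] = float(fields[1])
            except ValueError:
                pass  # wrapped error text, not a measurement
    return values


def slow_nodes(values, fraction=0.8):
    """Nodes whose value is below `fraction` of the median (assumed threshold)."""
    median = statistics.median(values.values())
    slow = [n for n, v in values.items() if v < fraction * median]
    return sorted(slow, key=lambda n: int(n[2:]))


# A few lines copied from the output above (tabs written as \t).
sample = """cn2
8\t1.21555e+06
cn3
pmOpenDevice: Device or resource busy(16)
cn4
8\t1.22084e+06
cn60
8\t1.1818e+06
cn61
8\t755405
cn62
8\t1.18278e+06
cn80
8\t755487
"""

print(slow_nodes(parse_do_all(sample)))
```

Using the median rather than the mean keeps the few slow nodes from dragging the reference value down; the threshold would need adjusting for other message lengths or networks.
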
amik> > amik> 
amik> > amik> 
amik> > amik> 4)
amik> > amik> 
amik> > amik> | amik at stokes 09:55:02 deploy> ./scstest -network myrinet2k
amik> > amik> SCSTEST: BURST on myrinet2k(chan=0,ctx=0,len=16)
amik> > amik> 50 K packets.
amik> > amik> 100 K packets.
amik> > amik> 150 K packets.
amik> > amik> 200 K packets.
amik> > amik> 250 K packets.
amik> > amik> 300 K packets.
amik> > amik> ...
amik> > amik> 7150 K packets.
amik> > amik> etc
amik> > amik> 
amik> > amik> Seems to work fine.
amik> > amik> 
amik> > ------
amik> > Shinji Sumimoto, Fujitsu Labs
amik> -- 
amik> _____________________________________________________
amik> Dr. A. St-Cyr
amik> Research Associate, CFD Lab
amik> Department of Mechanical Engineering
amik> McGill University
amik> 688 Sherbrooke Street West, 7th floor
amik> Montreal, Qc, Canada H3A 2S6
amik> Tel: +1 (514) 398-1710, Admin. Fax : 2203
amik> amik at cfdlab.mcgill.ca
amik> _____________________________________________________
amik> 
------
Shinji Sumimoto, Fujitsu Labs


