[SCore-users] Bug found!?

Amik St-Cyr CFD Lab amik at cfdlab.mcgill.ca
Fri Sep 27 00:15:28 JST 2002


On Thu, 2002-09-26 at 10:48, Shinji Sumimoto wrote:
> Hi.
> 
> Thank you very much for sending the test results.
> Could you test the following sequences?
> 
> 1) communication tests:
> 
> recv$ rpmtest cn1 myrinet2k -sink
> recv$ rpmtest cn2 myrinet2k -dest 0 -burst -len 8192

Shell A:
| amik at stokes 11:14:01 sbin> ./rpmtest cn1 myrinet2k -sink

Shell B:
| amik at stokes 11:15:03 sbin> ./rpmtest cn2 myrinet2k -dest 0 -burst -len 8192
8192	7.40536e+07


> 
> recv$ rpmtest cn1 myrinet2k -vreply
> recv$ rpmtest cn2 myrinet2k -dest 0 -vwrite -len 524288 -iter 10000
> recv$ rpmtest cn2 myrinet2k -dest 0 -vread -len 524288 -iter 10000
> 

Shell A:
| amik at stokes 11:16:33 sbin> ./rpmtest cn1 myrinet2k -vreply

Shell B:

(VWRITE)
| amik at stokes 11:17:01 sbin> ./rpmtest cn2 myrinet2k -vwrite -len 524288 -iter 10000
[1] Send Error: myriWrite: Invalid argument(22)(30016): 00000001
00000000 00000000 000011cb 083f9000 00080000 00000000 00000000 
[1] chan=0, crc=0, unknown=0, nres=0, arep=0
[1] recv: recv=0, ack=0, nack=0, write=0, wack=0, read=0, rreply=0,
discard=0
[1] recv: put_addr=32, get_addr=32
[1] recv: error=0(0), data=0:0(0) 1:0(0) 2:0(0) 3:0(0) 4:0(0) 5:0(0)
6:0(0) 7:0(0) 
[1] send: send=0, ack=0, nack=0, write=0, wack=0, read=0, rreply=0,
resend=0,0
[1] send: request=0, disable=ffffffff, deactivated=ffffffff,
last_write=(0, 0), last_read=(0, 0)
[1] send: putp=5, getp=1, relp=1
[1] send: error=0(0), data=0:1(1) 1:0(0) 2:0(0) 3:4555(11cb)
4:138383360(83f9000) 5:524288(80000) 6:0(0) 7:0(0) 
[1] retry: count=0, putp=0, getp=0, request=0

[1] reply: putp=0, getp=0, request=0

waiting ack

waiting to send
[1]   1: type=VWRITE, dest=1, seq=0x0, rmt_proc=4555, rmt_addr=83f9000,
loc_proc=0, loc_addr=0, length=80000, retry=0
[1]   2: type=VWRITE, dest=1, seq=0x0, rmt_proc=4555, rmt_addr=83f9000,
loc_proc=0, loc_addr=0, length=80000, retry=0
[1]   3: type=VWRITE, dest=1, seq=0x0, rmt_proc=4555, rmt_addr=83f9000,
loc_proc=0, loc_addr=0, length=80000, retry=0
[1]   4: type=VWRITE, dest=1, seq=0x0, rmt_proc=4555, rmt_addr=83f9000,
loc_proc=0, loc_addr=0, length=80000, retry=0

message ack info
[1] REL-MSG:(1) sack=0, rack=1, stat=8 
rma ack info
pmWrite: Invalid argument(22)
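
The hexadecimal words in the Send Error line above appear to match the
values that the dump prints in both decimal and hex, for example
3:4555(11cb) and 5:524288(80000). A minimal sketch for converting them by
hand, assuming a POSIX shell whose printf accepts 0x-prefixed integers.
hex_to_dec is a hypothetical helper name, not an SCore tool:

```shell
# hex_to_dec is a hypothetical helper, not part of SCore or rpmtest.
# It prints the decimal value of one hexadecimal word from the dump.
hex_to_dec() {
    printf '%d\n' "0x$1"
}

hex_to_dec 000011cb   # rmt_proc in the queue listing
hex_to_dec 083f9000   # rmt_addr in the queue listing
hex_to_dec 00080000   # length, i.e. the 524288 given to -len
```

Read this way, the queued request carries the length passed with -len,
which at least suggests the arguments reached the send queue before the
call was rejected.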

(VREAD)

| amik at stokes 11:17:27 sbin> ./rpmtest cn2 myrinet2k -vread -len 524288 -iter 10000
[1] Send Error: myriRead: Invalid argument(22)(40016): 00000001 00000000
00000000 00001261 083f9000 00080000 00000000 00000000 
[1] chan=0, crc=0, unknown=0, nres=0, arep=0
[1] recv: recv=0, ack=0, nack=0, write=0, wack=0, read=0, rreply=0,
discard=0
[1] recv: put_addr=32, get_addr=32
[1] recv: error=0(0), data=0:0(0) 1:0(0) 2:0(0) 3:0(0) 4:0(0) 5:0(0)
6:0(0) 7:0(0) 
[1] send: send=0, ack=0, nack=0, write=0, wack=0, read=0, rreply=0,
resend=0,0
[1] send: request=0, disable=ffffffff, deactivated=ffffffff,
last_write=(0, 0), last_read=(0, 0)
[1] send: putp=5, getp=1, relp=1
[1] send: error=0(0), data=0:1(1) 1:0(0) 2:0(0) 3:4705(1261)
4:138383360(83f9000) 5:524288(80000) 6:0(0) 7:0(0) 
[1] retry: count=0, putp=0, getp=0, request=0

[1] reply: putp=0, getp=0, request=0

waiting ack

waiting to send
[1]   1: type=VREAD_REQUEST, dest=1, seq=0x0, rmt_proc=4705,
rmt_addr=83f9000, loc_proc=0, loc_addr=0, length=80000, retry=0
[1]   2: type=VREAD_REQUEST, dest=1, seq=0x0, rmt_proc=4705,
rmt_addr=83f9000, loc_proc=0, loc_addr=0, length=80000, retry=0
[1]   3: type=VREAD_REQUEST, dest=1, seq=0x0, rmt_proc=4705,
rmt_addr=83f9000, loc_proc=0, loc_addr=0, length=80000, retry=0
[1]   4: type=VREAD_REQUEST, dest=1, seq=0x0, rmt_proc=4705,
rmt_addr=83f9000, loc_proc=0, loc_addr=0, length=80000, retry=0

message ack info
[1] REL-MSG:(1) sack=0, rack=1, stat=8 
rma ack info
pmRead: Invalid argument(22)


> 2) HPL test with setting BCASTs to 0 and 1 and N=100000
>   1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
> 
> Shinji.

OK, I will run the tests and get back to you in 1-2 hours with the results.


Thank you very much for your help,

Amik St-Cyr


> 
> From: Amik St-Cyr CFD Lab <amik at cfdlab.mcgill.ca>
> Subject: Re: [SCore-users] Bug found!?
> Date: 26 Sep 2002 10:17:35 -0400
> Message-ID: <1033049855.3499.166.camel at stan.cfdlab.mcgill.ca>
> 
> amik> On Thu, 2002-09-26 at 01:51, Shinji Sumimoto wrote:
> amik> > Hi.
> amik> > 
> amik> > From: Amik St-Cyr CFD Lab <amik at cfdlab.mcgill.ca>
> amik> > Subject: Re: [SCore-users] Bug found!?
> amik> > Date: 25 Sep 2002 09:22:17 -0400
> amik> > Message-ID: <1032960138.3499.62.camel at stan.cfdlab.mcgill.ca>
> amik> > 
> amik> > amik> On Tue, 2002-09-24 at 20:37, Shinji Sumimoto wrote:
> amik> > amik> > Hi.
> amik> > amik> > 
> amik> > amik> > This situation is very curious.
> amik> > amik> > 
> amik> > amik> > The message says some messages are lost without a CRC error.  If this
> amik> > amik> > is true, something is wrong with SCore MPI or PM/Myrinet.
> amik> > amik> > 
> amik> > amik> 
> amik> > amik> For 80% of 384Gigs the maximal matrix size is about N=203000.
> amik> > amik> I have tried N=200000 with a memory fill per node of 2.4 Gigs/3 Gigs.
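
The N=203000 estimate can be reproduced with a short calculation. This is
only a sketch under two assumptions: matrix entries are 8-byte doubles, and
"Gigs" means 2^30 bytes. max_hpl_n is a hypothetical helper name:

```shell
# max_hpl_n is a hypothetical helper: largest N such that an N x N matrix
# of 8-byte doubles fits in the given fraction of total memory (in GiB).
max_hpl_n() {
    awk -v gib="$1" -v frac="$2" \
        'BEGIN { printf "%d\n", sqrt(frac * gib * 2^30 / 8) }'
}

max_hpl_n 384 0.8   # prints 203055
```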
> amik> > amik> 
> amik> > amik> > Your Ns is 120000. How much memory does your cluster have?
> amik> > amik> > Is no memory being swapped out?
> amik> > amik> > 
> amik> > amik> > Could you use /opt/score/share/lanai/lanaiM2k-safe.mcp instead of 
> amik> > amik> > /opt/score/share/lanai/lanaiM2k.mcp (the default)?
> amik> > amik> > 
> amik> > amik> 
> amik> > amik> Yes, for N=120000 it takes forever with lanaiM2k-safe. Once, on a 
> amik> > amik> good day, the system managed to do N=100000 with a 347GF score. 
> amik> > amik> What we don't have is stability of the overall system.
> amik> > amik> 
> amik> > amik> I have tried N=120000 last night and the system had not finished
> amik> > amik> this morning.
> amik> > amik> 
> amik> > amik> I will retry N=100000 with lanaiM2k-safe (it took 33min when we were
> amik> > amik> using lanaiM2k).
> amik> > 
> amik> > Same situation...
> amik> > 
> amik> > Have you restarted scoreboard? 
> amik> > # /etc/rc.d/init.d/scoreboard reload
> amik> > 
> amik> > I would like to determine whether this problem lies in MPICH or PM/Myrinet.
> amik> > So, could you test HPL with the following patterns? 
> amik> > 
> amik> > 1) set BCASTs to 0 and 1
> amik> >   1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
> amik> > 
> amik> 
> amik> ---------------------
> amik> rg
> amik> ---------------------
> amik> 
> amik> 
> amik> | amik at stokes 09:01:14 Linux_ATHLON_CBLAS> scrun ./xhpl 
> amik> SCore-D 5.0.1 connected.
> amik> <0:0> SCORE: 256 nodes (128x2) ready.
> amik> ============================================================================
> amik> HPLinpack 1.0  --  High-Performance Linpack benchmark  --  September 27,
> amik> 2000
> amik> Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs., 
> amik> UTK
> amik> ============================================================================
> amik> 
> amik> An explanation of the input/output parameters follows:
> amik> T/V    : Wall time / encoded variant.
> amik> N      : The order of the coefficient matrix A.
> amik> NB     : The partitioning blocking factor.
> amik> P      : The number of process rows.
> amik> Q      : The number of process columns.
> amik> Time   : Time in seconds to solve the linear system.
> amik> Gflops : Rate of execution for solving the linear system.
> amik> 
> amik> The following parameter values will be used:
> amik> 
> amik> N      :   10000 
> amik> NB     :     100 
> amik> P      :      16 
> amik> Q      :      16 
> amik> PFACT  :   Crout 
> amik> NBMIN  :       1 
> amik> NDIV   :      16 
> amik> RFACT  :   Right 
> amik> BCAST  :   1ring 
> amik> DEPTH  :       1 
> amik> SWAP   : Mix (threshold = 16)
> amik> L1     : transposed form
> amik> U      : transposed form
> amik> EQUIL  : yes
> amik> ALIGN  : 8 double precision words
> amik> 
> amik> ----------------------------------------------------------------------------
> amik> 
> amik> - The matrix A is randomly generated for each test.
> amik> - The following scaled residual checks will be computed:
> amik>    1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
> amik>    2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
> amik>    3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
> amik> - The relative machine precision (eps) is taken to be         
> amik> 2.220446e-16
> amik> - Computational tests pass if scaled residuals are less than          
> amik> 16.0
> amik> 
> amik> ============================================================================
> amik> T/V                N    NB     P     Q               Time            
> amik> Gflops
> amik> ----------------------------------------------------------------------------
> amik> W10R16C1        10000   100    16    16               8.10         8.234e+01
> amik> ----------------------------------------------------------------------------
> amik> ||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0285008 ......
> amik> PASSED
> amik> ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0067441 ......
> amik> PASSED
> amik> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0015075 ......
> amik> PASSED
> amik> ============================================================================
> amik> 
> amik> Finished      1 tests with the following results:
> amik>               1 tests completed and passed residual checks,
> amik>               0 tests completed and failed residual checks,
> amik>               0 tests skipped because of illegal input values.
> amik> ----------------------------------------------------------------------------
> amik> 
> amik> End of Tests.
> amik> ============================================================================
> amik> 
> amik> ----------------------
> amik> rM
> amik> ----------------------
> amik> 
> amik> 
> amik> | amik at stokes 09:02:12 Linux_ATHLON_CBLAS> scrun ./xhpl 
> amik> SCore-D 5.0.1 connected.
> amik> <0:0> SCORE: 256 nodes (128x2) ready.
> amik> ============================================================================
> amik> HPLinpack 1.0  --  High-Performance Linpack benchmark  --  September 27,
> amik> 2000
> amik> Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs., 
> amik> UTK
> amik> ============================================================================
> amik> 
> amik> An explanation of the input/output parameters follows:
> amik> T/V    : Wall time / encoded variant.
> amik> N      : The order of the coefficient matrix A.
> amik> NB     : The partitioning blocking factor.
> amik> P      : The number of process rows.
> amik> Q      : The number of process columns.
> amik> Time   : Time in seconds to solve the linear system.
> amik> Gflops : Rate of execution for solving the linear system.
> amik> 
> amik> The following parameter values will be used:
> amik> 
> amik> N      :   10000 
> amik> NB     :     100 
> amik> P      :      16 
> amik> Q      :      16 
> amik> PFACT  :   Crout 
> amik> NBMIN  :       1 
> amik> NDIV   :      16 
> amik> RFACT  :   Right 
> amik> BCAST  :  1ringM 
> amik> DEPTH  :       1 
> amik> SWAP   : Mix (threshold = 16)
> amik> L1     : transposed form
> amik> U      : transposed form
> amik> EQUIL  : yes
> amik> ALIGN  : 8 double precision words
> amik> 
> amik> ----------------------------------------------------------------------------
> amik> 
> amik> - The matrix A is randomly generated for each test.
> amik> - The following scaled residual checks will be computed:
> amik>    1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
> amik>    2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
> amik>    3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
> amik> - The relative machine precision (eps) is taken to be         
> amik> 2.220446e-16
> amik> - Computational tests pass if scaled residuals are less than          
> amik> 16.0
> amik> 
> amik> ============================================================================
> amik> T/V                N    NB     P     Q               Time            
> amik> Gflops
> amik> ----------------------------------------------------------------------------
> amik> W11R16C1        10000   100    16    16               9.24         
> amik> 7.220e+01
> amik> ----------------------------------------------------------------------------
> amik> ||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0256112 ......
> amik> PASSED
> amik> ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0060604 ......
> amik> PASSED
> amik> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0013546 ......
> amik> PASSED
> amik> ============================================================================
> amik> 
> amik> Finished      1 tests with the following results:
> amik>               1 tests completed and passed residual checks,
> amik>               0 tests completed and failed residual checks,
> amik>               0 tests skipped because of illegal input values.
> amik> ----------------------------------------------------------------------------
> amik> 
> amik> End of Tests.
> amik> 
> amik> ---------------------
> amik> 2rM
> amik> ---------------------
> amik> | amik at stokes 09:04:14 Linux_ATHLON_CBLAS> scrun ./xhpl 
> amik> SCore-D 5.0.1 connected.
> amik> <0:0> SCORE: 256 nodes (128x2) ready.
> amik> ============================================================================
> amik> HPLinpack 1.0  --  High-Performance Linpack benchmark  --  September 27,
> amik> 2000
> amik> Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs., 
> amik> UTK
> amik> ============================================================================
> amik> 
> amik> An explanation of the input/output parameters follows:
> amik> T/V    : Wall time / encoded variant.
> amik> N      : The order of the coefficient matrix A.
> amik> NB     : The partitioning blocking factor.
> amik> P      : The number of process rows.
> amik> Q      : The number of process columns.
> amik> Time   : Time in seconds to solve the linear system.
> amik> Gflops : Rate of execution for solving the linear system.
> amik> 
> amik> The following parameter values will be used:
> amik> 
> amik> N      :   10000 
> amik> NB     :     100 
> amik> P      :      16 
> amik> Q      :      16 
> amik> PFACT  :   Crout 
> amik> NBMIN  :       1 
> amik> NDIV   :      16 
> amik> RFACT  :   Right 
> amik> BCAST  :  2ringM 
> amik> DEPTH  :       1 
> amik> SWAP   : Mix (threshold = 16)
> amik> L1     : transposed form
> amik> U      : transposed form
> amik> EQUIL  : yes
> amik> ALIGN  : 8 double precision words
> amik> 
> amik> ----------------------------------------------------------------------------
> amik> 
> amik> - The matrix A is randomly generated for each test.
> amik> - The following scaled residual checks will be computed:
> amik>    1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
> amik>    2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
> amik>    3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
> amik> - The relative machine precision (eps) is taken to be         
> amik> 2.220446e-16
> amik> - Computational tests pass if scaled residuals are less than          
> amik> 16.0
> amik> 
> amik> ============================================================================
> amik> T/V                N    NB     P     Q               Time            
> amik> Gflops
> amik> ----------------------------------------------------------------------------
> amik> W13R16C1        10000   100    16    16               9.60         
> amik> 6.947e+01
> amik> ----------------------------------------------------------------------------
> amik> ||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0287260 ......
> amik> PASSED
> amik> ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0067974 ......
> amik> PASSED
> amik> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0015194 ......
> amik> PASSED
> amik> ============================================================================
> amik> 
> amik> Finished      1 tests with the following results:
> amik>               1 tests completed and passed residual checks,
> amik>               0 tests completed and failed residual checks,
> amik>               0 tests skipped because of illegal input values.
> amik> ----------------------------------------------------------------------------
> amik> 
> amik> End of Tests.
> amik> ============================================================================
> amik> 
> amik> 
> amik> > 2) with mpi_zerocopy without mpi_eager option
> amik> > 
> amik> > scrun -nodes=128x2,mpi_zerocopy=on ./xhpl 
> amik> > 
> amik> 
> amik> ----------------------
> amik> rg
> amik> ----------------------
> amik> 
> amik> | amik at stokes 09:05:08 Linux_ATHLON_CBLAS> scrun -nodes=128x2,mpi_zerocopy=on ./xhpl 
> amik> SCore-D 5.0.1 connected.
> amik> <0:0> SCORE: 256 nodes (128x2) ready.
> amik> ============================================================================
> amik> HPLinpack 1.0  --  High-Performance Linpack benchmark  --  September 27,
> amik> 2000
> amik> Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs., 
> amik> UTK
> amik> ============================================================================
> amik> 
> amik> An explanation of the input/output parameters follows:
> amik> T/V    : Wall time / encoded variant.
> amik> N      : The order of the coefficient matrix A.
> amik> NB     : The partitioning blocking factor.
> amik> P      : The number of process rows.
> amik> Q      : The number of process columns.
> amik> Time   : Time in seconds to solve the linear system.
> amik> Gflops : Rate of execution for solving the linear system.
> amik> 
> amik> The following parameter values will be used:
> amik> 
> amik> N      :   10000 
> amik> NB     :     100 
> amik> P      :      16 
> amik> Q      :      16 
> amik> PFACT  :   Crout 
> amik> NBMIN  :       1 
> amik> NDIV   :      16 
> amik> RFACT  :   Right 
> amik> BCAST  :   1ring 
> amik> DEPTH  :       1 
> amik> SWAP   : Mix (threshold = 16)
> amik> L1     : transposed form
> amik> U      : transposed form
> amik> EQUIL  : yes
> amik> ALIGN  : 8 double precision words
> amik> 
> amik> ----------------------------------------------------------------------------
> amik> 
> amik> - The matrix A is randomly generated for each test.
> amik> - The following scaled residual checks will be computed:
> amik>    1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
> amik>    2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
> amik>    3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
> amik> - The relative machine precision (eps) is taken to be         
> amik> 2.220446e-16
> amik> - Computational tests pass if scaled residuals are less than          
> amik> 16.0
> amik> 
> amik> [117] Receive Error: myriIsReadDone: Bad address(14)(1000e): 000011d0
> amik> 00000041 0000c468 00001718 00007080 00001180 000052a8 08978000 
> amik> [117] Receive Error: myriIsReadDone: Bad address(14)(1000e): 000011d0
> amik> 00000041 0000c468 00007080 00007080 000011d0 00005968 0000c000 
> amik> comp_is_read_done(0x85bc7a0): pmIsReadDone(0x85bec48): Bad address(14)
> amik> <117:1> SCORE:WARNING MPICH/SCore: pmIsReadDone(pmc=0x85bc7a0) failed,
> amik> errno=14
> amik> <117:1> SCORE:PANIC MPICH/SCore: critical error on message transfer
> amik> <117:1> Trying to attach GDB (DISPLAY=localhost:10.0): PANIC
> amik> comp_is_read_done(0x85bc7a0): pmIsReadDone(0x85bec48): Bad address(14)
> amik> <117:0> SCORE:WARNING MPICH/SCore: pmIsReadDone(pmc=0x85bc7a0) failed,
> amik> errno=14
> amik> <117:0> SCORE:PANIC MPICH/SCore: critical error on message transfer
> amik> <117:0> Trying to attach GDB (DISPLAY=localhost:10.0): PANIC
> amik> SCORE: Program aborted.
> 
> This error shows that the PM/Myrinet firmware on the Myrinet NIC failed to
> read the page table of the destination user space.
> 
> If this error occurred on node #117 (cn118?), this may be a hardware
> problem on the node. 
> 
> amik>cn118
> amik>8	759960
> 
> amik> ----------------------
> amik> rM
> amik> ----------------------
> amik> 
> amik> 
> amik> | amik at stokes 09:06:31 Linux_ATHLON_CBLAS> scrun -nodes=128x2,mpi_zerocopy=on ./xhpl 
> amik> SCore-D 5.0.1 connected.
> amik> <0:0> SCORE: 256 nodes (128x2) ready.
> amik> ============================================================================
> amik> HPLinpack 1.0  --  High-Performance Linpack benchmark  --  September 27,
> amik> 2000
> amik> Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs., 
> amik> UTK
> amik> ============================================================================
> amik> 
> amik> An explanation of the input/output parameters follows:
> amik> T/V    : Wall time / encoded variant.
> amik> N      : The order of the coefficient matrix A.
> amik> NB     : The partitioning blocking factor.
> amik> P      : The number of process rows.
> amik> Q      : The number of process columns.
> amik> Time   : Time in seconds to solve the linear system.
> amik> Gflops : Rate of execution for solving the linear system.
> amik> 
> amik> The following parameter values will be used:
> amik> 
> amik> N      :   10000 
> amik> NB     :     100 
> amik> P      :      16 
> amik> Q      :      16 
> amik> PFACT  :   Crout 
> amik> NBMIN  :       1 
> amik> NDIV   :      16 
> amik> RFACT  :   Right 
> amik> BCAST  :  1ringM 
> amik> DEPTH  :       1 
> amik> SWAP   : Mix (threshold = 16)
> amik> L1     : transposed form
> amik> U      : transposed form
> amik> EQUIL  : yes
> amik> ALIGN  : 8 double precision words
> amik> 
> amik> ----------------------------------------------------------------------------
> amik> 
> amik> - The matrix A is randomly generated for each test.
> amik> - The following scaled residual checks will be computed:
> amik>    1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
> amik>    2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
> amik>    3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
> amik> - The relative machine precision (eps) is taken to be         
> amik> 2.220446e-16
> amik> - Computational tests pass if scaled residuals are less than          
> amik> 16.0
> amik> 
> amik> ============================================================================
> amik> T/V                N    NB     P     Q               Time            
> amik> Gflops
> amik> ----------------------------------------------------------------------------
> amik> W11R16C1        10000   100    16    16              10.18         
> amik> 6.550e+01
> amik> ----------------------------------------------------------------------------
> amik> ||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0300470 ......
> amik> PASSED
> amik> ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0071100 ......
> amik> PASSED
> amik> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0015892 ......
> amik> PASSED
> amik> ============================================================================
> amik> 
> amik> Finished      1 tests with the following results:
> amik>               1 tests completed and passed residual checks,
> amik>               0 tests completed and failed residual checks,
> amik>               0 tests skipped because of illegal input values.
> amik> ----------------------------------------------------------------------------
> amik> 
> amik> End of Tests.
> amik> ============================================================================
> amik> 
> amik> 
> amik> > 
> amik> > If these tests do not help with analysis of the situation, 
> amik> > I will send special Myrinet control programs.
> amik> > 
> amik> > By the way, you are using Athlon processors, so what chipset does the
> amik> > motherboard have? The early 760MP-X chipset behaves very curiously
> amik> > with zero-copy communication.
> amik> > 
> amik> 
> amik> Yes indeed, we have those.
> amik> 
> amik> > If you have time, I recommend running bustest (PCI DMA bandwidth
> amik> > performance measurement) and the communication tests for PM messages
> amik> > and zero-copy communication.
> amik> > 
> amik> 
> amik> We have time; we want this project to work. We will do it: I will ask our
> amik> technician to do the testing.
> amik> 
> amik> PM/Myrinet Test Procedure:
> amik> 1)
> amik> myrinet
> amik> 
> amik> | amik at stokes 09:24:26 sbin> ./rpminit cn1 myrinet
> amik> | amik at stokes 09:27:13 sbin> ./rpmtest cn1 myrinet -dest 0 -ping
> amik> 8	5.8716e-06
> amik> 
> amik> myrinet2k
> amik> | amik at stokes 10:14:19 sbin> ./rpminit cn1 myrinet2k
> amik> | amik at stokes 10:14:24 sbin> ./rpmtest cn1 myrinet2k -dest 0 -ping
> amik> 8	7.49937e-06
> amik> 
> amik> (see 3 (slow node 61))
> amik> | amik at stokes 10:15:46 sbin> ./rpminit cn61 myrinet2k
> amik> | amik at stokes 10:15:53 sbin> ./rpmtest cn61 myrinet2k -dest 60 -ping
> amik> 8	1.03231e-05
> amik> 
> amik> 2)
> amik> 
> amik> Myrinet:
> amik> 
> amik> Shell A:
> amik> | amik at stokes 09:29:47 sbin> ./rpmtest cn2 myrinet -reply    
> amik> 
> amik> Shell B:
> amik> | amik at stokes 09:30:57 sbin> ./rpmtest cn1 myrinet -dest 1 -ping 
> amik> 8	1.00363e-05
> amik> 
> amik> Myrinet2k:
> amik> Shell A:
> amik> ./rpmtest cn2 myrinet2k -reply
> amik> 
> amik> Shell B:
> amik> | amik at stokes 10:13:14 sbin> ./rpmtest cn1 myrinet2k -dest 1 -ping
> amik> 8	1.20059e-05
> amik> 
> amik> (see 3 (slow node 61))
> amik> Shell A:
> amik> | amik at stokes 10:18:09 sbin> ./rpmtest cn61 myrinet2k -reply 
> amik> 
> amik> Shell B:
> amik> | amik at stokes 10:18:27 sbin> ./rpmtest cn1 myrinet2k -dest 60 -ping
> amik> 8	1.4752e-05
> amik> 
> amik> 
> amik> 3)
> amik> Shell A:
> amik> | amik at stokes 09:56:39 sbin> ./rpmtest cn3 myrinet2k -vreply
> amik> 
> amik> Shell B:
> amik> | amik at stokes 10:18:33 sbin> cat ~/bin/do_all 
> amik> #!/bin/bash
> amik> 
> amik> liste=`(cat /etc/hosts | grep cn | awk '{print $3}')`
> amik> 
> amik> for i in $liste;do
> amik> echo $i
> amik> /opt/score/sbin/rpmtest $i myrinet2k -dest $1 -vwrite
> amik> done
> amik> 
> amik> 
> amik> | amik at stokes 09:53:47 sbin> do_all 2
> amik> cn1
> amik> 8	1.21528e+06
> amik> cn2
> amik> 8	1.21555e+06
> amik> cn3
> amik> myriMapLANai(0, 0xbffff794, 0): pm_open("/dev/pmmyri/0", O_RDWR, 0):
> amik> Device or resource busy(16)
> amik> myriOpenDevice("/var/scored/scoreboard/stokes.0000Z000C-WE",
> amik> "/var/scored/scoreboard/stokes.0000Z0006MGS", 0xbffff7d8):
> amik> myriMapLANai(0, 0xbffff794): Device or resource busy(16)
> amik> myri_open_device(0, 0xbffff9c0, 0x80adc80):
> amik> myriOpenDevice("/var/scored/scoreboard/stokes.0000Z000C-WE",
> amik> "/var/scored/scoreboard/stokes.0000Z0006MGS", 0xbffff7d8): Device or
> amik> resource busy(16)
> amik> pmOpenDevice: Device or resource busy(16)
> amik> cn4
> amik> 8	1.22084e+06
> amik> cn5
> amik> 8	1.22031e+06
> amik> cn6
> amik> 8	1.21936e+06
> amik> cn7
> amik> 8	1.21954e+06
> amik> cn8
> amik> 8	1.21908e+06
> amik> cn9
> amik> 8	1.17938e+06
> amik> cn10
> amik> 8	1.18443e+06
> amik> cn11
> amik> 8	1.18103e+06
> amik> cn12
> amik> 8	1.17963e+06
> amik> cn13
> amik> 8	1.18098e+06
> amik> cn14
> amik> 8	1.20557e+06
> amik> cn15
> amik> 8	1.17934e+06
> amik> cn16
> amik> 8	1.17928e+06
> amik> cn17
> amik> 8	1.17943e+06
> amik> cn18
> amik> 8	1.17936e+06
> amik> cn19
> amik> 8	1.17979e+06
> amik> cn20
> amik> 8	1.18086e+06
> amik> cn21
> amik> 8	1.17987e+06
> amik> cn22
> amik> 8	1.17724e+06
> amik> cn23
> amik> 8	1.18379e+06
> amik> cn24
> amik> 8	1.18124e+06
> amik> cn25
> amik> 8	1.17841e+06
> amik> cn26
> amik> 8	1.19677e+06
> amik> cn27
> amik> 8	1.17885e+06
> amik> cn28
> amik> 8	1.18305e+06
> amik> cn29
> amik> 8	1.18064e+06
> amik> cn30
> amik> 8	1.17977e+06
> amik> cn31
> amik> 8	1.17963e+06
> amik> cn32
> amik> 8	1.17994e+06
> amik> cn33
> amik> 8	1.17912e+06
> amik> cn34
> amik> 8	1.19678e+06
> amik> cn35
> amik> 8	1.20869e+06
> amik> cn36
> amik> 8	1.18085e+06
> amik> cn37
> amik> 8	1.18299e+06
> amik> cn38
> amik> 8	1.17977e+06
> amik> cn39
> amik> 8	1.18088e+06
> amik> cn40
> amik> 8	1.18038e+06
> amik> cn41
> amik> 8	1.17923e+06
> amik> cn42
> amik> 8	1.17943e+06
> amik> cn43
> amik> 8	1.18067e+06
> amik> cn44
> amik> 8	1.18168e+06
> amik> cn45
> amik> 8	1.18139e+06
> amik> cn46
> amik> 8	1.18099e+06
> amik> cn47
> amik> 8	1.18203e+06
> amik> cn48
> amik> 8	1.18082e+06
> amik> cn49
> amik> 8	1.17946e+06
> amik> cn50
> amik> 8	1.1803e+06
> amik> cn51
> amik> 8	1.1804e+06
> amik> cn52
> amik> 8	1.18191e+06
> amik> cn53
> amik> 8	1.18401e+06
> amik> cn54
> amik> 8	1.18331e+06
> amik> cn55
> amik> 8	1.18326e+06
> amik> cn56
> amik> 8	1.18173e+06
> amik> cn57
> amik> 8	1.18018e+06
> amik> cn58
> amik> 8	1.18091e+06
> amik> cn59
> amik> 8	1.18022e+06
> amik> cn60
> amik> 8	1.1818e+06
> amik> cn61
> amik> 8	755405
> amik> cn62
> amik> 8	1.18278e+06
> amik> cn63
> amik> 8	1.20865e+06
> amik> cn64
> amik> 8	1.18166e+06
> amik> cn65
> amik> 8	1.18006e+06
> amik> cn66
> amik> 8	1.17926e+06
> amik> cn67
> amik> 8	1.17955e+06
> amik> cn68
> amik> 8	1.20513e+06
> amik> cn69
> amik> 8	1.17968e+06
> amik> cn70
> amik> 8	1.17986e+06
> amik> cn71
> amik> 8	1.18094e+06
> amik> cn72
> amik> 8	1.18195e+06
> amik> cn73
> amik> 8	1.19813e+06
> amik> cn74
> amik> 8	1.18053e+06
> amik> cn75
> amik> 8	1.17967e+06
> amik> cn76
> amik> 8	1.17978e+06
> amik> cn77
> amik> 8	1.17905e+06
> amik> cn78
> amik> 8	1.18084e+06
> amik> cn79
> amik> 8	1.18611e+06
> amik> cn80
> amik> 8	755487
> amik> cn81
> amik> 8	1.17952e+06
> amik> cn82
> amik> 8	1.17863e+06
> amik> cn83
> amik> 8	1.17988e+06
> amik> cn84
> amik> 8	1.20254e+06
> amik> cn85
> amik> 8	1.20848e+06
> amik> cn86
> amik> 8	1.18039e+06
> amik> cn87
> amik> 8	1.17919e+06
> amik> cn88
> amik> 8	1.17975e+06
> amik> cn89
> amik> 8	755277
> amik> cn90
> amik> 8	1.1792e+06
> amik> cn91
> amik> 8	1.18081e+06
> amik> cn92
> amik> 8	1.17974e+06
> amik> cn93
> amik> 8	1.17997e+06
> amik> cn94
> amik> 8	1.17914e+06
> amik> cn95
> amik> 8	1.18022e+06
> amik> cn96
> amik> 8	1.17897e+06
> amik> cn97
> amik> 8	1.18451e+06
> amik> cn98
> amik> 8	1.18447e+06
> amik> cn99
> amik> 8	1.20977e+06
> amik> cn100
> amik> 8	1.18888e+06
> amik> cn101
> amik> 8	1.18841e+06
> amik> cn102
> amik> 8	1.18935e+06
> amik> cn103
> amik> 8	1.18774e+06
> amik> cn104
> amik> 8	1.18739e+06
> amik> cn105
> amik> 8	1.20806e+06
> amik> cn106
> amik> 8	1.18413e+06
> amik> cn107
> amik> 8	1.18674e+06
> amik> cn108
> amik> 8	1.1868e+06
> amik> cn109
> amik> 8	1.21215e+06
> amik> cn110
> amik> 8	755543
> amik> cn111
> amik> 8	1.18941e+06
> amik> cn112
> amik> 8	1.18727e+06
> amik> cn113
> amik> 8	1.18388e+06
> amik> cn114
> amik> 8	1.18409e+06
> amik> cn115
> amik> 8	1.18515e+06
> amik> cn116
> amik> 8	1.18756e+06
> amik> cn117
> amik> 8	1.18788e+06
> amik> cn118
> amik> 8	759960
> amik> cn119
> amik> 8	1.2122e+06
> amik> cn120
> amik> 8	1.18718e+06
> amik> cn121
> amik> 8	1.18379e+06
> amik> cn122
> amik> 8	1.18409e+06
> amik> cn123
> amik> 8	1.18601e+06
> amik> cn124
> amik> 8	1.18686e+06
> amik> cn125
> amik> 8	1.18783e+06
> amik> cn126
> amik> 8	1.18639e+06
> amik> cn127
> amik> 8	1.18707e+06
> amik> cn128
> amik> 8	1.18881e+06
> amik> 
> amik> Nodes 61,80,89,110,118 are slow.
> amik> I have repeated the test with ./rpmtest cn2 myrinet2k -vreply and
> amik> the same nodes are slow : 61,80,89,110,118 (!)
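
Picking the slow nodes out of a listing like the one above can be
automated. A minimal sketch, assuming the raw do_all output (without the
mail quoting prefixes) is fed on stdin, and treating "slow" as below 80% of
the best node. Both flag_slow and the 0.8 cutoff are illustrative choices,
not part of SCore:

```shell
#!/bin/bash
# flag_slow is a hypothetical helper, not part of SCore or rpmtest.
# Input: do_all output on stdin (a node name line, then "<len> <value>").
# Prints every node whose value is below frac * (best value seen).
flag_slow() {
    awk -v frac="${1:-0.8}" '
        /^cn[0-9]+$/ { node = $1; next }            # remember current node
        NF == 2 && node != "" && ($2 + 0) > 0 {     # result line for it
            n++; name[n] = node; val[n] = $2 + 0
            if (val[n] > max) max = val[n]
            node = ""
        }
        END {
            for (i = 1; i <= n; i++)
                if (val[i] < frac * max) print name[i], val[i]
        }'
}
```

For example, "do_all 2 | flag_slow" would list only the outliers. Nodes
that print an error instead of a result line, such as cn3 above, are
skipped rather than reported as slow.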
> amik> 
> amik> 
> amik> 4)
> amik> 
> amik> | amik at stokes 09:55:02 deploy> ./scstest -network myrinet2k
> amik> SCSTEST: BURST on myrinet2k(chan=0,ctx=0,len=16)
> amik> 50 K packets.
> amik> 100 K packets.
> amik> 150 K packets.
> amik> 200 K packets.
> amik> 250 K packets.
> amik> 300 K packets.
> amik> ...
> amik> 7150 K packets.
> amik> etc
> amik> 
> amik> Seems to work fine.
> amik> 
> ------
> Shinji Sumimoto, Fujitsu Labs
-- 
_____________________________________________________
Dr. A. St-Cyr
Research Associate, CFD Lab
Department of Mechanical Engineering
McGill University
688 Sherbrooke Street West, 7th floor
Montreal, Qc, Canada H3A 2S6
Tel: +1 (514) 398-1710, Admin. Fax : 2203
amik at cfdlab.mcgill.ca
_____________________________________________________



