[SCore-users] score 5.0.1 large memory jobs
Shinji Sumimoto
s-sumi at flab.fujitsu.co.jp
Fri Feb 28 12:59:18 JST 2003
Hi Nick.
Could you run the benchmark with PM_DEBUG=1 set?
If there are PM/Myrinet problems, diagnostic messages will be output.
Ex (sh):
    export PM_DEBUG=1
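
With PM_DEBUG set in the environment, rerunning just the failing case
might look like the sketch below (the SCOUT group name "pcc" and the
PMB-MPI1 binary name/location are assumptions about your local setup):

    export PM_DEBUG=1                        # csh: setenv PM_DEBUG 1
    scout -g pcc                             # open a SCOUT session
    scrun -nodes=64x2 ./PMB-MPI1 Alltoall    # 64 nodes x 2 processes, Alltoall only
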
Shinji.
From: Nick Birkett <nrcb at streamline-computing.com>
Subject: Re: [SCore-users] score 5.0.1 large memory jobs
Date: Thu, 27 Feb 2003 10:29:11 +0000
Message-ID: <200302271029.h1RATB801837 at zeralda.streamline.com>
nrcb> Sorry, the message I sent was truncated and therefore confusing.
nrcb>
nrcb> Here it is again:
nrcb>
nrcb> Score 5.0.1, Myrinet 2000 system.
nrcb>
nrcb> ---------message from user -------------------------------
nrcb>
nrcb>
nrcb> Following the addition of swap on all Snowdon compute nodes, I reran the
nrcb> PALLAS benchmark tests (on 64 nodes running 2 processes per node). The
nrcb> following output was recorded towards the end of the run:
nrcb>
nrcb> #----------------------------------------------------------------
nrcb> # Benchmarking Alltoall
nrcb> # ( #processes = 64 )
nrcb> # ( 64 additional processes waiting in MPI_Barrier)
nrcb> #----------------------------------------------------------------
nrcb>   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
nrcb>        0         1000       784.39       784.86       784.79
nrcb>        1         1000       792.01       792.18       792.10
nrcb>        2         1000       785.42       785.78       785.68
nrcb>        4         1000       796.99       797.23       797.14
nrcb>        8         1000       800.98       801.11       801.06
nrcb>       16         1000       778.84       779.19       779.11
nrcb>       32         1000       787.78       788.14       788.03
nrcb>       64         1000       821.54       821.79       821.66
nrcb>      128         1000       881.18       881.38       881.30
nrcb>      256         1000       952.46       952.64       952.56
nrcb>      512         1000      1158.49      1159.00      1158.88
nrcb>     1024         1000      1640.78      1644.24      1641.00
nrcb>     2048         1000      3454.18      3454.95      3454.62
nrcb>     4096         1000      6882.82      6884.97      6883.97
nrcb>     8192         1000     16088.81     16094.80     16091.81
nrcb>    16384         1000     33715.59     33732.60     33727.56
nrcb>    32768         1000     65014.80     65027.62     65023.50
nrcb>    65536          640    129590.04    129636.99    129623.44
nrcb>   131072          320    263434.38    263628.56    263587.57
nrcb>   262144          160    531708.42    532274.39    532124.75
nrcb>   524288           80   1069253.25   1071251.60   1070571.90
nrcb>  1048576           40   2173875.02   2187574.55   2184477.23
nrcb>  2097152           20   4228944.70   4270372.05   4258162.98
nrcb>  4194304           10   8398147.40   8512784.40   8478838.18
nrcb> <8> SCore-D:PANIC Network freezing timed out !!
nrcb>
nrcb> And the .e file states:
nrcb>
nrcb> <0:0> SCORE: 128 nodes (64x2) ready.
nrcb> <56:1> SCORE:WARNING MPICH/SCore: pmGetSendBuffer(pmc=0x8541db8, dest=37, len=8256) failed, errno=22
nrcb> <56:1> SCORE:PANIC MPICH/SCore: critical error on message transfer
nrcb> <56:1> Trying to attach GDB (DISPLAY=snowdon.leeds.ac.uk:18.0): PANIC
nrcb> SCOUT: Session done.
nrcb>
nrcb> It looks like the memory allocation is now working fine, but the
nrcb> benchmark is unable to proceed to the next test.
nrcb>
nrcb> The next test is an all-to-all with zero-length messages across 128
nrcb> processes (on 64 nodes). Extrapolating from the 64-process results
nrcb> (the 0-byte Alltoall time should roughly double with the process
nrcb> count: 2 x 784 usec), this should take about 1.6 ms.
nrcb>
nrcb> It appears that communications grind to a halt when we try to
nrcb> communicate among 128 or more processes while running 2 processes
nrcb> per node.
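P.S. For reference, errno=22 in the pmGetSendBuffer warning is EINVAL
("Invalid argument") on Linux. A quick way to check an errno name on
any node:

    # prints "Invalid argument"
    perl -e '$! = 22; print "$!\n"'
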
------
Shinji Sumimoto, Fujitsu Labs