[SCore-users] SCore 5.0.1 large memory jobs

Nick Birkett nrcb at streamline-computing.com
Thu Feb 27 19:29:11 JST 2003


Sorry, the message I sent was truncated and therefore confusing.

Here it is again:

SCore 5.0.1, Myrinet 2000 system.

---------message from user -------------------------------


Following the addition of swap on all Snowdon compute nodes, I reran the
PALLAS benchmark tests (on 64 nodes running 2 processes per node). The
following output was recorded towards the end of the run:

#----------------------------------------------------------------
# Benchmarking Alltoall 
# ( #processes = 64 ) 
# ( 64 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000       784.39       784.86       784.79
            1         1000       792.01       792.18       792.10
            2         1000       785.42       785.78       785.68
            4         1000       796.99       797.23       797.14
            8         1000       800.98       801.11       801.06
           16         1000       778.84       779.19       779.11
           32         1000       787.78       788.14       788.03
           64         1000       821.54       821.79       821.66
          128         1000       881.18       881.38       881.30
          256         1000       952.46       952.64       952.56
          512         1000      1158.49      1159.00      1158.88
         1024         1000      1640.78      1644.24      1641.00
         2048         1000      3454.18      3454.95      3454.62
         4096         1000      6882.82      6884.97      6883.97
         8192         1000     16088.81     16094.80     16091.81
        16384         1000     33715.59     33732.60     33727.56
        32768         1000     65014.80     65027.62     65023.50
        65536          640    129590.04    129636.99    129623.44
       131072          320    263434.38    263628.56    263587.57
       262144          160    531708.42    532274.39    532124.75
       524288           80   1069253.25   1071251.60   1070571.90
      1048576           40   2173875.02   2187574.55   2184477.23
      2097152           20   4228944.70   4270372.05   4258162.98
      4194304           10   8398147.40   8512784.40   8478838.18
<8> SCore-D:PANIC Network freezing timed out !!
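
For context, the Alltoall figures above come from a timing loop of the general shape sketched below. This is only a simplified illustration of what such a test measures (one fixed message size, no warm-up, simplified reporting), not the PALLAS source:

/* Simplified sketch of an MPI_Alltoall timing test (not the PALLAS code).
 * Each rank sends `bytes` bytes to every rank, `reps` times, and the
 * average time per call is reported. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    int reps  = 1000;   /* #repetitions, as in the table above */
    int bytes = 8192;   /* per-destination message size in bytes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *sendbuf = malloc((size_t)bytes * nprocs);
    char *recvbuf = malloc((size_t)bytes * nprocs);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Alltoall(sendbuf, bytes, MPI_BYTE,
                     recvbuf, bytes, MPI_BYTE, MPI_COMM_WORLD);
    double t = (MPI_Wtime() - t0) / reps;

    /* PALLAS reports min/max/avg of this per-rank time across all ranks */
    printf("rank %d: %d bytes, %.2f usec per Alltoall\n", rank, bytes, t * 1e6);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}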

And the .e file states:

<0:0> SCORE: 128 nodes (64x2) ready.
<56:1> SCORE:WARNING MPICH/SCore: pmGetSendBuffer(pmc=0x8541db8, dest=37,
len=8256) failed, errno=22
<56:1> SCORE:PANIC MPICH/SCore: critical error on message transfer
<56:1> Trying to attach GDB (DISPLAY=snowdon.leeds.ac.uk:18.0): PANIC
SCOUT: Session done.

It looks like the memory allocation is now working fine, but the run dies
before the benchmark can start its next test.

The next test is an Alltoall with zero-length messages across all 128
processes (on 64 nodes). Extrapolating from the 64-process results, this
should take about 1.6 ms.
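
(A rough reconstruction of that extrapolation, assuming the zero-byte
Alltoall time grows roughly linearly with the number of processes:

    $t_{128} \approx t_{64} \times \tfrac{128}{64} \approx 784\,\mu\mathrm{s} \times 2 \approx 1.6\,\mathrm{ms}$

i.e. about twice the 784 usec measured above for 64 processes.)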

It appears that communication grinds to a halt as soon as we try to
communicate among 128 or more processes while running 2 processes per
node.


