[SCore-users-jp] Re: [SCore-users] score 5.0.1 large memory jobs
Nick Birkett
nrcb @ streamline-computing.com
Thu Feb 27 19:29:11 JST 2003
Sorry, the message I sent was truncated and therefore confusing.
Here it is again:
Score 5.0.1, Myrinet 2000 system.
---------message from user -------------------------------
Following the addition of swap on all Snowdon compute nodes, I reran the
PALLAS benchmark tests (on 64 nodes running 2 processes per node). The
following output was recorded towards the end of the run:
#----------------------------------------------------------------
# Benchmarking Alltoall
# ( #processes = 64 )
# ( 64 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 784.39 784.86 784.79
1 1000 792.01 792.18 792.10
2 1000 785.42 785.78 785.68
4 1000 796.99 797.23 797.14
8 1000 800.98 801.11 801.06
16 1000 778.84 779.19 779.11
32 1000 787.78 788.14 788.03
64 1000 821.54 821.79 821.66
128 1000 881.18 881.38 881.30
256 1000 952.46 952.64 952.56
512 1000 1158.49 1159.00 1158.88
1024 1000 1640.78 1644.24 1641.00
2048 1000 3454.18 3454.95 3454.62
4096 1000 6882.82 6884.97 6883.97
8192 1000 16088.81 16094.80 16091.81
16384 1000 33715.59 33732.60 33727.56
32768 1000 65014.80 65027.62 65023.50
65536 640 129590.04 129636.99 129623.44
131072 320 263434.38 263628.56 263587.57
262144 160 531708.42 532274.39 532124.75
524288 80 1069253.25 1071251.60 1070571.90
1048576 40 2173875.02 2187574.55 2184477.23
2097152 20 4228944.70 4270372.05 4258162.98
4194304 10 8398147.40 8512784.40 8478838.18
<8> SCore-D:PANIC Network freezing timed out !!
And the .e file states:
<0:0> SCORE: 128 nodes (64x2) ready.
<56:1> SCORE:WARNING MPICH/SCore: pmGetSendBuffer(pmc=0x8541db8, dest=37,
len=8256) failed, errno=22
<56:1> SCORE:PANIC MPICH/SCore: critical error on message transfer
<56:1> Trying to attach GDB (DISPLAY=snowdon.leeds.ac.uk:18.0): PANIC
SCOUT: Session done.
It looks like the memory allocation is now working fine, but the benchmark
is unable to proceed to the next test.
The next test is an all-to-all with zero-length messages across 128 processes
(on 64 nodes). Extrapolating the results above, this should take about 1.6 ms.
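The ~1.6 ms figure can be reproduced with a back-of-the-envelope extrapolation from the 0-byte, 64-process result in the table above (a sketch only; the linear-in-process-count scaling is my assumption, not something the benchmark guarantees):

```python
# Measured t_avg for the 0-byte Alltoall with 64 processes, in seconds
# (from the benchmark table above).
t_64 = 784.79e-6

def extrapolate(t_measured, p_measured, p_target):
    """Scale an Alltoall latency linearly in the number of processes
    (an assumed scaling model, for a rough estimate only)."""
    return t_measured * p_target / p_measured

t_128 = extrapolate(t_64, 64, 128)
print(f"predicted 0-byte Alltoall time for 128 processes: {t_128 * 1e3:.2f} ms")
```

Under that assumption the prediction is roughly 1.57 ms, consistent with the ~1.6 ms quoted above.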
It appears that communications grind to a halt when we try to communicate
between 128 or more processes while running 2 processes per node.
_______________________________________________
SCore-users mailing list
SCore-users @ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users
Information about the SCore-users-jp mailing list