[SCore-users] score 5.0.1 large memory jobs
Shinji Sumimoto
s-sumi at flab.fujitsu.co.jp
Fri Feb 28 12:59:18 JST 2003
Hi Nick.
Could you run the benchmark with PM_DEBUG=1 set?
If there are PM/Myrinet problems, diagnostic messages will be output.
Ex (sh):
    export PM_DEBUG=1
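
With PM_DEBUG set in the environment, rerunning just the failing case
might look like the sketch below (the SCOUT group name "pcc" and the
PMB-MPI1 binary name/location are assumptions about your local setup):

    export PM_DEBUG=1                        # csh: setenv PM_DEBUG 1
    scout -g pcc                             # open a SCOUT session
    scrun -nodes=64x2 ./PMB-MPI1 Alltoall    # 64 nodes x 2 processes, Alltoall only
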
Shinji.
From: Nick Birkett <nrcb at streamline-computing.com>
Subject: Re: [SCore-users] score 5.0.1 large memory jobs
Date: Thu, 27 Feb 2003 10:29:11 +0000
Message-ID: <200302271029.h1RATB801837 at zeralda.streamline.com>
nrcb> Sorry, the message I sent was truncated and therefore confusing.
nrcb>
nrcb> Here it is again:
nrcb>
nrcb> Score 5.0.1, Myrinet 2000 system.
nrcb>
nrcb> ---------message from user -------------------------------
nrcb>
nrcb>
nrcb> Following the addition of swap on all Snowdon compute nodes, I reran the
nrcb> PALLAS benchmark tests (on 64 nodes running 2 processes per node). The
nrcb> following output was recorded towards the end of the run:
nrcb>
nrcb> #----------------------------------------------------------------
nrcb> # Benchmarking Alltoall
nrcb> # ( #processes = 64 )
nrcb> # ( 64 additional processes waiting in MPI_Barrier)
nrcb> #----------------------------------------------------------------
nrcb>   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
nrcb>        0         1000       784.39       784.86       784.79
nrcb>        1         1000       792.01       792.18       792.10
nrcb>        2         1000       785.42       785.78       785.68
nrcb>        4         1000       796.99       797.23       797.14
nrcb>        8         1000       800.98       801.11       801.06
nrcb>       16         1000       778.84       779.19       779.11
nrcb>       32         1000       787.78       788.14       788.03
nrcb>       64         1000       821.54       821.79       821.66
nrcb>      128         1000       881.18       881.38       881.30
nrcb>      256         1000       952.46       952.64       952.56
nrcb>      512         1000      1158.49      1159.00      1158.88
nrcb>     1024         1000      1640.78      1644.24      1641.00
nrcb>     2048         1000      3454.18      3454.95      3454.62
nrcb>     4096         1000      6882.82      6884.97      6883.97
nrcb>     8192         1000     16088.81     16094.80     16091.81
nrcb>    16384         1000     33715.59     33732.60     33727.56
nrcb>    32768         1000     65014.80     65027.62     65023.50
nrcb>    65536          640    129590.04    129636.99    129623.44
nrcb>   131072          320    263434.38    263628.56    263587.57
nrcb>   262144          160    531708.42    532274.39    532124.75
nrcb>   524288           80   1069253.25   1071251.60   1070571.90
nrcb>  1048576           40   2173875.02   2187574.55   2184477.23
nrcb>  2097152           20   4228944.70   4270372.05   4258162.98
nrcb>  4194304           10   8398147.40   8512784.40   8478838.18
nrcb> <8> SCore-D:PANIC Network freezing timed out !!
nrcb>
nrcb> And the .e file states:
nrcb>
nrcb> <0:0> SCORE: 128 nodes (64x2) ready.
nrcb> <56:1> SCORE:WARNING MPICH/SCore: pmGetSendBuffer(pmc=0x8541db8, dest=37, len=8256) failed, errno=22
nrcb> <56:1> SCORE:PANIC MPICH/SCore: critical error on message transfer
nrcb> <56:1> Trying to attach GDB (DISPLAY=snowdon.leeds.ac.uk:18.0): PANIC
nrcb> SCOUT: Session done.
nrcb>
nrcb> It looks like the memory allocation is now working fine, but the
nrcb> benchmark is unable to proceed to the next test.
nrcb>
nrcb> The next test is an all-to-all with zero-length messages across 128
nrcb> processes (on 64 nodes). Extrapolating from the 64-process results
nrcb> (the 0-byte Alltoall time should roughly double with the process
nrcb> count: 2 x 784 usec), this should take about 1.6 ms.
nrcb>
nrcb> It appears that communications grind to a halt when we try to
nrcb> communicate among 128 or more processes while running 2 processes
nrcb> per node.
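P.S. For reference, errno=22 in the pmGetSendBuffer warning is EINVAL
("Invalid argument") on Linux. A quick way to check an errno name on
any node:

    # prints "Invalid argument"
    perl -e '$! = 22; print "$!\n"'
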
------
Shinji Sumimoto, Fujitsu Labs