[SCore-users-jp] Re: [SCore-users] Copper Myrinet pm problems

Shinji Sumimoto s-sumi @ bd6.so-net.ne.jp
Sun Feb 2 09:30:18 JST 2003


Hi.

Could you check for CRC errors on the cluster nodes?

Shinji.

From: Nick Birkett <nrcb @ streamline-computing.com>
Subject: [SCore-users] Copper Myrinet pm problems
Date: Sat, 1 Feb 2003 18:17:25 +0000
Message-ID: <200302011817.h11IHPK10399 @ zeralda.streamline.com>

nrcb> Hi, we have just upgraded one of our older clusters from SCore 4.x to 5.0.1
nrcb> (I think it was the first SCore to support Myrinet 2000, about 18 months ago, 
nrcb> on a RedHat 6.2 dist).
nrcb> 
nrcb> The cluster was working more or less ok under the old SCore system.
nrcb> 
nrcb> The entire system has been re-installed as RedHat 7.2 + SCore 5.0.1.
nrcb> 
nrcb> Hardware: Copper-based Myrinet2k (May 2001) and Pentium III dual 866 MHz 
nrcb> SuperMicro 1U Superservers.
nrcb> 
nrcb> I have run rpmtest and the scstest -network myrinet2k for many hours over all 
nrcb> compute nodes without problems.
nrcb> 
nrcb> I have run gm1.6.3 codes (e.g. PMB) and they work fine.
nrcb> 
nrcb> SCore PM codes are having problems over Myrinet, e.g. when running PMB:
nrcb> 
nrcb> <6:0> SCORE:WARNING MPICH/SCore    [buffer=0x8951498, type=1025, from=11, 
nrcb> size=262144, offset=189520]
nrcb> <6:0> SCORE:WARNING MPICH/SCore: receive-message-queue:
nrcb> <6:0> SCORE:WARNING MPICH/SCore    (empty)
nrcb> <6:0> SCORE:WARNING MPICH/SCore: received-fragment:
nrcb> <6:0> SCORE:WARNING MPICH/SCore    [buffer=0x40066180, type=1025, from=11, 
nrcb> size=262144, fragment_size=8240, offset=189521]
nrcb> <6:0> SCORE:WARNING MPICH/SCore: queued-message:
nrcb> <6:0> SCORE:WARNING MPICH/SCore    [buffer=0x8951498, type=1025, from=11, 
nrcb> size=262144, offset=189520]
nrcb> <6:0> SCORE:WARNING MPICH/SCore: received an invalid fragment (mismatched 
nrcb> offset)
nrcb> <6:0> SCORE:PANIC MPICH/SCore: critical error on message transfer
nrcb> <6:0> Trying to attach GDB (DISPLAY=localhost:10.0): PANIC
nrcb> SCORE: Program aborted.
nrcb> SCOUT: Session done.
nrcb> 
nrcb> 
nrcb> Lots of buffer mismatch errors. The same binary runs fine over ethernet or 
nrcb> gigabit on the same hardware (i.e. if I add the -network=ethernet option then
nrcb> all is ok, so it is a Myrinet problem).
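
The dump quoted above reports a queued message at offset 189520 and a received fragment at offset 189521. A minimal sketch of pulling those two values out of such a dump and comparing them follows. It assumes the log format exactly as quoted (with the mail quoting and line wrapping removed); the helper names section_fields and offset_delta are hypothetical and are not part of SCore or MPICH.

```python
import re

# Two sections of the dump quoted above, with mail quoting and wrapping removed.
LOG = """\
<6:0> SCORE:WARNING MPICH/SCore: received-fragment:
<6:0> SCORE:WARNING MPICH/SCore    [buffer=0x40066180, type=1025, from=11, size=262144, fragment_size=8240, offset=189521]
<6:0> SCORE:WARNING MPICH/SCore: queued-message:
<6:0> SCORE:WARNING MPICH/SCore    [buffer=0x8951498, type=1025, from=11, size=262144, offset=189520]
"""


def section_fields(log, section):
    """Return the key=value pairs of the first [...] record after 'section:'."""
    match = re.search(re.escape(section) + r":.*?\[([^\]]*)\]", log, re.DOTALL)
    if match is None:
        raise ValueError("section not found: " + section)
    fields = {}
    for item in match.group(1).split(","):
        key, _, value = item.strip().partition("=")
        fields[key] = value
    return fields


def offset_delta(log):
    """Fragment offset minus queued-message offset (hypothetical helper)."""
    fragment = section_fields(log, "received-fragment")
    queued = section_fields(log, "queued-message")
    return int(fragment["offset"]) - int(queued["offset"])


if __name__ == "__main__":
    print(offset_delta(LOG))  # prints 1 for the dump above
```

For this dump the difference is 1, which is not a multiple of the reported fragment_size of 8240. Reading that as a value altered in transit rather than a lost fragment is only an assumption; counting CRC errors on the nodes is the direct check.
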
nrcb> 
nrcb> We would like to keep SCore as the cluster has some new Xeon Gigabit nodes
nrcb> but will have to convert to GM if we cannot resolve this.
nrcb> 
nrcb> Looks like a hardware problem (same code runs fine over SCore 5.0.1 and fibre
nrcb> optic Myrinet 2k on Intel Xeon systems).
nrcb> 
nrcb> Thanks,
nrcb> 
nrcb> Nick
nrcb> 
nrcb> _______________________________________________
nrcb> SCore-users mailing list
nrcb> SCore-users @ pccluster.org
nrcb> http://www.pccluster.org/mailman/listinfo/score-users
nrcb> 
-----
Shinji Sumimoto    E-Mail: s-sumi @ bd6.so-net.ne.jp


