[SCore-users-jp] Re: [SCore-users] Copper Myrinet pm problems
Shinji Sumimoto
s-sumi @ bd6.so-net.ne.jp
Sun Feb 2 09:30:18 JST 2003
Hi.
Could you check for CRC errors on the cluster nodes?
Shinji.
From: Nick Birkett <nrcb @ streamline-computing.com>
Subject: [SCore-users] Copper Myrinet pm problems
Date: Sat, 1 Feb 2003 18:17:25 +0000
Message-ID: <200302011817.h11IHPK10399 @ zeralda.streamline.com>
nrcb> Hi, we have just upgraded one of our older clusters to SCore 5.0.1 from 4.x
nrcb> (I think it was the first SCore to support Myrinet 2000 from 18 months ago -
nrcb> RedHat 6.2 dist).
nrcb>
nrcb> The cluster was working more or less ok under the old SCore system.
nrcb>
nrcb> The entire system has been re-installed as RedHat 7.2 + SCore 5.0.1.
nrcb>
nrcb> Hardware: Copper based Myrinet2k (May 2001) and Pentium III dual 866 MHz
nrcb> SuperMicro 1U Superservers.
nrcb>
nrcb> I have run rpmtest and the scstest -network myrinet2k for many hours over all
nrcb> compute nodes without problems.
nrcb>
nrcb> I have run gm1.6.3 codes (e.g. PMB) and they work fine.
nrcb>
nrcb> SCore PM codes are having problems over Myrinet, e.g. when running PMB:
nrcb>
nrcb> <6:0> SCORE:WARNING MPICH/SCore [buffer=0x8951498, type=1025, from=11,
nrcb> size=262144, offset=189520]
nrcb> <6:0> SCORE:WARNING MPICH/SCore: receive-message-queue:
nrcb> <6:0> SCORE:WARNING MPICH/SCore (empty)
nrcb> <6:0> SCORE:WARNING MPICH/SCore: received-fragment:
nrcb> <6:0> SCORE:WARNING MPICH/SCore [buffer=0x40066180, type=1025, from=11,
nrcb> size=262144, fragment_size=8240, offset=189521]
nrcb> <6:0> SCORE:WARNING MPICH/SCore: queued-message:
nrcb> <6:0> SCORE:WARNING MPICH/SCore [buffer=0x8951498, type=1025, from=11,
nrcb> size=262144, offset=189520]
nrcb> <6:0> SCORE:WARNING MPICH/SCore: received an invalid fragment (mismatched
nrcb> offset)
nrcb> <6:0> SCORE:PANIC MPICH/SCore: critical error on message transfer
nrcb> <6:0> Trying to attach GDB (DISPLAY=localhost:10.0): PANIC
nrcb> SCORE: Program aborted.
nrcb> SCOUT: Session done.
nrcb>
nrcb>
nrcb> Lots of buffer mismatch errors. The same binary runs fine over ethernet or
nrcb> gigabit on the same hardware (i.e. if I add the -network=ethernet option then
nrcb> all is ok, so it is a Myrinet problem).
nrcb>
nrcb> We would like to keep SCore as the cluster has some new Xeon Gigabit nodes
nrcb> but will have to convert to GM if we cannot resolve this.
nrcb>
nrcb> It looks like a hardware problem (the same code runs fine over SCore 5.0.1 and
nrcb> fibre optic Myrinet 2k on Intel Xeon systems).
nrcb>
nrcb> Thanks,
nrcb>
nrcb> Nick
nrcb>
nrcb> _______________________________________________
nrcb> SCore-users mailing list
nrcb> SCore-users @ pccluster.org
nrcb> http://www.pccluster.org/mailman/listinfo/score-users
nrcb>
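For readers following the quoted log: the "mismatched offset" warning reports a queued message at offset=189520 and a received fragment at offset=189521. A minimal sketch of that kind of check is below. It is not the MPICH/SCore implementation; the names QueuedMessage, accept and FragmentOffsetError are hypothetical, and treating fragment_size as the number of bytes the fragment advances the offset is an assumption made only for illustration.

```python
from dataclasses import dataclass


class FragmentOffsetError(Exception):
    """Raised when a fragment does not start where the receiver expects."""


@dataclass
class QueuedMessage:
    # Hypothetical receiver-side state for one in-flight message.
    size: int        # total message size in bytes
    offset: int = 0  # number of bytes accepted so far

    def accept(self, frag_offset: int, frag_size: int) -> bool:
        """Accept one fragment; return True once the message is complete."""
        # Fragments must arrive contiguously: each one has to start
        # exactly at the offset the receiver has reached.
        if frag_offset != self.offset:
            raise FragmentOffsetError(
                f"mismatched offset: expected {self.offset}, got {frag_offset}"
            )
        # A fragment must not run past the end of the message.
        if self.offset + frag_size > self.size:
            raise FragmentOffsetError("fragment overruns message size")
        self.offset += frag_size
        return self.offset == self.size


if __name__ == "__main__":
    # Values taken from the quoted log.
    msg = QueuedMessage(size=262144, offset=189520)
    try:
        msg.accept(frag_offset=189521, frag_size=8240)
    except FragmentOffsetError as err:
        print("invalid fragment:", err)
```

An offset that is wrong by a single byte, as in the log, is consistent with data being altered in transit rather than with a fragment being lost, which is one reason to look at CRC error counters first.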
-----
Shinji Sumimoto E-Mail: s-sumi @ bd6.so-net.ne.jp