[SCore-users-jp] [SCore-users] invalid fragment error

Nick Birkett nrcb @ streamline-computing.com
Fri May 9 16:01:54 JST 2003


Hi, I have just upgraded a customer's machine from 8 nodes to 12 nodes.

Previously it had 8 copper myrinet2k nodes using SCore 5.0.1 and worked
without problems.
I added 4 nodes with new fibre myrinet2k cards and added a fibre optic spine
card to the switch, so the switch now has one copper line card and one fibre
spine card.

I ran the scstest over all nodes successfully. 

Now we get an intermittent problem. Most of the jobs run fine, but sometimes
we get a strange error.

This is from the Pallas PMB benchmarks running on 24 CPUs over myrinet2k.

SCOUT: Spawning                    com
p03.fdone
.
<0:0> SCORE: 24 nodes (12x2) ready.
<0:0> SCORE:WARNING MPICH/SCore: receive-request-queue:
<0:0> SCORE:WARNING MPICH/SCore    [buffer=0x91645b0, type=1025, from=2, size=262144, offset=8240]
<0:0> SCORE:WARNING MPICH/SCore: receive-message-queue:
<0:0> SCORE:WARNING MPICH/SCore    (empty)
<0:0> SCORE:WARNING MPICH/SCore: received-fragment:
<0:0> SCORE:WARNING MPICH/SCore    [buffer=0x40069d28, type=1213202598, from=1213202854, size=1213203110, fragment_size=8240, offset=1213203366]
<0:0> SCORE:WARNING MPICH/SCore: received an invalid fragment (no previous fragment)
<0:0> SCORE:PANIC MPICH/SCore: critical error on message transfer
<0:0> Trying to attach GDB (no DISPLAY): PANIC
SCORE: Program aborted.
SCOUT: Session done.
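
One quick sanity check on the log above is to print the received-fragment
header fields in hex and compare neighbouring values. The following is only
a minimal sketch in Python: the numbers are copied from the log, while the
dictionary keys and the describe() helper are illustrative names, not SCore
or MPICH identifiers.

```python
# Header fields copied from the "received-fragment" entry of the log.
# The keys mirror the log labels; they are not SCore API names.
fields = {
    "type": 1213202598,
    "from": 1213202854,
    "size": 1213203110,
    "offset": 1213203366,
}


def describe(fields):
    """Return (name, hex string, difference from previous field) per field."""
    rows = []
    prev = None
    for name, value in fields.items():
        delta = None if prev is None else value - prev
        rows.append((name, hex(value), delta))
        prev = value
    return rows


rows = describe(fields)
for name, hx, delta in rows:
    print(f"{name:>6} = {hx}  delta = {delta}")
```

The four fields come out as 0x485000a6, 0x485001a6, 0x485002a6 and
0x485003a6, each exactly 256 larger than the one before, and the "from"
value is far outside the 0..23 rank range of this 24-process job. That
pattern would be consistent with something other than a real fragment header
being parsed as one, but it does not by itself identify the cause.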


However, when we run the same job again, it runs fine.

Does anyone know what might be causing this?


Thanks,

Nick

_______________________________________________
SCore-users mailing list
SCore-users @ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users