[SCore-users-jp] Re: [SCore-users] ULT:PANIC Try to break thread

Jure Jerman jure.jerman @ rzs-hm.si
Mon Dec 20 18:02:12 JST 2004


Hi,

Atsushi HORI wrote:

> 
> I think I fixed the bug.
> 
It is very nice to read this.

> What did you do when the error happens ? Or, how can I reproduce the 
> error ?

This needs a bit of explanation:
- In our operational suite we run a few large jobs (on 20+ processors).
   The binary is produced by the Lahey (Fujitsu) Fortran compiler (ver 6.1)
   and it was NOT relinked after the SCore upgrade, so we suspect that the
   problem could be connected to this fact.
- Another fact is that we were not able to reproduce the same error
   with another binary. Once we ran thousands of mpitest jobs, and after
   some thousands of successful runs scored hung, but not with the
   ULT_PANIC error.
- The error occurs quite randomly and is not reproducible on demand. One
   has to run a larger number (5-10) of SCore jobs to get this error.


> 
> By the way, some line numbers of the debugger output are different from 
> what I am working on.
> Are you sure you upgraded all the nodes ?

We did a clean install of ALL compute nodes and an upgrade of the master node
from Red Hat 7.3 to Fedora Core 1. The SCore software was completely removed
from the master node and a bininstall -server was performed.

Thank you, Jure Jerman

_______________________________________________
SCore-users mailing list
SCore-users @ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users



More information about the SCore-users-jp mailing list