[SCore-users-jp] [SCore-users] Fwd: problem launching some jobs

Nick Birkett nrcb @ streamline-computing.com
Tue May 6 22:31:10 JST 2003


Hi, I have received this from one of our users.

The system is 128 dual Xeon compute nodes with 
fibre optic Myrinet 2000.

Has anyone seen a similar thing (SCore 5.0.1)?

-------------------------------------------------------------------------

An error has been reported to me that occasionally occurs when launching
parallel jobs on Snowdon through Sun Grid Engine.

The only unusual output is in the .e file, which contains:

SCOUT: bind: Address already in use.

The code does not get launched.
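
For reference, "Address already in use" is the standard text for the
EADDRINUSE error returned by bind() when the requested address and port
are already bound by another socket. The sketch below is only a generic
illustration of that condition in Python; it is not SCore or scout code,
and the function name bind_twice is made up for the example.

```python
import errno
import socket


def bind_twice():
    """Bind two TCP sockets to the same address; return the second bind's errno."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS to pick any free port for the first socket.
        first.bind(("127.0.0.1", 0))
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Same address and port, no SO_REUSEADDR: this bind is refused.
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    print(bind_twice() == errno.EADDRINUSE)
```

Whether the conflicting port here belongs to a leftover process from an
earlier job or to something else on the node is not known.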

Trying to reproduce this error with any degree of consistency has been
tricky. The only way of triggering the error has been to submit multiple
copies of my "hello world" program to SGE and then grep the output files
for the error.
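
The search over the output files can be sketched roughly as follows. This
is a minimal illustration, not the script actually used: the function name
find_failed_jobs and the "*.e*" file pattern are assumptions, and the
pattern should be adjusted to however the job's error files are named.

```python
import glob
import os

# Error text as it appears in the .e file.
ERROR_TEXT = "SCOUT: bind: Address already in use"


def find_failed_jobs(directory, pattern="*.e*"):
    """Return sorted names of files in `directory` that contain ERROR_TEXT.

    The default pattern is an assumption about how error files are named.
    """
    failed = []
    for path in glob.glob(os.path.join(directory, pattern)):
        if not os.path.isfile(path):
            continue  # skip directories that happen to match the pattern
        with open(path, errors="replace") as handle:
            if any(ERROR_TEXT in line for line in handle):
                failed.append(os.path.basename(path))
    return sorted(failed)


if __name__ == "__main__":
    for name in find_failed_jobs("."):
        print(name)
```

Comparing the failed job names against the matching .pe files then shows
which nodes each failed job was given.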

I first thought that there was probably something wrong with a particular
node in the cluster. However, looking at the .pe files, my code sometimes
works and sometimes fails using exactly the same nodes specified in the pe
file.

The error has been seen when running on anything from one node upwards.

I have not been able to reproduce the error by running the program directly
through "scout".

If you want to look at some output files, you're welcome to search through
the output files in ~issanr/test/

We are unsure whether the problem is with SGE, SCore or a node
configuration.

Thanks for your help,

_______________________________________________
SCore-users mailing list
SCore-users @ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users


