[SCore-users-jp] Re: [SCore-users] Questions about Score scheduling scheme

Wed Apr 30 16:35:48 JST 2003

Hi,

Thank you for your prompt reply.

On Wed, 30 Apr 2003, Atsushi HORI wrote:

> Hi,
> 
> >1. It would be natural if another job with priority 0 went to a node
> >where there is no job with priority 0 running. This is not the case
> >with Score-5.4.0. For example, if the execution of job0, running with
> >priority 0, depends on the output of job1, which starts later with
> >priority 0, and job1 is scheduled to the node where job0 is already
> >running, we have a deadlock problem.
> 
> If job1 depends on job0, or job1 should be started after the
> completion of job0, then you have to declare the dependency, as
> follows:
> 
> % scrun -nodes=XX job0 :: job1
> 
The problem is that job0 depends on a series of jobs that prepare
its input data. To minimize the execution time of job0, we start
job0 as soon as the first two input files are ready, and then trigger
the next jobs as their input data arrives.
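
To make the situation concrete, the deadlock can be viewed as a cycle
in a wait-for graph. The Python below is only an illustrative sketch,
not part of SCore: the function names and the job-to-node placement
table are hypothetical, and it assumes that a node runs only one
priority 0 job at a time, so a later job placed on the same node waits
for the earlier one to finish.

```python
from typing import Dict, List, Optional, Set


def build_wait_for(placement: Dict[str, int],
                   depends_on: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Build a wait-for graph: job -> set of jobs it must wait for.

    placement:  job -> node, listed in start order (hypothetical table).
    depends_on: job -> jobs whose output it needs before it can finish.
    Assumption: one priority 0 job runs per node at a time.
    """
    waits = {job: set(depends_on.get(job, ())) for job in placement}
    holder: Dict[int, str] = {}  # node -> most recent job placed on it
    for job, node in placement.items():  # dicts preserve insertion order
        if node in holder:
            waits[job].add(holder[node])  # queued behind the earlier job
        holder[node] = job
    return waits


def find_cycle(waits: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return one cycle of mutually waiting jobs, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = {job: WHITE for job in waits}
    path: List[str] = []

    def visit(job: str) -> Optional[List[str]]:
        colour[job] = GREY
        path.append(job)
        for other in sorted(waits[job]):
            state = colour.get(other, BLACK)  # unknown jobs count as done
            if state == GREY:  # back edge: we found a cycle
                return path[path.index(other):] + [other]
            if state == WHITE:
                found = visit(other)
                if found:
                    return found
        path.pop()
        colour[job] = BLACK
        return None

    for job in waits:
        if colour[job] == WHITE:
            cycle = visit(job)
            if cycle:
                return cycle
    return None


# job0 needs the output of job1, but both land on node 0.
same_node = build_wait_for({"job0": 0, "job1": 0}, {"job0": {"job1"}})
# Same dependency, but job1 is placed on a free node.
free_node = build_wait_for({"job0": 0, "job1": 1}, {"job0": {"job1"}})
```

Under these assumptions, find_cycle(same_node) reports that job0 and
job1 wait on each other, while find_cycle(free_node) finds no cycle.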

We use the priority 0 concept because it is the easiest to
implement. Perhaps we will have to think about another solution
(Score in combination with PBS).

> >2. If we further play with renicing (the sc_console nice command), the
> >command becomes effective only when the process is suspended/resumed.
> >Is this a feature or a bug?
> 
> I suppose you are running SCore-D with a longer time slice than the
> default. If this is true, then this is a feature.

We are using the default time slice. Could there be anything else
wrong with our setup?

> 
> >3. For some time we have been considering
> >checkpointing/aborting/restarting, but the operation fails with the
> >messages:
> 
> How did you trigger checkpointing?

I trigger checkpointing via sc_console. Then I abort the job
(via sc_console again) and then restart it (also via sc_console).

I have an additional question: last night sc_watch went into a
reboot. I do not suspect a hardware failure (hopefully), so the
problem must be somewhere else. We run quite a lot of
single-processor jobs doing a lot of I/O. Is it possible that when
they all try to do I/O at the same time, the load goes so high that
sc_watch simply gives up? Is there a way to increase the timeout
period? (I remember that there was a post about that on the SCore
mailing list, but now I cannot find it.)

Best regards, Jure

_______________________________________________
SCore-users mailing list
SCore-users ＠ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users