[SCore-users-jp] Re: [SCore-users] features mentioned .. + further questions

David Werner david.werner @ iws.uni-stuttgart.de
Wed Feb 18 02:01:59 JST 2004


On Tue, Feb 17, 2004 at 11:22:21PM +0900, Atsushi HORI wrote:

> I am responsible for these features, processor allocation of SCore, and I 
> know many users want to specify the hosts to run their program, but I 
> can NOT understand that at all.
> 
> My design policy is that SCore knows the loads of entire system 
> (cluster) and can choose an appropriate subset of clusters for jobs 
> according to the system loads. Think about the case when two or more 
> users submit jobs almost simultaneously. If they could choose the 
> hosts where the least number of jobs are running, they would do so. But 
> the next moment, their simultaneously submitted jobs might run on the 
> same host set !! This situation can be avoided if SCore does the "scheduling."
> 
> The same policy can be found on Linux/Unix. There might be two or 
> more CPUs in a PC, but user(s) can not specify which CPU to 
> run his/her job.
> 
> Do you want to specify the host set, still ?

Hi again, everyone,

Thank you for your response. Thanks to Kameyama Toyohisa too!
We have the problem that some jobs are
not parallelized, yet they often require an extra amount
of memory.  Since one third of our cluster is equipped
with more RAM, my thought was to give the user the chance
to start those jobs on these nodes.
As a job usually allocates memory dynamically,
I think SCore has no chance to detect
such a situation in advance.  I also suspect
that the ability to migrate such a job onto a different node
has not been implemented yet, although
in principle, since features like checkpointing work, it might be
possible and within the range of future enhancements.

Another feature I badly miss is the ability to extend or shrink
a cluster in its number of running nodes without
interrupting jobs on nodes which are not affected, i.e.
without any restart at all.
This could, for example, make it possible to reinstall a cluster
in parts without interrupting the computing service.

In the past days we experienced hardware problems on many nodes,
and an operation scheme for node exclusion and reintegration
without restart would also be helpful in this situation.
As the best approximation to that, we have for a few days now
been using the scheme for Automatic Operation and High Availability.
But I expect to have to do a restart when we get some
five nodes back from repair.
Is it then possible to stay in the current session with
an extended set of nodes (although a restart seems to be unavoidable),
or am I constrained in that an enlarged set of nodes
(scorehosts) requires starting a completely new session?

Am I expecting too much? 
How is the development of SCore structured with regard
to such goals?

Greetings, 
	David

> 
> ----
> Atsushi HORI
> Swimmy Software, Inc.
_______________________________________________
SCore-users mailing list
SCore-users @ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users


