SCore-D Job Scheduling Scheme


This document describes how SCore-D manages host, memory, and disk resources. Most of the description applies only to SCore-D running in multi-user mode. The resource management can be controlled via the sc_console command.

SCore-D allocates compute hosts to user jobs. When SCore-D is invoked, it configures a cluster (of clusters) by itself according to the information described in the scorehosts.db file. Here, a cluster is a set of hosts having the same processors. If the SCOUT environment in which SCore-D is running consists of several types of processors, then SCore-D internally creates a cluster consisting of clusters.
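
The following is a rough sketch, not SCore-D source code, of how hosts might be grouped into clusters by processor type, with heterogeneous groups wrapped into a cluster of clusters; the data layout and function names are hypothetical.

    # Illustrative sketch: group hosts by processor type into clusters,
    # and wrap heterogeneous groups into a cluster of clusters.
    from collections import defaultdict

    def build_clusters(hosts):
        """hosts: list of (hostname, processor_type) pairs, e.g. taken from scorehosts.db."""
        by_proc = defaultdict(list)
        for name, proc in hosts:
            by_proc[proc].append(name)
        clusters = [{"processor": p, "hosts": h} for p, h in by_proc.items()]
        if len(clusters) == 1:
            return clusters[0]                              # one homogeneous cluster
        return {"processor": "mixed", "clusters": clusters}  # a cluster of clusters

    print(build_clusters([("comp0", "i386"), ("comp1", "i386"), ("node0", "ia64")]))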

When a user submits a job, its resource requirement is passed to SCore-D, and SCore-D tries to allocate an appropriate cluster or clusters, depending on the user's resource specification. When SCore-D finds two or more possible clusters, it allocates the less loaded cluster. Here, the speed number specified in the scorehosts.db file is used. Consider two clusters, one of which has 400-speed CPUs and the other 800-speed CPUs, and both clusters are equally loaded. Then SCore-D allocates the cluster having the faster CPUs.
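
The allocation rule above can be pictured with the following small Python sketch; the cluster records and the pick_cluster function are hypothetical, not part of SCore-D.

    # Prefer the least loaded cluster; break ties by the larger speed number.
    clusters = [
        {"name": "slow", "speed": 400, "load": 2},   # two jobs already running
        {"name": "fast", "speed": 800, "load": 2},   # equally loaded, faster CPUs
    ]

    def pick_cluster(candidates):
        # Sort key: load ascending, then speed descending.
        return min(candidates, key=lambda c: (c["load"], -c["speed"]))

    print(pick_cluster(clusters)["name"])   # -> "fast"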

Each cluster has memory, disk, and number-of-jobs properties that limit the maximum number of jobs in the cluster. By default, the cluster memory size is obtained by the Linux sysinfo(2) system call and the disk free space by the Unix statfs(2) system call, and both are set by SCore-D at startup. However, those limit values can be altered via the sc_console command. The use of the memory and disk limit values will be explained later.
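
As a rough Python analogue of this default sampling (SCore-D itself calls sysinfo(2) and statfs(2) from C), the sketch below reads /proc/meminfo and uses os.statvfs(); the function names are hypothetical and the paths are examples only.

    import os

    def total_memory_bytes():
        # Stand-in for sysinfo(2): read the MemTotal line of /proc/meminfo.
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1]) * 1024   # the value is in kB
        raise RuntimeError("MemTotal not found")

    def free_disk_bytes(path="/var/scored"):
        # Stand-in for statfs(2): free space on the file system holding `path`.
        st = os.statvfs(path)
        return st.f_bavail * st.f_frsize

    print(total_memory_bytes(), free_disk_bytes("/"))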

Once cluster(s) are allocated successfully, SCore-D accepts the user's login; otherwise the login is rejected. When the login succeeds, the job is placed in a scheduling queue. SCore-D has multiple priority queues. The number of queues depends on a configuration parameter, but the default is three (3). The queue numbered 0 has the highest priority.

Each queue is in one of three states: disabled, enabled but deactivated, or activated, just like a Unix line printer queue. When a queue is disabled, it does not accept any jobs. If a queue is enabled but deactivated, then the queue can accept jobs, but they are not scheduled until the queue is activated. When a user's job is first enqueued, only a queue entry is created. The parallel process(es) are created when the entry is scheduled for the very first time. So submitting a job to a deactivated queue does not create a parallel process, and no Unix resources on the allocated hosts are consumed.
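
The three queue states and their effect on job submission can be summarized by the following hypothetical sketch; the Queue class is illustrative only and not part of SCore-D.

    class Queue:
        def __init__(self):
            self.enabled = False     # disabled: submissions are rejected
            self.activated = False   # enabled but deactivated: accepts jobs, never scheduled
            self.entries = []

        def submit(self, job):
            if not self.enabled:
                raise RuntimeError("queue disabled: job rejected")
            self.entries.append(job)       # only a queue entry; no parallel process yet

        def schedulable(self):
            return self.entries if self.activated else []

    q = Queue()
    q.enabled = True
    q.submit("job-1")            # accepted, but no host resources consumed
    print(q.schedulable())       # [] until the queue is activated
    q.activated = True
    print(q.schedulable())       # ['job-1']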

Each queue has several properties: scheduling mode, maximum CPU time, time to remain in the queue, memory limit, disk limit, and a Unix group ID that is allowed to enqueue jobs to the queue. These resource limits are unset by default. The SCore-D administrator can set appropriate values for the properties of each queue.

The scheduling mode of a queue is either gang (TSS) or exclusive. When a queue is in gang mode, the jobs in the queue are scheduled in a time-shared way (gang TSS). When a queue is in exclusive mode, the jobs are scheduled in an exclusive way; this means that no parallel process switching takes place with the jobs in the same queue or with the jobs in lower priority queues. However, there can be cases in which jobs exist in a high priority queue, yet jobs in a lower priority queue are scheduled. Remember that SCore-D also schedules jobs in a space-sharing way. If there are hosts on which no jobs from the high priority queue are allocated, and jobs from a low priority queue are allocated on those hosts, then the low-priority jobs can also be scheduled simultaneously.
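
The space-sharing behaviour described above can be pictured with the following sketch, which assumes a simplified model in which each job holds a fixed set of hosts; the runnable function is hypothetical.

    def runnable(jobs):
        """jobs: list of (priority, set_of_hosts); a lower number means higher priority."""
        busy, chosen = set(), []
        for prio, hosts in sorted(jobs, key=lambda j: j[0]):
            if hosts.isdisjoint(busy):     # no host conflict with higher-priority jobs
                chosen.append((prio, hosts))
                busy |= hosts
        return chosen

    jobs = [(0, {"comp0", "comp1"}),       # high-priority, exclusive
            (2, {"comp2", "comp3"}),       # lower priority, disjoint hosts: also runs
            (2, {"comp1", "comp2"})]       # conflicts with the high-priority job: waits
    print(runnable(jobs))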

The maximum CPU time queue property specifies the maximum CPU time for a job to run. This queue property is inherited by a job when it is enqueued, and never altered afterwards. When a job's CPU time exceeds this limit, the job is automatically killed by SCore-D. When a job's CPU time exceeds the limit specified by the time-remaining queue property, the job is dequeued and enqueued into the next queue having lower priority, and the time-remaining value of the job is reset. With this mechanism, an SCore-D administrator can lower the priority of long-running jobs.
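
The following sketch illustrates the demotion mechanism under simplified, hypothetical accounting (CPU seconds per queue); the field names and queue values are examples, not SCore-D defaults.

    def account(job, queues):
        q = queues[job["queue"]]
        if q["max_cpu"] is not None and job["cpu_total"] > q["max_cpu"]:
            job["state"] = "killed"                  # exceeded the maximum CPU time
        elif q["time_to_remain"] is not None and job["cpu_in_queue"] > q["time_to_remain"]:
            if job["queue"] + 1 < len(queues):
                job["queue"] += 1                    # move to the next lower-priority queue
                job["cpu_in_queue"] = 0              # reset the time-remaining value
        return job

    queues = [{"max_cpu": 7200, "time_to_remain": 600},
              {"max_cpu": 7200, "time_to_remain": 3600},
              {"max_cpu": 7200, "time_to_remain": None}]
    job = {"queue": 0, "cpu_total": 900, "cpu_in_queue": 900, "state": "running"}
    print(account(job, queues))                      # demoted from queue 0 to queue 1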

The memory and disk limit values of a queue are percentages. The actual limit value is the queue limit percentage multiplied by the cluster limit value. If the user explicitly specifies the limit(s) as (a) scrun option(s) and the user-specified value is less than the calculated value, then the user-specified value is taken instead. SCore-D assumes that user jobs consume the calculated or explicitly specified limit value. When a job's resource (memory or disk) consumption exceeds the limit value, the job is killed automatically. SCore-D also assumes that the memory and disk limit values of clusters are the upper limits; no job is enqueued when there is not enough resource. For example, if a queue limit is set to 100% and the user does not specify the limit explicitly, then the queue can accept only one job.
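
A worked example of this limit calculation, with hypothetical numbers:

    def effective_limit(cluster_limit, queue_percent, user_limit=None):
        calculated = cluster_limit * queue_percent / 100.0
        if user_limit is not None and user_limit < calculated:
            return user_limit              # the smaller, explicitly given scrun limit wins
        return calculated

    cluster_memory = 2048   # MB, as sampled by SCore-D at startup
    print(effective_limit(cluster_memory, 50))        # 1024.0 MB assumed per job
    print(effective_limit(cluster_memory, 50, 256))   # 256 MB, the user asked for less
    # With a 100% queue limit and no explicit user limit, one job is assumed to
    # consume the whole cluster limit, so only one job fits in the queue at a time.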

Note that this memory and/or disk resource limitation mechanism is effective only when both the cluster limit and the queue limit are specified. Memory limits and disk limits are independent, though. By default, cluster limit values are set by SCore-D, but queue limit values are not. Thus the SCore-D administrator must set at least the queue limit value(s) when he/she wants to enforce the limit(s). The memory and disk limit values are expressed in percent because cluster hosts may have different amounts of memory and disk space. By default, SCore-D samples the free disk space of the file system where /var/scored/ resides. If the SCore-D administrator wants to limit disk usage on (an)other file system(s), add a watchfs attribute to the host record(s) in the scorehosts.db file, having the directory name(s) as its value(s). Currently there is no way to specify a limit on an individual file system.

When SCore-D is running in multi-user mode, and the file named scored.rc, located in the current directory and/or the install directory, is readable from the SCore-D server host, then the file is (or files are) assumed to contain a series of SCore-D console commands and is (are) read for initial settings.

Note on Low Resources

As described earlier in this document, when resources are too low for a job, the user's login to SCore-D is rejected. If the user wants to wait until enough resources become available, the wait option of the scrun command must be specified.

Note on Exclusive Scheduling

The currently implemented scheduling of the exclusive queue is not a starvation-free algorithm. This does not mean that some jobs actually starve, but they may have fewer scheduling opportunities than others. Consequently, there is a wide distribution of the slowdown ratio of submitted jobs. In contrast, gang scheduling is much fairer. The overhead of a gang context switch, although it depends on the number of hosts and the performance of the SCore-D network, is a few milliseconds in most cases.

Note on Memory and Disk Limit

Currently, the memory and disk limitations limit the usage of a job. If a job consists of parallel processes piped in parallel, then the resource usage is the total usage of the parallel processes running concurrently. If a job consists of serialized parallel processes, then the resource usage is that of each individual parallel process.
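
In other words, assuming per-process usage figures, the accounting could be sketched as follows (hypothetical example):

    def piped_job_usage(process_usages):
        return sum(process_usages)       # all parallel processes run concurrently

    def serialized_job_usage(process_usages):
        return max(process_usages)       # only one parallel process runs at a time

    print(piped_job_usage([300, 500]))        # 800 counted against the limit
    print(serialized_job_usage([300, 500]))   # 500 counted against the limit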

The disk usage of a job includes the size of the binary file(s) of the job, the size of the checkpointed process image, and the backup file of a checkpointed image. When a disk limit is set, the value must be larger than the total sum of the above file sizes. In addition, if a user's job creates temporary file(s) using the sc_create_temporary_file(3) function, the size of the temporary file(s) must also be taken into account.
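
For example, under the assumption that the backup is the same size as the checkpoint image, a lower bound on the disk limit could be estimated as follows (hypothetical figures):

    def minimum_disk_limit(binary_mb, checkpoint_mb, temporary_mb=0):
        backup_mb = checkpoint_mb        # backup copy of the checkpointed image
        return binary_mb + checkpoint_mb + backup_mb + temporary_mb

    print(minimum_disk_limit(binary_mb=10, checkpoint_mb=400, temporary_mb=50))   # 860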

See Also

SCore-D Console Command, sc_console, scorehosts.db, scrun

CREDIT
This document is a part of the SCore cluster system software developed at PC Cluster Consortium, Japan. Copyright (C) 2003 PC Cluster Consortium.