The Scheduler: SGE

Introduction

When you first log in, you will be directed to one of a small number of login nodes. These provide ordinary command line access to the system, which is needed for setting up runs, compiling code and carrying out some analysis work. Login nodes are shared between everyone who is logged in and can therefore very quickly become overloaded.

The compute power behind the system is accessible through the scheduler, a batch submission system. When a job executes through the batch system, processors on the back-end are made available exclusively for the purposes of running the job.

The batch queue system installed is Son of Grid Engine, together with locally developed and implemented patches.

To interact with the batch system the user must request resources that are sufficient for their needs. At a minimum these are:

  • how long the job needs to run for
  • on how many processors (assumed to be 1 unless otherwise specified)

With this information, the scheduler is able to dispatch the jobs at some point in the future when the resources become available. A fair-share policy is in operation to guide the scheduler towards allocating resources fairly between different faculties.

This fair-share policy takes into account both an individual user’s past usage and the usage of the faculty as a whole. Essentially, this means that a user with heavy recent usage (within the last 7 days) will have their jobs reduced in priority to allow other users’ jobs to run.

Faculty shares are allocated on the basis of funding; faculties do not have equal shares of system capacity.

Resource Reservation and Backfill

By default all jobs are eligible for resource reservation, meaning the scheduler books future start times for the highest priority jobs. The qsched -a command can be used to generate a list of the anticipated start times of these jobs. At the moment, only the top 128 jobs are considered for resource reservation. The system will backfill jobs if they will start and finish before the highest priority jobs are scheduled to start. Therefore, indicating a realistic runtime for a job (rather than the queue maximum) will make short jobs eligible to be backfilled, potentially shortening their wait time.

There is also a facility to book an amount of HPC resource for some time in the future, through advance reservation. Jobs eligible to run in that reservation can then be submitted to run within it. Advance reservation is not enabled for users by default, however these reservations can be enabled upon request provided there is a valid case for their use and the fairness policies allow it.

Queue Configuration

Currently the facility is configured with a single general access queue, allowing submission to all available compute resources. Thus, there is no need to specify a queue name in job submissions.

Time Limits

Jobs requesting a time up to the maximum runtime of the queue are eligible to be run. At the moment the maximum runtime is 48 hours.

Should a job run beyond the length of time it requested, it will be killed by the queuing system. To change the time requested by a batch job, change the value given to the -l h_rt flag. For example (runscript.sh is used here and in later examples as an illustrative script name):
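  qsub -l h_rt=6:00:00 runscript.sh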

Will request six hours of runtime.

Memory Usage

In order that programs do not compete for the available memory in a machine, memory is treated as a consumable resource. This helps ensure that if one job is consuming 100GB of memory on a node with a total of 128GB, the maximum total size of all other jobs allowed to execute on that node is 28GB.

By default, a limit of 1GB per process (or 1GB per slot) is defined for all batch jobs. To override this behaviour, use the -l h_vmem switch to qsub. E.g. to run a single-process code using 6GB of memory for 6 hours:
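  qsub -l h_rt=6:00:00 -l h_vmem=6G runscript.sh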

As memory is specified per slot, a request such as the following (using the smp parallel environment described below):
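  qsub -l h_rt=6:00:00 -l h_vmem=2G -pe smp 4 runscript.sh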

Will request a total of 8GB of memory, shared between 4 processes.

Jobs will only be run on nodes where the total memory requested per node does not exceed the physical memory of that node. Please note that if a job requests more memory than is physically available, the job will not run, though it will still show up in the queue. If an executing program exceeds the memory it requested, it will be automatically terminated by the queuing system.

Job Submission

The general form of a job submission with the qsub command is as follows:
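  qsub [options] script_file_name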

where script_file_name is a file containing the commands to be executed by the batch job.

For commonly used options and more details about qsub please look at our Qsub page.

For example submission scripts please look at these script examples.

Submitting Shared-Memory Parallel Jobs

Shared memory parallel jobs are jobs that run multiple threads or processes on a single multi-core machine. For instance, OpenMP programs are shared memory parallel jobs.

There is a shared memory parallel environment (pe) called smp, set up to enable the submission of this type of job. The option needed to submit such a job is:
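  -pe smp <np>

where <np> is the number of processes required.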

For example:
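  qsub -l h_rt=6:00:00 -pe smp 4 runscript.sh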

will request 4 processes on a single shared memory machine, running for 6 hours.

Distributed Parallel Jobs with the Node Syntax

This type of parallel job runs multiple processes over multiple processors, either on the same machine or more commonly over multiple machines.

A significant change made to the batch system on ARC2 is that, in addition to the standard Grid Engine submission syntax, an alternative “nodes” syntax has also been implemented. This is designed to give jobs dedicated access to entire nodes. This should provide more predictable job performance, for instance through placement and dedicated use of the Infiniband cards, as well as providing a more flexible way of specifying processes and threads for mixed-mode programming.

It can take either of the following forms:
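  -l nodes=<w>[,ppn=<y>][,tpp=<z>]
  -l np=<x>[,ppn=<y>][,tpp=<z>]

(The ppn and tpp option names above are a sketch of the locally implemented syntax.)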

Where:

  w   number of nodes requested
  x   number of processes requested
  y   number of processes per node (rewrites the MPI hostfile to this)
  z   number of threads per process (sets OMP_NUM_THREADS to this)

If y and z are omitted, Grid Engine sets y = number of cores in each machine and z = 1.
If y is present and z omitted, Grid Engine sets z = int(num_cores / y).
If z is present and y omitted, Grid Engine sets y = int(num_cores / z).

If using this syntax, the amount of memory available to the job on each node is automatically set according to the type of node requested, which is selected with the node_type flag.

Guide to the Nodes on ARC3
Node Type      Number of nodes                    Memory   Node type flag
Standard       165                                128GB    24core-128G (default)
High Memory    2                                  768GB    24core-768G
GPGPU          2 (each with 2 NVIDIA K80 GPUs)    128GB    N/A

An example of how to use the node_type flag, in this case to request a whole high memory node, might be:
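  qsub -l h_rt=6:00:00 -l nodes=1 -l node_type=24core-768G runscript.sh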

These options will allocate whole numbers of nodes for a particular job; other jobs will not share the resources.
For example,
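  -l nodes=1,ppn=4,tpp=1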

and
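  -l np=4,ppn=4,tpp=1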

will both request exclusive access to a whole node, using just 4 of its cores.

Since the whole node is allocated, all of its memory is available to the job; on a node with 32GB of memory, for example, this works out at 8GB for each of the 4 cores in use.

These options also support mixed mode (MPI+OpenMP) programming.
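For instance, a mixed-mode job might be requested along the following lines (a sketch assuming the 24-core standard nodes described above):

  qsub -l h_rt=6:00:00 -l nodes=2,tpp=6 runscript.sh

Here ppn is omitted, so Grid Engine sets it to int(24 / 6) = 4 processes per node, giving 8 MPI processes in total, each running 6 OpenMP threads.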

Submitting distributed parallel jobs with the standard SGE syntax (ARC1 and ARC2)

In addition, the standard Grid Engine method for requesting a number of cores is available via use of a parallel environment, in this instance ib. So the option needed would be:
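  -pe ib <np>

where <np> is the number of processes required.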

Querying queues

Once you have submitted your job to the queue, you can check on its status using the qstat command. You will see a report displayed something like this:
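(the job and queue details shown are purely illustrative)

  job-ID  prior    name       user     state  submit/start at      queue                   slots
  -----------------------------------------------------------------------------------------------
  123456  0.52000  my_job.sh  exuser1  r      01/01/2024 10:00:00  24core-128G.q@node001   4
  123457  0.50500  my_job.sh  exuser1  qw     01/01/2024 10:05:00                          4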

You can refine your request to obtain some additional information:

Switch Action
-help Prints a list of all options
-f Prints full display output of all queues
-g c Print a ‘cluster queue’ summary – good for understanding what resources are free, across different queue types
-g t Print ‘traditional’ output, i.e. print a line per queue used, rather than a line per job
-u <username> Displays all jobs for a particular username

The switches are documented in the man pages; for example, to check all options for the qstat command type:
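  man qstat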

By default, users will only see their own jobs in the qstat output. To see all jobs use a username wildcard:
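  qstat -u '*'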

The state column will indicate the progress of your job:

code meaning
qw waiting in the queue to start
t transferring to a node (about to start)
r running
h job held back by user request
E error

Job deletion

If you want to remove a job from the queues, you can use the qdel command (perhaps there’s a bug in your code or you don’t need to run it anymore).

First, run qstat to get the job-ID of your job (it’s the leftmost column in the example above). Then:
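  qdel <job-ID>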

To force this action for running jobs:
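  qdel -f <job-ID>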

A user can delete all their jobs from the batch queues with the command:
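  qdel -u <username>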

Altering the status of a job in the queue

It is possible to do this, but only with jobs that are still queuing.

If your job is currently running, simply delete and resubmit.

If the job is still queued, you can change some of the parameters:

Get the <jobid> by running the qstat command (it will be the left hand column in the table displayed).

Find all the -l resource parameters currently set for the job; one way to do this is:
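  qstat -j <jobid> | grep resource_list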

An example result might be:
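  hard resource_list:         h_rt=14400,h_vmem=1G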

Change the parameter you want; here, change the runtime from the initial 4 hours to (say) 6 hours, then use the qalter command to alter the job entry in the queue:
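  qalter -l h_rt=6:00:00,h_vmem=1G <jobid>

Note that qalter -l replaces the job’s whole resource list, which is why the existing -l parameters were listed first; any other resources (here h_vmem) should be included again alongside the changed value.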