SGE: Examples with Extracts of Submission Scripts

Introduction

This page contains a number of simple batch submission scripts. These can be submitted to the batch system with the SGE qsub command.
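A minimal sketch of a submission using SGE's qsub command, assuming the script has been saved under the hypothetical name script.sh:

```shell
# script.sh is a hypothetical file name; substitute the name of your own script
qsub script.sh
```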

The examples below are all independent of the executing shell. The default executing shell is the Bourne shell (/bin/sh). If you require a different shell (e.g. /bin/csh), this can be specified with a #!/bin/csh directive at the top of the script.

Directives to the batch scheduling system must be preceded by #$, so for instance, to specify the current working directory, add #$ -cwd to your script.

General queue settings

There are several default settings for the batch queue system:

  • The runtime must be specified, otherwise jobs will be rejected.
  • The maximum runtime of all queues is 48 hours and no default value is set.
  • Unless otherwise specified, a default of 1GB/process (1GB/slot) is applied to all jobs.
  • Unless otherwise specified, jobs are executed from the user’s home directory, and by default output is also directed to the user’s home directory. The option -cwd can be used to run the job in, and direct output to, the current working directory, i.e. the directory from which the job is submitted.
  • Environment variables, set up by modules and license settings for example, are not exported by default to the compute nodes. So, if the option to export variables (-V) is not used, modules will need to be loaded within the submission script.

Note that with all of these scripts, it is possible to specify a combination of resources that is not available on the system (for example, requesting more memory than is available on a node). These jobs will not run and will simply wait in the queue until the resource becomes available (i.e. never…).

If your job does not start within a suitable timeframe, please check your script and consult us if you feel there is a problem.

Serial jobs

Simple serial job

To launch a simple serial job, serial_prog for instance, you must at the very least specify the runtime. For example, to run a job for 1 hour in the current working directory (-cwd), exporting variables (-V):
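The request described above could be sketched as the following script extract. serial_prog is the example program named in the text and is assumed to sit in the current directory; h_rt is the standard SGE resource for the runtime limit.

```shell
# Run from the current working directory and export the submission environment
#$ -cwd -V
# Request 1 hour of runtime (hours:minutes:seconds)
#$ -l h_rt=1:00:00
# serial_prog is the example program from the text; replace it with your own executable
./serial_prog
```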

More memory

The default allocation is 1GB/slot. To request more memory, use the -l h_vmem option. For example, to request 1500M of memory:
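A sketch of such a request, reusing the serial job from the previous section. The 1 hour runtime and the program name serial_prog are carried over as assumptions; only the h_vmem line is new.

```shell
#$ -cwd -V
#$ -l h_rt=1:00:00
# Request 1500M of memory per slot instead of the 1GB default
#$ -l h_vmem=1500M
# serial_prog is a placeholder for your own executable
./serial_prog
```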

Remember that ARC2 has a maximum of 32GB available per slot and ARC3 a maximum of 128GB per slot (768GB on the ‘large-memory’ nodes).

Task arrays

To run a large number of identical jobs, for instance for parameter sweeps or using a large number of input files, it is best to make use of task arrays. The batch system will automatically set up the environment variable $SGE_TASK_ID to correspond to the task number, and input and output files are indexed by the task number. For instance running tasks 1 to 100:
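A sketch of a task array covering tasks 1 to 100, using the standard SGE -t option. The file naming scheme (input.N, output.N) and the program name task_prog are hypothetical, and the fallback value for SGE_TASK_ID exists only so that the script can be dry-run outside the scheduler.

```shell
#$ -cwd -V
#$ -l h_rt=1:00:00
# Run tasks 1 to 100
#$ -t 1-100

# SGE sets SGE_TASK_ID for each task; the fallback of 1 only allows a dry run outside the scheduler
task_id="${SGE_TASK_ID:-1}"

# Hypothetical naming scheme: input.1 ... input.100 and output.1 ... output.100
input_file="input.${task_id}"
output_file="output.${task_id}"
echo "Task ${task_id}: ${input_file} -> ${output_file}"

# task_prog is a hypothetical program; it is only run if present in the current directory
if [ -x ./task_prog ]; then
    ./task_prog "${input_file}" > "${output_file}"
fi
```

Each of the 100 tasks runs the same script, so indexing the file names by the task number is what keeps their inputs and outputs separate.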

SGE_TASK_ID will take the values 1, 2, …, 100.

Parallel jobs

Shared memory

Shared memory jobs should be submitted using the -pe smp <cores> flag. The maximum number of cores that can be requested for shared memory jobs is limited by the number of cores available in a single node. Note that the OMP_NUM_THREADS environment variable is automatically set to the requested number of cores by the batch system. To run a 16-process shared memory job for 1 hour:
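A sketch of the 16-core request described above. The program name openmp_prog is hypothetical.

```shell
#$ -cwd -V
#$ -l h_rt=1:00:00
# Request 16 cores in the shared memory parallel environment
#$ -pe smp 16
# OMP_NUM_THREADS is set to 16 by the batch system, so it is not set here
# openmp_prog is a placeholder for your own threaded executable
./openmp_prog
```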

So, for ARC2, the maximum value for -pe smp is 16; for ARC3 it is 24.

Larger-Shared memory

To request more than the default 1GB/process of memory, use the -l h_vmem flag. For example, to request 8 processes with 3500M/process:
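A sketch of this request. The text gives no runtime for this example, so the 1 hour limit is an assumption, as is the program name openmp_prog.

```shell
#$ -cwd -V
# Runtime is mandatory; 1 hour is assumed here
#$ -l h_rt=1:00:00
# Request 8 cores on a single node
#$ -pe smp 8
# Request 3500M of memory for each of the 8 processes
#$ -l h_vmem=3500M
# openmp_prog is a placeholder for your own threaded executable
./openmp_prog
```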

Please note that as ARC2 consists of 16-core nodes with a total of 32GB of memory each, this can be thought of as 2GB/core.

MPI jobs

On the ARC systems a number of locally developed patches are applied to the batch system, which give a much more effective way of submitting MPI jobs.

Node syntax (preferred method)

For large jobs, the preferred way to request the required number of cores is the -l np flag. This will ensure that jobs are given exclusive use of entire nodes. In addition, nodes will be allocated so as to minimise latency, by giving jobs the best possible placement. For example, to request 64 processes for 1 hour:
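A sketch of the 64-process request. The np=64 form is assumed from the usual SGE -l resource=value convention, since the text names only the flag. The mpirun launcher and the program name mpi_prog are also assumptions, with mpirun taken to pick up the allocation from the batch system.

```shell
#$ -cwd -V
#$ -l h_rt=1:00:00
# Request 64 processes on whole nodes (assumed resource=value form of the -l np flag)
#$ -l np=64
# mpi_prog is a placeholder for your own MPI executable
mpirun ./mpi_prog
```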

Using this syntax will allocate exclusive use of nodes, with all the available memory in each node. In the case of standard nodes on ARC2, the above will give 4 nodes, each with 16 cores and 32GB of memory, or 2GB/core.

Alternatively, you can explicitly request the number of nodes. For example, to request 4 nodes and use all the cores available in those nodes, use:
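A sketch of a 4-node request. The text does not show the resource name, so nodes=4 is an assumption about the local syntax and should be checked against the system documentation. The runtime, mpirun and mpi_prog are likewise assumed.

```shell
#$ -cwd -V
#$ -l h_rt=1:00:00
# Request 4 whole nodes (the resource name "nodes" is an assumption, not confirmed by the text)
#$ -l nodes=4
# mpi_prog is a placeholder for your own MPI executable
mpirun ./mpi_prog
```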

In the case of the default nodes, this will give 64 cores and all the memory available in those nodes, i.e. 2GB/core. To request 8 nodes with 8 processes per node, use:
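A sketch of a request for 8 nodes with 8 processes on each. The resource names nodes and ppn are assumptions about the local syntax, as the text does not show them. The runtime, mpirun and mpi_prog are also assumed.

```shell
#$ -cwd -V
#$ -l h_rt=1:00:00
# Request 8 whole nodes with 8 processes per node
# (the resource names "nodes" and "ppn" are assumptions, not confirmed by the text)
#$ -l nodes=8,ppn=8
# mpi_prog is a placeholder for your own MPI executable
mpirun ./mpi_prog
```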

This will provide exclusive use of 8 nodes, with 8 cores per node and all the memory, i.e. 64 cores with 4GB/core. However, please note that the output of qstat will reflect that the job is occupying 8 full nodes, i.e. 128 cores.

As your job is using all the available memory with 8 (out of 16) cores in each node, the remaining cores will not have jobs allocated to them.
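The figures quoted for 8 nodes with 8 processes per node can be checked with a little shell arithmetic. The node sizes are those given in the text for standard ARC2 nodes; the variable names are purely illustrative.

```shell
# Figures for standard ARC2 nodes, taken from the text
cores_per_node=16
mem_per_node_gb=32

# The request from the example: 8 nodes with 8 processes per node
nodes=8
procs_per_node=8

# Processes actually started
total_procs=$((nodes * procs_per_node))
# Memory available to each process when a node's memory is shared by 8 processes
mem_per_proc_gb=$((mem_per_node_gb / procs_per_node))
# Cores reported as occupied, since whole nodes are allocated
cores_reserved=$((nodes * cores_per_node))
# Cores that are reserved but run no process
idle_cores=$((cores_reserved - total_procs))

echo "processes: ${total_procs}"
echo "memory per process: ${mem_per_proc_gb}GB"
echo "cores reserved: ${cores_reserved} (${idle_cores} idle)"
```

This gives 64 processes with 4GB each, while 128 cores are reserved and 64 of them stay idle.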

MPI Standard SGE syntax

The standard SGE submission syntax is also available, and is better suited to smaller jobs. For example, to request 64 processes for 1 hour:
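A sketch using the standard -pe option. The text does not name the parallel environment to use for MPI jobs, so ib is an assumption and should be replaced by whatever name the system defines. mpirun and mpi_prog are also assumed.

```shell
#$ -cwd -V
#$ -l h_rt=1:00:00
# Request 64 slots in an MPI parallel environment
# (the environment name "ib" is an assumption, not confirmed by the text)
#$ -pe ib 64
# mpi_prog is a placeholder for your own MPI executable
mpirun ./mpi_prog
```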

This will allocate 64 processes, not necessarily on the same node, but will guarantee a minimum of 2:1 blocking.

Larger-memory MPI jobs

When using the node syntax, the job is by default allocated all the memory in a node, so you will not usually need to adjust the memory requirement.

When using the standard SGE syntax, more memory can be requested by using the -l h_vmem flag. For example, to ask for 64 cores with 4GB/core for 1 hour:
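A sketch of this request. As the text does not name the MPI parallel environment, ib is an assumption. mpirun and mpi_prog are also assumed.

```shell
#$ -cwd -V
#$ -l h_rt=1:00:00
# Request 64 slots (the environment name "ib" is an assumption, not confirmed by the text)
#$ -pe ib 64
# Request 4G of memory for each of the 64 processes
#$ -l h_vmem=4G
# mpi_prog is a placeholder for your own MPI executable
mpirun ./mpi_prog
```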

Mixed Mode Programming

The batch system also supports mixed mode (MPI+OpenMP) programming.

Non-optimal topology process placement

Using -l placement=scatter will ignore the InfiniBand topology and could reduce your job wait time, as your job can be allocated to nodes anywhere on the machine. However, this might be at the expense of code performance, because the resulting placement does not minimise the number of switch hops and so increases latency.
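A sketch of a job that opts out of topology-aware placement. The placement=scatter option is taken from the text. Combining it with an np=64 request, the 1 hour runtime, mpirun and mpi_prog are all assumptions made for illustration.

```shell
#$ -cwd -V
#$ -l h_rt=1:00:00
# Request 64 processes (assumed resource=value form of the -l np flag)
#$ -l np=64
# Accept nodes anywhere on the machine, trading network latency for a shorter wait
#$ -l placement=scatter
# mpi_prog is a placeholder for your own MPI executable
mpirun ./mpi_prog
```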