What is a Scheduler?

The computational resources on our supercomputers are accessed through a scheduler. The scheduler we use is Sun Grid Engine (SGE), a piece of software that schedules and tracks batch tasks. These tasks are submitted as jobs, and almost all programs on supercomputers are run as batch tasks.

The most obvious way to schedule jobs would be to dispatch them in the order in which they arrive in the queue. This is called first-come, first-served scheduling. While it is simple to implement, it has some disadvantages:

  1. a user could monopolise the supercomputer and starve the other users of resources – fair share scheduling rather than first-come, first-served scheduling is the best way to avoid this.
  2. ‘gaps’ may appear in the queue – this is generally known as the packing problem and the scheduler can backfill to adjust for this.

All users start with the same priority for computational resources (cores, nodes and memory), so the fair share scheduler treats new users in a first-come, first-served way for the first job that each of them submits. After that, a user’s scheduling priority is adjusted, with a half-life of 2 weeks: if they submit no jobs for 2 weeks, their priority will be half of what it was when they submitted their last job.
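The half-life rule can be illustrated with a few lines of Python. This is only a minimal sketch, assuming the value decays smoothly and exponentially between jobs. The function name `decayed_value` and the continuous-decay model are illustrative assumptions, not a description of how SGE computes priorities internally.

```python
def decayed_value(value: float, days_idle: float, half_life_days: float = 14.0) -> float:
    """Return `value` after `days_idle` days of exponential decay.

    The value halves every `half_life_days` (2 weeks in the text above).
    """
    return value * 0.5 ** (days_idle / half_life_days)


# Hypothetical starting value of 1.0 at the time of the last job submission.
after_two_weeks = decayed_value(1.0, days_idle=14)
after_four_weeks = decayed_value(1.0, days_idle=28)
print(after_two_weeks, after_four_weeks)
```

Under this assumption the value is halved after 2 weeks and quartered after 4 weeks.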

For example, imagine that we have a new system with 100 nodes and only 3 users: Jane, Peter and Yuwei. They are all new users with the highest possible priority, and the scheduler is implemented to run just across that system. Jane is an engineer and has relatively large jobs that run for a long time. She submits 3 jobs that are each scheduled to run on 70 nodes for 10 hours. The first job starts running and the other 2 are queued up behind it.

The queue looks like this:

First jobs submitted to a fair share scheduler.
A new user submits 3 large jobs on an empty system to a scheduler that works on a fair share algorithm.

Peter is a computational chemist who submits 2 jobs, each to run on 40 nodes for 5 hours, an hour after Jane submitted her 3 jobs. Peter had not previously submitted any jobs, so his priority with the fair share scheduler is now better than Jane’s. His jobs are also smaller than Jane’s, so both of his jobs are scheduled to run before Jane’s second job.

The queue now looks like this:

A second set of jobs is submitted.
A second new user submits 2 medium sized jobs onto a system with just 3 large jobs already there to a scheduler that works on a fair share algorithm.
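The reordering in this step can be sketched in Python. The sketch assumes, purely for illustration, that pending jobs are ordered by how many node-hours their owner has already consumed, with submission time as the tie-breaker. The `Job` class, the `fair_share_order` function and the usage figure are hypothetical and are not taken from SGE.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    owner: str
    submitted_hour: int


def fair_share_order(pending, usage_by_owner):
    """Order pending jobs: owners with less recorded usage go first,
    and earlier submissions break ties."""
    return sorted(
        pending,
        key=lambda job: (usage_by_owner.get(job.owner, 0.0), job.submitted_hour),
    )


# Jane's first job is already running; her other 2 jobs and Peter's 2 jobs wait.
pending = [
    Job("jane-2", "Jane", 0),
    Job("jane-3", "Jane", 0),
    Job("peter-1", "Peter", 1),
    Job("peter-2", "Peter", 1),
]
# Assumed usage: Jane has run on 70 nodes for 1 hour, Peter has no usage yet.
usage = {"Jane": 70.0}

order = [job.name for job in fair_share_order(pending, usage)]
print(order)  # ['peter-1', 'peter-2', 'jane-2', 'jane-3']
```

With these assumed numbers, both of Peter's jobs move ahead of Jane's second and third jobs, matching the queue described above.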

Yuwei is a bioinformatician who submits 10 jobs, each for 30 nodes and 2 hours, 1 hour after Peter. These are small jobs that can fill the gaps in the queue; this process is called backfill.

The queue now looks like this:

A third set of jobs is submitted.
A third new user submits 5 small sized jobs onto a system with just 3 large jobs and 2 medium sized jobs already there to a scheduler that works on a fair share algorithm.
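The backfill step can be sketched as a simple check: a small job may start early only if it fits in the idle nodes and finishes before the next higher-priority job is due to start. The following is a minimal sketch using the numbers from the example, assuming requested run times are exact. The function `can_backfill` and the variable names are hypothetical, not part of SGE.

```python
def can_backfill(job_nodes, job_hours, free_nodes, now, reserved_start):
    """True if the job fits in the idle nodes and ends before the
    higher-priority job's reserved start time."""
    return job_nodes <= free_nodes and now + job_hours <= reserved_start


TOTAL_NODES = 100
JANE_NODES = 70           # Jane's first job, running from hour 0 to hour 10
free_nodes = TOTAL_NODES - JANE_NODES
reserved_start = 10       # Peter's 40-node jobs cannot start before hour 10

now = 2                   # Yuwei submits 1 hour after Peter (hour 2)
started = []
for i in range(1, 11):    # Yuwei's 10 jobs: 30 nodes, 2 hours each
    if can_backfill(30, 2, free_nodes, now, reserved_start):
        started.append((f"yuwei-{i}", now))
        now += 2          # each job uses all 30 idle nodes, so the next one waits

print(started)
```

Under these assumptions 4 of Yuwei's jobs fit into the gap beside Jane's first job, starting at hours 2, 4, 6 and 8, without delaying Peter's jobs. The remaining jobs wait in the queue as usual.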

At the University of Leeds there are 2 levels of fair share scheduling, which is too complex to show in this example. Everyone in a faculty has the same initial priority and is treated in a way similar to the example above. However, some faculties provide more funding than others and therefore have a greater share. The fair share policy and levels are set by the ARC management group and cannot be altered by the system’s support team.

There are some features of the scheduler, the fair share policy and backfilling that users often notice:

  1. There are 2 levels of fair share on the ARC systems and this means:
    1. The busier your faculty (that is, the more jobs submitted by everyone in it), the longer you have to wait in the queue.
    2. If you submit lots of jobs, your subsequent jobs will have a lower priority and you will have to wait longer in the queue, as will the others in your faculty, for the next 1 or 2 weeks.
    3. Jobs submitted by people from another faculty might start before yours.
  2. There is no test queue on the ARC service, so the best way around this is to submit small test jobs to the queue. The smaller the job, the less time it will wait in the queue and the less impact it will have on your priority.