Using the GPU nodes

ARC3 has a number of NVIDIA GPU nodes which can be used independently or in combination.

There are two K80 nodes, which contain:

  • Two NVIDIA K80 GPU cards
  • Two 12-core CPUs (24 cores in total)
  • 128GB of memory

There are six P100 nodes, which contain:

  • Four NVIDIA P100 GPU cards
  • Two 12-core CPUs (24 cores in total)
  • 256GB of memory

Programming NVIDIA GPUs

A variety of methods may be used to program an NVIDIA GPU. Some may have more potential for performance, while others may be easier to program – or add to an existing code.

There are broadly two categories: native (writing code specifically tailored for GPUs) and annotating existing code (such that the program, or parts of it, can run on either a CPU or a GPU).

Native – CUDA/OpenCL

The native method of programming an NVIDIA GPU is using a set of tools called CUDA.

CUDA contains its own dialect of the C programming language and a number of optimised libraries for a variety of purposes. Although CUDA code can be highly optimised for NVIDIA cards, it will not run on GPUs from other manufacturers (e.g. AMD) or on a computer without a GPU available.

Alternatively, there is a library called OpenCL that provides an agreed standard C interface to GPUs in general (and to more exotic hardware, such as FPGAs from one manufacturer). It does not require a special dialect of C and is portable between different GPUs, but it is unclear how well it is supported on NVIDIA GPUs.

Due to the very large amount of parallelism available within a GPU, programming well in CUDA or OpenCL typically involves adopting an asynchronous model, in which the programmer overlaps multiple data transfers between the CPU and the GPU with multiple different sets of calculations, all with the aim of maximising memory bandwidth and keeping all those CUDA cores busy. It also involves paying close attention to the various caches, by reusing data that has recently been accessed.

It is these methods that can realise the full performance potential of a GPU; however, they clearly represent a large investment of time and effort.

Code annotation – OpenACC/OpenMP

An ordinary C, C++ or Fortran code may be made to run on a GPU by using a compiler with that capability, provided the code is annotated with special Fortran comments or C/C++ pragmas that give the compiler information on how the code can be parallelised and what memory it needs.

This can be done with either of two open standards: OpenACC, or OpenMP (which can also parallelise code to run on a multi-core CPU).

Although these methods are unlikely to realise a GPU's full performance, they are far simpler from a programmer's perspective than programming directly in CUDA or OpenCL.

GNU GCC 7 and higher is installed with NVIDIA GPU OpenACC and OpenMP support on our services containing GPUs. We are still evaluating it, so please don’t rely on this functionality without letting us know that you are using it.

Details on GCC’s support can be found here:

OpenACC notes and example

Note: avoid using the OpenACC directive acc kernels with GCC, as GCC is not very good at automatically parallelising code; please use acc parallel loop instead.

This is an example code:

Compile with:

gcc -fopenacc -o prog prog.c

OpenMP notes and example

This is an example code:

Compile with:

gcc -fopenmp -o prog prog.c

Python

Profiling code running on GPUs