Tutorial on Parallel Computing

OpenMP

Fortran and C OpenMP example codes can be found in ~/GS/*/OpenMP, where * stands for the language of your choice (Fortran or C). The program takes an image that has undergone edge detection and reconstructs the original image iteratively. It is initially set to perform 1000 iterations. Compile and run the code using full optimisation:

With the Intel compiler, for Fortran use:


$ ifort -O2 -xSSE4.2 -openmp -o image image.f90 

or for C use:


$ icc -O2 -xSSE4.2 -openmp -o image image.c 

Time the execution of this code on a single processor by launching it under the time command. The final residue figure gives a measure of convergence: the closer it is to zero, the better converged the reconstruction. The reconstructed image is written to the file finalimage.pgm, which can be viewed with the graphics program display.

The number of parallel threads is set through the OMP_NUM_THREADS environment variable. Increase the number of parallel threads to 2 and rerun the code. To set the number of threads to 2 via the environment variable, use:


$ export OMP_NUM_THREADS=2

and to unset/delete the variable use:


$ unset OMP_NUM_THREADS

NOTE: When running the Intel-compiled Fortran code on the login node, you may get a segmentation fault. This is due to the way the compiler handles the stack when using OpenMP. You can resolve the issue by removing the stack size limit with ulimit -s unlimited on the command line. This does not happen when submitting via the queues, because the stack size is unlimited by default on the compute nodes.

To compile the above code with the Portland Group Fortran compiler:

 
$ pgf90 -fastsse -mp -o image image.f90 

or for C:


$ pgcc -fastsse -mp -o image image.c 

OpenMP Job Submission

Below is a script that will launch an OpenMP parallel job:


#!/bin/sh 
# Export variables and use current working directory 
#$ -cwd -V 
# Request 1 hour of runtime 
#$ -l h_rt=1:00:00 
# Request 4 CPU cores (threads)
#$ -pe smp 4 
./image 

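Use this script to obtain accurate timing information for running this code on 1, 2, 4, 6 and 8 CPU cores with both the Intel and Portland Group compilers, changing the number of cores requested on the -pe smp line for each run. Because -V exports your current environment to the job, make sure that OMP_NUM_THREADS, if set, matches the number of cores requested.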

MPI

MPI is available via wrapper scripts, which call the relevant compiler and supply the necessary include paths and libraries. Different wrapper scripts are available depending on the choice of compiler and MPI library.

The MPI wrapper scripts are named mpif77 (Fortran 77), mpif90 (Fortran 90), mpicc (C) and mpiCC (C++).

Compiling MPI

All compiler options applicable to the compiler being invoked are available to the wrapper scripts. Use the MPI wrappers to compile the code and link with the standard MPI library; for Intel Fortran:


$ mpif90 -o pip pip.f90 

or for C:


$ mpicc -o pip pip.c 

To launch the code, use the mpirun launcher. This takes an option -np <n>, where n is the number of processes to be launched. Execute and time the code for 1, 2 and 4 processes; e.g. for 2 processes use:


$ mpirun -np 2 ./pip

The output from this program is written to the file output.pgm, which can be viewed with the display command, for example:


$ display output.pgm

To use the PGI compilers, first switch to the PGI module with module switch intel pgi. Then for PGI Fortran use:

 
$ mpif90 -fastsse -o pip pip.f90 

or for C:


$ mpicc -fastsse -o pip pip.c 

You can then run the program in the same way as described above.

MPI Job Submission

The system is set up so that jobs are placed optimally by default. This means that the cores are selected from the available resources so as to minimise the number of communication hops between the processes. This reduces latency and should improve program performance.

Below is a script that will launch an MPI parallel job:


#!/bin/sh 
# Use Bourne shell
# Export variables and use current working directory 
#$ -cwd -V 
# Request 1 hour of runtime 
#$ -l h_rt=1:00:00 
# Request 4 CPU cores (processes) 
#$ -pe ib 4 
mpirun ./pip

Use this script to obtain accurate timings for 1, 2, 4, and 8 processes. If you have time, repeat the exercise with the PGI compiler as well as the Intel compiler.