TAU

TAU (Tuning and Analysis Utilities) can be used to help you understand what your program is doing by profiling, tracing and sampling an application. It is able to instrument codes using a variety of parallel programming models (e.g. serial/no parallelism, MPI, OpenMP, pthreads).

This page is a tour of some of the functionality – please see the TAU website for more details:

http://tau.uoregon.edu/

Setting the module environment

When you log in, you should ensure that your module list matches the one used to compile the program you wish to investigate. At minimum, this should include a compiler module (a set of Intel compilers are loaded by default); if the program uses MPI, it should also include an MPI module (a version of OpenMPI is loaded by default).

Once done, loading the tau module will make available a version appropriate to your selection.

For example, if you would like to use the GNU GCC compilers and Intel’s MPI implementation, the sequence would be something like the following (module names vary between systems; check module avail for the exact names):
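    module load gcc
    module load intel-mpi
    module load tau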

This will add the packages appropriate for your chosen environment.

Profiles

Working with TAU generally centres around (optionally) preparing a program for instrumentation, generating a profile during program execution, then examining the profile using a command-line or graphical tool.

By default, only a single metric is collected – the amount of time the application spends in its different phases of execution – but TAU can be asked to collect more, like information on how the CPU is performing during those phases (e.g. the number of data cache misses and floating point operations). This can be configured via the TAU_METRICS environment variable (see below).

If a single metric is collected, the profile consists of a number of files named profile.X.Y.Z (X, Y, Z are numbers – one file per thread or MPI rank). If several metrics are collected, a set of profile.X.Y.Z files is created per metric and placed into subdirectories (names are prefixed MULTI__ ).

Warning! If you switch between single and multiple metric collection, the tools to examine the profiles can get confused. If you have collected a single metric, ensure you have no directories with names starting with MULTI__ in your current working directory. If you have collected multiple metrics, ensure you have no files named profile.X.Y.Z in your current working directory.

Generating profiles

TAU can be used without recompiling an application, although not all functionality is available unless it is recompiled:

mpirun tau_exec <program> Instrument an MPI application from within a job script
tau_exec -T openmp <program> Instrument an OpenMP application
tau_exec -T pthread <program> Instrument a threaded application
tau_exec -T serial <program> Instrument a non-parallel application

(execute tau_exec -h for more details)

By default, this basic method will give enough information to understand how an application splits its time between computation and the different types of communication it makes. Note that generating a profile of a computationally intensive program is best done via a batch job.

Please see below for details on how to generate a profile of a Python program.

To gain insight into what the application is doing when it isn’t communicating, tau_exec can be asked to collect more information:

  • Additional metrics (see below), e.g. allowing you to derive an aggregate speed (Tflop/s) figure for the application
  • Periodic sampling of the application (see below), which lets TAU estimate where the application is spending its time. This is most useful if the program has been compiled with -g and the source code is available in the current working directory, in which case the profile will point at specific lines. Bear in mind the result is only an estimate.

However, to gain the full functionality of TAU, the program needs to be prepared. If the source code is available, we recommend that the program is recompiled:

  • First, instruct TAU on how to instrument it via the TAU_MAKEFILE and TAU_OPTIONS environment variables (see below)
  • Second, compile the code by replacing the normal compiler command with tau_cc.sh (C), tau_f90.sh (Fortran 90 or later), tau_cxx.sh (C++) or tau_f77.sh (Fortran 77)
  • Third, run the program as normal. The profile will be generated in the current working directory.
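For example, for an MPI program written in C (the file and program names below are illustrative):

    export TAU_MAKEFILE=$TAU/Makefile.tau-papi-mpi-pdt
    tau_cc.sh -o myprog myprog.c     # use tau_cc.sh in place of the usual MPI C compiler wrapper
    mpirun ./myprog                  # run as normal; profile files appear in the working directory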

If the program source code is unavailable, consider trying TAU’s new capability to rewrite your original program binary to add instrumentation:

  • tau_rewrite <program> <new program name>
  • Run <new program name> in the same way <program> would be normally run. The profile will be generated in the current working directory.

Setting the TAU_MAKEFILE environment variable

When setting this environment variable, please consider the method (or lack) of parallelisation used by the program, then set it to point to one of the Makefiles in the $TAU directory:

TAU_MAKEFILE Functionality
$TAU/Makefile.tau-papi-pdt Instrument serial programs
$TAU/Makefile.tau-papi-mpi-pdt Instrument MPI programs
$TAU/Makefile.tau-papi-pdt-openmp-opari Instrument OpenMP programs
$TAU/Makefile.tau-papi-pthread-pdt Instrument multi-threaded programs
$TAU/Makefile.tau-papi-mpi-pdt-openmp-opari Instrument hybrid MPI/OpenMP programs
$TAU/Makefile.tau-papi-mpi-pthread-pdt Instrument multi-threaded MPI programs

For example, to instrument an MPI job, please execute:
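    export TAU_MAKEFILE=$TAU/Makefile.tau-papi-mpi-pdt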

Careful! Like many of the libraries installed on our systems, there is a separate copy of TAU for each compiler and MPI combination. If you change your compiler or MPI modules while you are working with TAU, the TAU environment variable will update accordingly. If TAU_MAKEFILE points to a file within one of our copies of TAU, it will also be automatically updated (unless the TAU_LEAVE_MAKEFILE_ALONE variable is set).

Note: there may be some Makefiles in the $TAU directory which include the strings icpc or pgi in their filenames. Please ignore them: this is TAU’s way of supporting multiple compilers in a single installation. On our systems, it has been arranged that the above Makefile names always point to the version for the currently loaded compiler module.

Additionally, the following options are available to generate MPI traces directly in OTF2, a format that vampirtrace understands:

TAU_MAKEFILE Functionality
$TAU/Makefile.tau-papi-mpi-pdt-scorep Instrument MPI programs
$TAU/Makefile.tau-papi-mpi-pdt-openmp-opari-scorep Instrument hybrid MPI/OpenMP programs
$TAU/Makefile.tau-papi-mpi-pthread-pdt-scorep Instrument multi-threaded MPI programs

Note: if you are using the Intel compilers and Makefile.tau-papi-mpi-pdt-openmp-opari does not work properly, the following Makefiles offer an alternative method of instrumenting Intel OpenMP codes, although this should not normally be required:

TAU_MAKEFILE Functionality
$TAU/Makefile.tau-papi-ompt-pdt-openmp Instrument OpenMP programs
$TAU/Makefile.tau-papi-ompt-mpi-pdt-openmp Instrument hybrid MPI/OpenMP programs

Setting the TAU_OPTIONS environment variable

TAU_OPTIONS is optional and, if set, should be a space-separated list of options. See the TAU documentation for a full list, but common ones are:

Option Functionality
-optCompInst If TAU is unable to compile the code, set this. It uses compiler-based instrumentation for everything instead of the default, lower-overhead PDT method (the PDT method copies, parses and modifies the source, but can get confused)
-optPreProcess Put files through the preprocessor before instrumentation
-optHeaderInst Use if there is code in the program's header files that needs to be instrumented

Example:
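    export TAU_MAKEFILE=$TAU/Makefile.tau-papi-mpi-pdt
    export TAU_OPTIONS="-optPreProcess -optHeaderInst"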

Python

Python 2.x programs can be profiled with a small amount of effort. First, do not load the tau module. Instead, please load a python module and python-libs/2.4.0 (or later). This will make available a copy of TAU built against that version of Python. Additionally, you do not need a compiler or MPI module loaded.

The program to be instrumented should be encapsulated in a wrapper, similar to the following:
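    #!/usr/bin/env python
    import tau

    def OurMain():
        # Replace the lines below with a call into the rest of the python
        # program to be profiled (module and function names here are hypothetical)
        import myprogram
        myprogram.main()

    # Run OurMain() under TAU's control so that the calls it makes are instrumented
    tau.run('OurMain()')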

The definition of OurMain should be filled out with a call to the rest of the python program to be profiled. Executing the wrapper script will generate a TAU profile and paraprof will be able to understand line numbers etc.

If it is an MPI program, for example one using the mpi4py library, please execute it from a job script with something like the following (the exact invocation may vary between TAU installations):
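    mpirun tau_exec python <program>     # <program> is the wrapper script, or the original python program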

Here, if <program> is a wrapper, the TAU profile will contain function calls, line numbers, MPI communication details, etc. If it is not, it will contain just the MPI communication details.

Examining a profile

Once you have a profile (a set of profile.X.Y.Z files, or MULTI__* directories containing them), it can be examined using the pprof (text-based) and paraprof (X Window based) commands. Although generating profiles is best done through the batch queues, we recommend that they are examined using the login nodes.

pprof will show information in tabular format, but we will concentrate on paraprof here – please ensure you have an X server running on your local workstation, and X forwarding is enabled in the ssh client used to access the HPC machine.

When paraprof is first run, it should detect any single or multiple metric profile in the current working directory, and display the available metrics to examine in the ParaProf Manager, e.g. a profile of one of the examples distributed with TAU, run over 4 MPI processes:

The TAU ParaProf Manager

Double-clicking a metric (e.g. TIME) or right-clicking and selecting Show metric in a new window will show values for that metric during various phases of the application for the application as a whole, or for each MPI rank/thread. Hovering over part of a bar chart will show a floating window with information:

The TAU ParaProf metric view

Clicking on one of the titles on the left hand side (Mean, Max, node 0 etc.) will result in a new window with more detail for the metric for that title, e.g. for node 1:

The TAU ParaProf node view

Right clicking on a bar and selecting Show Source Code takes you to the relevant subroutine/function:

The TAU ParaProf Source Browser

To look in greater depth at the communication statistics between threads/ranks, right click on one of the titles on the left hand side (Mean, Max, node 0 etc.) of the metric window and select Show Context Event Window. If the application has been recompiled with TAU, the statistics will be broken down by function/subroutine (to see what the call paths to those communications are, set TAU_CALLPATH_DEPTH=<num>, where <num> is the number of levels to see).
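For example:

    export TAU_CALLPATH_DEPTH=10     # 10 is an arbitrary choice; pick a depth appropriate to your code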

Altering the information collected in a profile

Collecting additional metrics

By default, TAU collects the TIME metric, which records how much time the application spends in each part of its execution. Various other metrics can also be collected, the most significant of which are the hardware performance counters made available via PAPI.

For example, the PAPI_FP_INS metric will collect how many floating point instructions were made. PAPI_L1_DCM will collect how many level 1 CPU cache misses there were – there is normally a large number of these, but a much larger number can indicate inefficient ordering of array indices when looping through multidimensional arrays.

A derived metric of PAPI_FP_INS / TIME can be created to show how fast particular functions are in flops (floating point operations per second). When sorted by TIME, TAU can easily show what parts of the code occupy the most time but are not computationally efficient.

For example, to collect floating point instruction counts alongside the default timing information, set TAU_METRICS to a colon-separated list of metrics before generating the profile:
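    export TAU_METRICS=TIME:PAPI_FP_INS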

Once the profile has been collected, the ParaProf Manager window should show both metrics. Select menu item Options->Show Derived Metric Panel, click PAPI_FP_INS, divide, TIME, then Apply.

Double-click on the new metric just created to open a new window showing the metric. Click on one of the titles down the left hand side (e.g. node 1) to open a detailed window, then select menu item Sort by...->Exclusive...->TIME

Notes regarding PAPI (hardware performance counters)

A list of PAPI counters and brief description can be obtained by the commands:
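    module load papi          # needed on some platforms (see note below)
    papi_avail                # list the standard (preset) counters
    papi_native_avail         # list additional platform-specific (native) counters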

(note: on some platforms, the papi_avail command will abort with an error, unless the papi module is loaded first)

Sampling (find expensive lines of code)

By enabling sampling, TAU can estimate which lines of code are the most significant for a given metric, like time spent, allowing the problem areas of a subroutine or function to be found. This adds various (SAMPLE) bars to the metric window, which can be right-clicked (select Show Source Code) to highlight the important lines.

This option effectively asks TAU to make its best guess on what individual lines of code are significant. Sampling can be enabled by setting the following environment variable prior to generating the profile:
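    export TAU_SAMPLING=1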

Alternatively, if tau_exec is being used, add its -ebs (event based sampling) option.

This works by TAU periodically waking up and checking what the application is doing at that time.

Instrument only a subset of routines

(TBD)

Instrument loops

(TBD)

Summary of MPI communication patterns

Separately from MPI tracing, TAU can collect some aggregate figures for the non-collective MPI communications made by an application.

To collect, set the following environment variable prior to generating the profile:
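    export TAU_COMM_MATRIX=1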

To examine this part of the profile, open a metric window in paraprof and select menu item Windows->Communications Matrix

Manipulating profiles

Managing a profile consisting of several files and/or directories can be difficult, especially if you want to move it to a different computer (perhaps a local workstation with better graphics performance than running paraprof remotely on an HPC machine); however, the profile can be packed into a single file, which makes things easier.

e.g. to create a single file saved_profile.ppk with the whole profile:
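    paraprof --pack saved_profile.ppk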

The original profile files and directories can then be safely deleted. To load a packed profile into paraprof, supply the filename as an argument, e.g.
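    paraprof saved_profile.ppk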

It can be useful to collect several profiles over time, perhaps to see how an algorithm improves over the days or weeks it is being worked on. For this, TAU offers a database for storing and comparing different profiles, called taudb. Please refer to the TAU website for details on using it.

MPI Tracing

TAU can create a log of all communications made by an MPI program, for viewing in a package like jumpshot, paraver or vampirtrace.

To create, set environment variable TAU_TRACE (TAU_PROFILE may also be necessary, if you also wish to continue generating the profile files):
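    export TAU_TRACE=1
    export TAU_PROFILE=1     # optional: continue generating the profile files as well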

Then generate a profile using the normal method, either with tau_exec or with a copy of the application built with TAU. A number of events.X.edf files will be created (X is a number), one per rank. Combine and convert into an SLOG2 file, ready for viewing with the following commands:
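    tau_treemerge.pl                          # merge the per-rank trace files into tau.trc and tau.edf
    tau2slog2 tau.trc tau.edf -o tau.slog2    # convert the merged trace to SLOG2, viewable with jumpshot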

Assuming all the functionality works, it should be possible to collect trace information once, then convert to examine in different trace viewer applications:

  • The tau2otf2 utility should be able to convert the trace to a format suitable for the commercial package VampirTrace (if available).
  • The tau_convert utility should be able to convert the trace to a format that can be examined in paraver (if available).

Alternatively, TAU should be able to create trace files directly in VampirTrace’s format, if you have a tool available that can read it. Set the following environment variable (believed to be the standard Score-P switch for the scorep-based builds):
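    export SCOREP_ENABLE_TRACING=true        # assumption: Score-P based TAU builds use this standard Score-P variable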

Then recompile the application with a Makefile with scorep in the name, and generate a new profile. A new scorep directory will be created, containing the trace.

Install local copy of TAU / jumpshot

Using graphical applications from an ssh session can sometimes be slow, particularly if there is a lot of data to analyse. If this is the case, it may be easier to install the TAU and jumpshot user interfaces on a Linux desktop. This can be done by downloading and building TAU on the local Linux machine, along the following lines (the version, architecture directory and install prefix below are examples only):
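    wget http://tau.uoregon.edu/tau.tgz
    tar xzf tau.tgz
    cd tau-2.*                                # the extracted directory name depends on the version downloaded
    ./configure -prefix=$HOME/tau
    make install
    export PATH=$HOME/tau/x86_64/bin:$PATH    # the architecture directory (here x86_64) may differ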

The paraprof and jumpshot commands should now work on your Linux desktop. To make this permanent, please edit your .bash_profile (assuming you are using the normal default shell) to include a line adding the appropriate directory to the PATH environment variable.