Compiling C, C++ and Fortran code¶

On Apocrita we provide a number of compilers and interpreters for popular programming languages. You can use these to build and run your own project code. We also provide programs and software components ready for you to use, but you can use the compiler tools to build these yourself.

This page focuses on the C, C++ and Fortran languages which are the most common compiled languages in use on the cluster. Other documentation pages exist for: Java, Julia, Python, R and Ruby.

Bare Metal vs Spack

These instructions will suffice for simple cases but, given the subject matter, such cases are few and far between. For more portable, reproducible and shareable results, you should consider working through the documentation on custom Spack scopes and Spack environments.

Available compilers¶

A number of compiler suites, each offering C, C++ and Fortran compilers, are available on Apocrita:

GCC
Intel (part of Intel OneAPI)
NVIDIA HPC SDK

Within a compiler suite the provided C compiler is a companion processor to the Fortran compiler in the sense of C interoperability.

The compilers are available via modules. One version of the GCC compilers will be available without loading a module, but this will be an earlier version than offered through the module system. It is preferable to load the module for the latest release of each compiler. Depending on your code and libraries, careful choice of compiler may provide considerable performance improvements.

Compilation should be performed in job submissions or interactively via qlogin, so as not to impact the frontend nodes for other users. Code should be compiled on the same machine architecture as it will run on, so apply the appropriate node selection to these job requests. Also ensure that the modules loaded at compile time and at run time match, otherwise the program may fail to start or report mismatched shared library versions.

Loading a compiler module¶

It is generally a good idea to be specific with your compiler version. Check which modules you have loaded to be sure you have the right compiler and that there are no conflicts.

Check the available version for the GCC compiler suite:

$ module avail gcc
gcc/12.2.0

For Intel:

$ module avail intel
intel-classic/2021.10.0 intel-mkl/2024.1.0 intel-mpi/2021.12.1
intel-tbb/2021.9.0-gcc-12.2.0 intel/2023.2.4 intel/2024.1.0

Intel compiler version 2024.1.0 can be loaded with the command

module load intel/2024.1.0

You can test this by typing the command:

icx -V

This should return a short message reporting the compiler version:

Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64,
Version 2024.1.0 Build 20240308
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.

Often, you will require other libraries and headers that can be found in other modules. Unlike modules which provide many programs and tools, these library modules may be specific to a particular compiler suite. For example, for Open MPI and Intel MPI:

$ module avail openmpi
openmpi/5.0.3-gcc-12.2.0
$ module avail intel-mpi
intel-mpi/2021.12.1

Check your loaded modules with

module list

If you don't specify a particular version, the version marked as default in the output of the module avail command will be loaded.

Using the compilers¶

Each of the compiler suites provides a C, C++ and a Fortran compiler. The name of the compiler command varies with the language and the compiler suite. For convenience the compiler suite modules set consistent environment variables by which the compilers may be referenced. The compiler names and variables are given in the following table:

Language	Variable	GCC	Intel	NVIDIA
C	CC	gcc	icx	nvc
C++	CXX	g++	icpx	nvc++
Fortran	FC	gfortran	ifx	nvfortran

As an example, we shall consider the problem of Buffon's needle. If a needle is dropped on a surface of parallel lines, such that the line separation is twice the needle's length, the probability of it crossing the lines is the reciprocal of Pi. Thus, Pi can be estimated with a simple Monte-Carlo integration. We shall run 48 million trials, in order to occupy a CPU for a noticeable length of time.
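The stated probability follows from a short integral. As a sketch, write ℓ for the needle length and d for the line separation (symbols introduced here for illustration, with ℓ ≤ d). Let x be the distance from the needle's centre to the nearest line, uniform on [0, d/2], and θ the needle's angle from the perpendicular to the lines, uniform on [-π/2, π/2]. The needle crosses a line when x ≤ (ℓ/2) cos θ, so:

```latex
P(\text{cross})
  = \frac{2}{d}\cdot\frac{1}{\pi}
    \int_{-\pi/2}^{\pi/2}\int_{0}^{d/2}
      \mathbf{1}\!\left[x \le \tfrac{\ell}{2}\cos\theta\right] dx\, d\theta
  = \frac{2}{\pi d}\int_{-\pi/2}^{\pi/2}\frac{\ell}{2}\cos\theta\, d\theta
  = \frac{2\ell}{\pi d}
```

With d = 2ℓ this gives 1/π, so the ratio of trials to hits estimates Pi. The programs below use ℓ = 2 and d = 4.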

buffon.f90

A possible implementation in Fortran90 might be:

program buffon
    use iso_fortran_env
    implicit none
    integer(kind=int32) :: trials, hits, i
    real(kind=real64) :: pi_ref, h_pi, rnd, pos, cos_theta, result
    ! Double-precision reference value of Pi
    pi_ref = 4.0_real64 * atan(1.0_real64)
    h_pi = pi_ref / 2.0_real64
    trials = 48000000
    hits = 0
    ! Needle of length 2 dropped between lines 4 apart
    do i = 1, trials
        call random_number(rnd)
        pos = 4 * rnd                           ! position of the needle's centre
        call random_number(rnd)
        cos_theta = cos(pi_ref * rnd - h_pi)    ! half-extent across the lines
        if (pos .lt. cos_theta .or. pos .gt. 4.0 - cos_theta) hits = hits + 1
    end do
    ! Convert in double precision so the estimate is not truncated
    result = real(trials, real64) / hits
    print "(a,f12.10,f8.3,a)", "Estimated Pi ", result, 100 * result / pi_ref, "%"
end program buffon

For Fortran with the GNU compilers:

$ module load gcc
$ gfortran -o buffon buffon.f90
$ time ./buffon
Estimated Pi 3.1422896385 100.022%

real    0m1.804s
user    0m1.777s
sys     0m0.004s

For Fortran with the Intel compilers:

$ module load intel
$ ifx -o buffon buffon.f90
$ time ./buffon
Estimated Pi 3.1410064697  99.981%

real    0m1.085s
user    0m1.082s
sys     0m0.001s

buffon.c

A possible implementation in C might be:

#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    int hits, throws;
    double h_pi, pos, cos_theta, result;

    srand(time(NULL));          /* seed the generator from the clock */
    throws = 48000000;
    h_pi = M_PI / 2;
    hits = 0;

    /* Needle of length 2 dropped between lines 4 apart */
    for (int i = 0; i < throws; i++) {
        pos = (4.0 * rand()) / RAND_MAX;
        cos_theta = cos((M_PI * rand()) / RAND_MAX - h_pi);
        /* logical OR: the needle crosses either the lower or the upper line */
        if (pos < cos_theta || pos > 4.0 - cos_theta) hits++;
    }
    result = (double) throws / hits;
    printf("Estimated Pi %12.10f %8.3f%%\n", result, 100 * result / M_PI);
    return 0;
}

For C with the GNU compilers, remembering to link the standard maths library:

$ module load gcc
$ gcc -o buffon buffon.c -lm
$ time ./buffon
Estimated Pi 3.1412234306   99.988%

real    0m2.088s
user    0m2.082s
sys     0m0.002s

For C with the Intel compilers:

$ module load intel
$ icx -o buffon buffon.c
$ time ./buffon
Estimated Pi 3.1420614719  100.015%

real    0m1.499s
user    0m1.493s
sys     0m0.003s

Deprecated Intel compilers and MPI wrappers¶

Intel oneAPI has deprecated the icc, icpc, and ifort compilers, but they are still available by loading the intel/2023 module.

Using GPU nodes with OpenMP¶

On Apocrita we support offloading to GPU devices using OpenMP with GCC. If you have access to the GPU nodes you can compile and run appropriate OpenMP programs, such as those using the target construct, as described below.

OpenMP device offload with GCC compilers¶

OpenMP target offload should be automatically enabled when OpenMP compilation is selected with the -fopenmp compiler option. For example, to compile the source file offload-example.c, which uses the target construct, you can use:

module load gcc/12.2.0
gcc -fopenmp offload-example.c

The option -foffload=-lm may also be required, to make the maths library available on the target device. If you see an error message like

unresolved symbol sqrtf
collect2: error: ld returned 1 exit status
mkoffload: fatal error: x86_64-pc-linux-gnu-accel-nvptx-none-gcc returned 1 exit status
compilation terminated.

then you will need to provide this option when compiling.

Although it is not necessary to compile the code on a GPU node to enable GPU offload, using the node on which you wish to run is advised when compiling.

An OpenMP program compiled with offload enabled can be run in the same way as with other programs. Offload happens automatically if a GPU is available when a target construct is entered.

To disable offload so that the code with a target construct is run on the CPU host instead of the GPU device, compile the program with -foffload=disable. Equally, the code can be compiled without the -fopenmp option if OpenMP is not required.

libgomp loader warnings on non-GPU nodes

If you run an OpenMP program with offload target regions on a node without a GPU you may see a warning like:

libgomp: while loading libgomp-plugin-nvptx.so.1: libcuda.so.1: cannot open shared object file: No such file or directory

These warnings occur because we provide a single compiler build to work on all node types. Compiling programs with -foffload=disable will not avoid such warnings. However, affected target regions will still run on the host CPU and the warnings can be safely ignored.

Build systems¶

Typically, software for Linux comes with a build system in one of two flavours: GNU Autotools or CMake. Each of these typically uses the Make tool at a lower level.

On Apocrita the GNU Autotools system can be used without loading a module, although it may be necessary to load an autotools-archive module to support some additional macros. To use CMake it is necessary to load a cmake module.

For a project using GNU Autotools the general steps to build are as follows:

./configure [options]
make

First one runs a configuration command which creates a Makefile. One then runs the make command that reads the Makefile and calls the necessary compilers, linkers and such.

CMake is similar but, as well as supporting Makefiles, it can also generate build files for Visual Studio projects, macOS Xcode projects and more. Projects using CMake can be identified by the presence of a CMakeLists.txt file.

GNU Autotools and CMake support out-of-tree source builds. Put another way, one can create a binary and all its associated support files in a directory that is not the same as the one with the source files. This can be quite advantageous when working with a source management tool like Git or SVN or when building the project supporting several different configurations, such as for debugging or targeting different node types.

To work with CMake with an out-of-tree build, start with creating a build directory in a different location:

$ pwd
/data/home/abc123/MySourceCode
$ mkdir ../MySourceCode_build
$ cd ../MySourceCode_build
$ cmake ../MySourceCode

Essentially, you enter the build directory and call cmake with the path to the directory containing your CMakeLists.txt file. If you wish to re-configure your build, you can use the program ccmake.

The end result is a Makefile. So to complete your build you type:

make

just as you would with the GNU Autotools setup.

Similarly, to use an out-of-tree build with GNU Autotools:

$ pwd
/data/home/abc123/MySourceCode
$ mkdir ../MySourceCode_build
$ cd ../MySourceCode_build
$ ../MySourceCode/configure

To learn more about GNU Autotools, CMake, and Makefiles follow the links below

Optional libraries for HPC¶

MPI¶

The Message Passing Interface (MPI) is a standard for parallel computation by message passing, often used in HPC applications. On Apocrita two distinct implementations are available: Intel MPI and Open MPI.

The module system allows the user to select the implementation of MPI to be used, and the version. With Open MPI, as noted above, one must be careful to load a module compatible with the compiler suite being used.

To load the default (usually latest) Intel MPI module:

module load intel-mpi

To set up the Open MPI environment, version 5.0.3, suitable for use with the GCC compiler suite:

module load openmpi
module load gcc

For each implementation, several versions may be available. The default version is usually set to the latest release: an explicit version number is required to load a different version.

Default module for Open MPI

Running module load openmpi without a version loads the default Open MPI module, which is openmpi/5.0.3-gcc-12.2.0. This default is specific to the GCC compiler suite, so to use an MPI implementation compatible with a different compiler suite you must give a specific module name.

To build a program using MPI it is necessary for the compiler and linker to be able to find the header and library files. As a convenience, the MPI environment provides wrapper scripts to the compiler, each of which sets the appropriate flags for the compiler. The name of each wrapper script depends on the implementation and the target compiler.

Open MPI¶

For each Open MPI module, and the implementation provided by the NVIDIA compiler suite module, the wrapper scripts are consistently named for each language. These are given in the table below:

Language	Script
C	mpicc
C++	mpic++
Fortran	mpif90

buffon_mpi.f90

A possible MPI implementation in Fortran90 might be:

program buffon
    use iso_fortran_env
    use mpi
    implicit none

    integer(kind=int32) :: trials, local, hits, i
    integer(kind=int32) :: rank, mpisize, mpierr
    real(kind=real64) :: pi_ref, h_pi, rnd, pos, cos_theta, result

    call MPI_INIT(mpierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, mpisize, mpierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, mpierr)

    ! Double-precision reference value of Pi
    pi_ref = 4.0_real64 * atan(1.0_real64)
    h_pi = pi_ref / 2.0_real64
    ! Each process runs an equal share of the trials
    trials = 48000000 / mpisize
    local = 0

    do i = 1, trials
        call random_number(rnd)
        pos = 4 * rnd
        call random_number(rnd)
        cos_theta = cos(pi_ref * rnd - h_pi)
        if (pos .lt. cos_theta .or. pos .gt. 4.0 - cos_theta) local = local + 1
    end do

    ! Sum the per-process hit counts onto rank 0
    call MPI_Reduce(local, hits, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, mpierr)

    if (rank .eq. 0) then
        result = real(trials * mpisize, real64) / hits
        print "(a,f12.10,f8.3,a)", "Estimated Pi ", result, 100 * result / pi_ref, "%"
    end if

    call MPI_FINALIZE(mpierr)
end program buffon

Returning to the Buffon's needle example, a Fortran MPI program may be compiled and run on Open MPI with the requested number of cores with:

$ echo $NSLOTS
4
$ module load openmpi/5.0.3-gcc-12.2.0
$ mpif90 -o buffon_mpi buffon_mpi.f90
$ time mpirun -np $NSLOTS ./buffon_mpi
Estimated Pi 3.1422548294 100.021%

real    0m1.874s
user    0m2.408s
sys     0m0.735s

(Any detailed discussion of the MPI bindings is beyond the scope of this document, but sharing the trials across multiple processes and combining the counts with MPI_Reduce is an obvious approach. See the courses offered by HPC-UK and by facilities such as ARCHER2.)

The Open MPI wrapper scripts provide an option -show which details the final invocation of the compiler:

$ module load openmpi/5.0.3-gcc-12.2.0
$ mpif90 -show -o hello hello.f90
gfortran -o hello hello.f90 ...

buffon_mpi.c

A possible MPI implementation in C might be:

#include <stdlib.h>
#include <mpi.h>
#include <time.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int rank, mpisize, hits, local, throws;
    double h_pi, pos, cos_theta, result;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &mpisize);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Offset the seed by rank so that processes started in the same
       second do not all generate the same sequence */
    srand(time(NULL) + rank);
    throws = 48000000 / mpisize;    /* equal share of trials per process */
    h_pi = M_PI / 2;
    local = 0;

    for (int i = 0; i < throws; i++) {
        pos = (4.0 * rand()) / RAND_MAX;
        cos_theta = cos((M_PI * rand()) / RAND_MAX - h_pi);
        if (pos < cos_theta || pos > 4.0 - cos_theta) local++;
    }

    /* Sum the per-process hit counts onto rank 0 */
    MPI_Reduce(&local, &hits, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        result = (double) throws * mpisize / hits;
        printf("Estimated Pi %12.10f %8.3f%%\n", result, 100 * result / M_PI);
    }

    MPI_Finalize();
    return 0;
}

To compile and run with the appropriate number of cores in C on Open MPI, remembering to link the standard maths library:

$ echo $NSLOTS
4
$ module load openmpi/5.0.3-gcc-12.2.0
$ mpicc -o buffon_mpi buffon_mpi.c -lm
$ time mpirun -np $NSLOTS ./buffon_mpi
Estimated Pi 3.1413266659   99.992%

real    0m1.381s
user    0m2.814s
sys     0m0.803s

No Open MPI module is provided for use with the NVIDIA compiler suite. Instead, the installed NVIDIA compiler environment provides an Open MPI implementation and the NVIDIA compiler module contains the appropriate settings:

$ module purge; module load nvidia-hpc-sdk/24.5
$ type mpif90
mpif90 is /share/apps/rocky9/general/apps/nvidia-hpc-sdk/2024_245/Linux_x86_64/24.5/comm_libs/mpi/bin/mpif90

Intel MPI¶

In contrast, the Intel MPI implementation supports both the Intel and GCC compiler suites in the same module. As with Open MPI, wrapper scripts are provided, but their names depend on the target compiler suite as well as the language. The wrapper script names are given in the following table:

Language	Compiler suite	Script
C	GCC	mpicc
C	Intel	mpiicx
C++	GCC	mpicxx
C++	Intel	mpiicpx
Fortran	GCC	mpifc
Fortran	Intel	mpiifx

Compiling and running the MPI version of the Buffon's needle Fortran code for Intel MPI and Intel compilers:

$ echo $NSLOTS
4
$ module load intel intel-mpi
$ mpiifx -o buffon_mpi buffon_mpi.f90
$ time mpirun -np $NSLOTS ./buffon_mpi
Estimated Pi 3.1396319866  99.938%

real    0m1.455s
user    0m2.376s
sys     0m0.465s

Compiling and running the MPI version of the Buffon's needle C code for Intel MPI and Intel compilers:

$ echo $NSLOTS
4
$ module load intel intel-mpi
$ mpiicx -o buffon_mpi buffon_mpi.c
$ time mpirun -np $NSLOTS ./buffon_mpi
Estimated Pi 3.1414599419   99.996%

real    0m1.171s
user    0m2.280s
sys     0m0.503s

Mixing Intel MPI with GNU compilers

In general we recommend that Intel compilers are used with Intel MPI, and GNU compilers with Open MPI. Mixing Intel MPI with gcc works for C:

$ module load gcc intel-mpi
$ mpicc -o buffon_mpi buffon_mpi.c -lm

However, mixing Intel MPI with GNU Fortran does not currently work:

$ module load gcc intel-mpi
$ mpifc -o buffon_mpi buffon_mpi.f90

Deprecated MPI wrappers¶

Intel has also deprecated the MPI wrappers that go with the deprecated compilers: mpiicc, mpiicpc, mpiifort have been retired in favour of mpiicx, mpiicpx, mpiifx.

The Intel MPI wrapper scripts accept the -show option, as in the Open MPI example above:

$ module load intel-mpi
$ mpifc -show -o hello hello.f90
gfortran -o 'hello' 'hello.f90' ...
$ mpiifx -show -o hello hello.f90
ifx -o 'hello' 'hello.f90' ...

Matching versions of Intel MPI and Intel compiler

In general we recommend that, when using Intel MPI with the Intel compilers, you match the versions of the modules. However, there are times where it is necessary or desirable to use a different version of Intel MPI. In these cases you should load the Intel MPI module after loading the compiler module.

There is no support for the NVIDIA compilers in the Intel MPI implementation.

Compiling and testing¶

While make runs, you should see the compiler commands it issues printed on your screen, each naming the compiler you chose. If compilation completes successfully, an executable should appear in your source or build directory, often accompanied by a success message of some kind.

Quite often, software comes with test programs you can also build. Often, the command to do this looks like the following:

make test

Optimisation¶

Software optimisation comes in many forms, such as compiler optimisation, using alternate libraries, removing bottlenecks from code, algorithmic improvements, and using parallelisation. Using processor-specific compiler options may reduce universal compatibility of your compiled code, but could yield substantial improvements.

The Intel, NVIDIA and GCC compilers may give different performance depending on different libraries or processor optimisation. Benchmarking and comparing code compiled with each compiler is recommended.

Profiling tools¶

Once you have a running program that has been tested, there are several tools you can use to check the performance of your code. Some of these you can use on the cluster and some you can use on your own desktop machine.

perf¶

perf is a tool that creates a log of where your program spends its time. The report can be used as a guide to see where you need to focus your time when optimising code. Once the program has been compiled, it should be run through the record subcommand of perf:

perf record -a -g my_program

where my_program is the name of the program to be profiled. Once the program has run, a log file (perf.data by default) is generated. This log file may be analysed with the report subcommand of perf. For example, to summarise where time was spent, grouped by command and shared object:

perf report --sort comm,dso

More information on perf can be found at this Profiling how-to and this extensive tutorial

valgrind¶

valgrind is a suite of tools that allow you to improve the speed and reduce the memory usage of your programs. An example command would be:

valgrind --tool=memcheck <myprogram>

Valgrind is well suited for multi-threaded applications, but may not be suitable for longer running applications due to the slowdown incurred by the profiled application. In addition, there is a graphical tool which is not offered on the cluster but will work on Linux desktops. There is also an extensive manual. However, there are serious issues with using Valgrind with modern AVX/AVX2/AVX-512 architectures and GCC and Open MPI. If using Intel compilers is an option, we'd recommend the valgrind/3.20.0-intel-oneapi-mpi-2021.12.1-oneapi-2024.1.0 module.

Python profiling tools¶

The above tools work best for compiled binaries. If you are writing code in Python, cProfile and line_profiler are useful options. Optimisations for slow-running Python code include parallelisation with multiprocessing or dask to use multiple cores efficiently, and compilers such as pythran or numba. For more details, High Performance Python by Micha Gorelick and Ian Ozsvald is available to QMUL staff and students.