Programming environments

The Keeneland System uses modules to control the programming environment. For instance, there are compilers from several different vendors, each of which may have several releases available. Modules generally modify your environment variables (e.g., $PATH) to select which compiler is used and which libraries are linked in.

By default, several modules are already loaded:

$ module list
Currently Loaded Modulefiles:
  1) modules
  2) torque/2.5.11
  3) moab/6.1.5
  4) gold
  5) mkl/2011_sp1.8.273
  6) intel/2011_sp1.8.273
  7) openmpi/1.5.1-intel
  8) PE-intel
  9) cuda/4.1

The PE-intel module sets up the environment to use the Intel compilers (icc and ifort). The module show command lists all the actions that loading the module performs: in this case, it checks whether another programming environment is already loaded and aborts if so. Otherwise, it loads the default version of the Intel compilers and the version of OpenMPI built for Intel, and sets some generic environment variables for use in Makefiles (e.g., $CC always refers to the current C compiler):

$ module show PE-intel

conflict     PE-pgi PE-gnu PE-intel 
module         load intel 
module         load openmpi/1.5.1-intel 
setenv         CC icc 
setenv         CPP icc -E 
setenv         CXX icpc 
setenv         FC ifort 
setenv         F77 ifort 
setenv         F90 ifort 
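
These generic variables make build scripts portable across programming environments. A minimal Makefile sketch that relies on them (the file names and flags here are illustrative, not taken from the Keeneland documentation):

```
# CC is inherited from the loaded PE module, so this Makefile
# works unchanged under PE-intel, PE-pgi, or PE-gnu.
CFLAGS = -O2

prog: prog.o
	$(CC) -o prog prog.o

prog.o: prog.c
	$(CC) $(CFLAGS) -c prog.c

clean:
	rm -f prog prog.o
```

Because the module already exports $CC, no compiler name is hard-coded; swapping PE modules and rebuilding picks up the new toolchain automatically.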

In order to change compilers, use module swap. Third-party libraries may need to be reloaded afterwards, since they often check which programming environment is loaded and set their paths accordingly. For example, this switches from the Intel compilers to PGI:

$ module swap PE-intel PE-pgi
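
To confirm the swap took effect, inspect one of the generic variables the PE modules set (the pgcc output shown is an assumption, by analogy with PE-intel setting $CC to icc):

```
$ module swap PE-intel PE-pgi
$ echo $CC
pgcc
```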


  • Some third-party software (notably DDT) does not yet work with the new version of CUDA, but it may still be advisable to give that version a try.
  • The default version of gcc, without any modules loaded, is 4.1.2. Users may load newer versions of gcc via modules.

Compiling for CPU

In order to compile non-MPI, non-CUDA code, the compilers may be called directly. Additional documentation is also provided in the man pages for each compiler, for example man gcc.

             C             C++           Fortran
  Generic    $CC           $CXX          $FC, $F77, $F90
  GNU        gcc, gcc44    g++, g++44    gfortran, gfortran44
  Intel      icc           icpc          ifort
  PGI        pgcc          pgCC          pgfortran, pgf90, pgf77


Warning: Optimizing with ifort -fast may currently fail because -fast implies static linking, and no static version of libnuma is currently available. It is possible to compile with all the options contained in -fast except -static:

$ ifort -xHOST -O3 -ipo -no-prec-div

Compiling MPI

There are two major MPI implementations available via modules. OpenMPI is loaded by default, and MVAPICH2 may be used if preferred. In either case, the MPI compiler wrappers should be used (given in table below). If the environment is set properly, these should call the chosen compiler and link against all the necessary MPI libraries. In order to check what these are doing, pass the -dryrun option, in which case the wrapper will print all of the commands it would normally run.

  C        C++       Fortran
  mpicc    mpicxx    mpif90, mpif77
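
As a quick sanity check of the wrappers, the standard MPI "hello world" can be built with mpicc and launched with mpirun (a minimal sketch; the file name and process count are illustrative):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```

Compile and run, for example, with mpicc hello_mpi.c -o hello_mpi followed by mpirun -np 4 ./hello_mpi.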


Compiling CUDA

CUDA programs and objects should be compiled with nvcc. This generates code for the GPU, and it also calls a host C compiler to generate code for the CPU. To see the calls to the C compiler, use the -dryrun flag:

nvcc -dryrun

By default, gcc is used to compile the C code. If another compiler is preferred, specify it with the -ccbin flag. For example, in order to use Intel's compiler, enter the following:

nvcc -ccbin icc

For some examples of working Makefiles, check out the NVIDIA CUDA SDK or try some of the MAGMA test cases.

  • In addition, the modules define $CPP to be the generic C preprocessor.
  • icc compiles C or C++ depending on the suffix of the source file; icpc is the same compiler, but it forces C++.
  • Similarly, the Fortran compilers have several different names, corresponding to the Fortran 77 and Fortran 90 standards. Refer to the man pages for specifics.

For a trivial example of how to compile and run CUDA, consider:

#include <cstdio>

__global__ void helloworld() {
    printf("Hello World!\n");
}

int main(void) {
    helloworld<<< 1, 32 >>>();
    /* wait for the kernel to finish so its output is flushed */
    cudaDeviceSynchronize();
    return 0;
}
Compile like so (hello.cu is a hypothetical name for the source file above):

nvcc -arch=sm_21 hello.cu

You would not even need the -arch=sm_21 flag were it not for the printf statement inside device code, which requires compute capability 2.0 or higher!

Run like so:

$ ./a.out

You will see Hello World! printed 32 times, once per thread.