CUDA

 

This page is under construction.

 

CUDA with MPI

Most people program CUDA using its Runtime API.  With this API, CUDA kernels are called using "triple chevron" syntax.  For example, a Runtime API kernel call might look like

  Reduce<<<nblocks, nthreads>>>(idata, odata, size);
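
In context, such a call sits in a CUDA source file alongside the kernel it launches. A minimal sketch of such a file is shown below; the file name, the kernel body, and the surrounding host code are illustrative assumptions rather than part of the original example.

  // reduce.cu -- a hypothetical, self-contained Runtime API source file.
  #include <cuda_runtime.h>

  // A deliberately naive kernel standing in for Reduce: each thread
  // atomically adds one input element into odata[0].
  __global__ void Reduce(const float* idata, float* odata, int size)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < size)
          atomicAdd(odata, idata[i]);
  }

  // Host-side function that performs the triple chevron launch.
  void CallReduce(float* idata, float* odata, int size)
  {
      int nthreads = 256;
      int nblocks  = (size + nthreads - 1) / nthreads;
      Reduce<<<nblocks, nthreads>>>(idata, odata, size);
  }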

Runtime API code must be compiled with a compiler that understands it, such as NVIDIA's nvcc. Using nvcc to link a CUDA program is also convenient, in that it automatically finds the right CUDA libraries to link against.

Similarly, for a non-CUDA MPI program, it is easiest to compile and link MPI code using the MPI compiler drivers (e.g., mpicc) because they automatically find and use the right MPI headers and libraries.

However, a problem arises if you want to compile and link a CUDA program that also contains calls to MPI functions.  A source file that contains Runtime API (triple chevron) kernel launches must be compiled by nvcc, which knows nothing about MPI's headers and libraries, while the MPI compiler driver, which does supply them, does not understand the triple chevron syntax; each source file can be compiled by one driver or the other, but not both.  Likewise, when linking a program that contains both CUDA code and MPI code, one of the two compiler drivers must be chosen to perform the link.

There are two primary approaches for addressing this problem:

  • Let CUDA win
    • Determine the flags needed to compile and link an MPI program. Some MPI implementations' compiler drivers can report these flags (e.g., Open MPI's mpicc --showme or MPICH's mpicc -show).
    • Compile and link everything with nvcc, passing the necessary MPI flags.
    • Note that in our experience it is difficult to use this approach with anything other than the GNU compiler.
  • Let MPI win (a sketch of this approach follows the list)
    • Separate the code that uses the CUDA Runtime API (triple chevron) kernel launches from code that uses MPI
    • Compile source files with the triple chevron kernel invocations using nvcc
    • Compile all other code using the MPI compiler drivers
    • Link with the MPI driver, passing the correct -L and -l flags for finding and using the CUDA libraries (e.g., -lcudart)
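
As a concrete sketch of the "let MPI win" approach, the hypothetical program below is split into one CUDA file and one MPI file. The file names, the LaunchReduce wrapper, and the build lines in the comments are assumptions made for illustration, not taken from SHOC or from the CUDA or MPI documentation.

  // reduce_kernels.cu -- compiled with nvcc; the only file that contains
  // triple chevron syntax.  A C-callable wrapper hides the launch from
  // the MPI side of the program.
  #include <cuda_runtime.h>

  __global__ void Reduce(const float* idata, float* odata, int size)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < size)
          atomicAdd(odata, idata[i]);   // naive reduction, for illustration
  }

  extern "C" void LaunchReduce(float* idata, float* odata, int size,
                               int nblocks, int nthreads)
  {
      Reduce<<<nblocks, nthreads>>>(idata, odata, size);
  }

  /* main.c -- compiled with the MPI driver; contains MPI calls but no
     CUDA syntax, only an ordinary function declaration. */
  #include <mpi.h>

  void LaunchReduce(float* idata, float* odata, int size,
                    int nblocks, int nthreads);  /* defined in reduce_kernels.cu */

  int main(int argc, char** argv)
  {
      MPI_Init(&argc, &argv);
      /* ... allocate device buffers, call LaunchReduce, and combine
         per-rank results with MPI_Reduce ... */
      MPI_Finalize();
      return 0;
  }

  /* One possible build sequence (adjust CUDA_HOME for your system):
         nvcc  -c reduce_kernels.cu
         mpicc -c main.c
         mpicc main.o reduce_kernels.o -L${CUDA_HOME}/lib64 -lcudart -o app
     Because nvcc compiles host code as C++, some toolchains need the final
     link to be done with mpicxx (or with an extra -lstdc++). */

Under the "let CUDA win" approach, by contrast, no file split is needed: everything, including the MPI calls, would be compiled and linked by nvcc, with the MPI flags reported by the MPI compiler driver (e.g., mpicc --showme) added to the nvcc command lines.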

With our SHOC benchmark suite, we have found that the second approach (using MPI compiler drivers) is less error prone.

Note that it may be possible to avoid this problem if you use functions like cudaLaunch (or its newer counterpart, cudaLaunchKernel) to launch your kernels instead of the triple chevron syntax.  If all of your use of CUDA is through function calls, it may be possible to keep CUDA and MPI code in the same source file and compile it using the MPI compiler drivers.  However, you will still have to determine the CUDA compiler and linker flags and pass those to the MPI compiler driver, so the practical benefit of this approach may be minimal.
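
As a rough sketch of that alternative, the file below mixes MPI calls and CUDA Runtime API calls without any triple chevron syntax. It uses cudaLaunchKernel (the current form of the older cudaLaunch-style interface mentioned above); the file name, buffer sizes, and the assumption that the kernel is built separately with nvcc are all illustrative choices, and the CUDA include and library flags still have to be passed to the MPI compiler driver.

  /* mixed.cpp -- illustrative sketch: MPI calls and CUDA Runtime API calls
     in one source file, with no triple chevron syntax, so in principle it
     can be compiled by an MPI C++ driver such as mpicxx.  It assumes the
     kernel is defined in an nvcc-compiled file as:
         extern "C" __global__ void Reduce(const float*, float*, int);   */
  #include <mpi.h>
  #include <cuda_runtime.h>

  /* Declaration of the host-side stub that nvcc generates for the kernel. */
  extern "C" void Reduce(const float* idata, float* odata, int size);

  int main(int argc, char** argv)
  {
      MPI_Init(&argc, &argv);

      int size = 1024;
      float *idata, *odata;
      cudaMalloc(&idata, size * sizeof(float));
      cudaMalloc(&odata, sizeof(float));
      cudaMemset(idata, 0, size * sizeof(float));
      cudaMemset(odata, 0, sizeof(float));

      /* Launch through the Runtime API instead of the triple chevron
         syntax: pass the kernel's address and an array of pointers to
         its arguments. */
      void* args[] = { &idata, &odata, &size };
      cudaLaunchKernel((const void*)Reduce, dim3(size / 256), dim3(256),
                       args, 0 /* shared mem */, 0 /* default stream */);

      float local = 0.0f, global = 0.0f;
      cudaMemcpy(&local, odata, sizeof(float), cudaMemcpyDeviceToHost);
      MPI_Reduce(&local, &global, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

      cudaFree(idata);
      cudaFree(odata);
      MPI_Finalize();
      return 0;
  }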