GPUDirect

GPUDirect is the marketing name given to several optimizations for data transfer between GPUs. For the remainder of this document, these optimizations will be referred to as GPUDv1, GPUDv2, and GPUDv3.


GPUDv1

As described by NVIDIA at http://developer.nvidia.com/cuda/nvidia-gpudirect, the first version (GPUDv1), available from CUDA 3.1 onwards, allows the Infiniband driver and the GPU device driver to share the same page-locked (“pinned”) buffer in host memory, removing a redundant copy in host memory. Without GPUDirect, when data in GPU memory is to be sent to another node through an Infiniband device, the data must be transferred to the GPU's data buffer in system memory and then copied to the Infiniband device's data buffer in system memory. GPUDv1 skips this second copy by allowing both the GPU and the Infiniband device to access the same data buffer. This situation is illustrated in the figure below from http://developer.nvidia.com/cuda/nvidia-gpudirect.


GPUDv2

The second version (GPUDv2), available with CUDA 4.0 and onwards, enables two GPUs on the same PCIe bus to transfer data directly between their memories. Without GPUDv2, to get data from one GPU to another on the same host, one would first use cudaMemcpy(..., cudaMemcpyDeviceToHost) to bring the data from the first GPU into system memory, then cudaMemcpy(..., cudaMemcpyHostToDevice) to push the same data onto the second GPU. GPUDv2 allows the two GPUs to transfer data directly, avoiding any copies to system memory. There are two modes in which this can be done explicitly: a kernel executing on the first GPU loads or stores data in the memory residing on the second GPU, or a transfer is executed from the memory of the first GPU to the second GPU using cudaMemcpy(). These two cases are shown in this figure from http://developer.nvidia.com/cuda/nvidia-gpudirect.

There is actually a third way to use GPUDv2. A CUDA-aware MPI implementation (at least mvapich2 v1.8 or openmpi 1.7) can notice that a transfer between two MPI processes involves GPU device buffers on the same PCIe bus and automatically select the GPUDv2 peer-to-peer protocol for the requested MPI communication.


GPUDv3

The third version (GPUDv3), available starting with CUDA 5.0, will use RDMA to transfer data across an Infiniband network between two GPUs in separate nodes, bypassing host memory altogether. This is shown in the following figure from http://developer.nvidia.com/cuda/nvidia-gpudirect.




GPUDirect on Keeneland systems

This is the architecture of a single node on Keeneland (KIDS), which consists of an HP SL390 with three NVIDIA M2090 GPUs. GPUDv1 is available on Keeneland, but GPUDv2 is only possible between GPUs 1 and 2 because GPU 0 is on a separate PCIe bus.
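
If the node layout is not known in advance, it can be discovered at run time. The following is a minimal sketch (not part of the original Keeneland documentation) that prints each GPU's PCI location and the peer-access matrix; devices that report mutual peer access sit behind the same PCIe root complex.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int n;
    cudaGetDeviceCount(&n);

    /* print each device's PCI domain:bus:device location */
    for (int i = 0; i < n; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s (PCI %04x:%02x:%02x)\n",
               i, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }

    /* print the peer-access matrix */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (i != j) {
                int can;
                cudaDeviceCanAccessPeer(&can, i, j);
                printf("GPU %d -> GPU %d peer access: %s\n", i, j, can ? "yes" : "no");
            }
    return 0;
}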

Keeneland (KFS) nodes have a slightly different configuration, as shown in the following diagram.


Using GPUDv1

First, a note about versions and requirements for using GPUDirect. GPUDirect v1 was released with CUDA 3.1 and required both a kernel patch and setting CUDA_NIC_INTEROP=1. As of CUDA 4.0, the kernel patch is no longer necessary with kernels >= 2.6.18. From CUDA 4.1 onwards, the behavior specified by CUDA_NIC_INTEROP=1 is enabled by default.

There are a couple of ways in which GPUDv1 can accelerate data transfers using MPI. The first is simply by allowing the Infiniband driver and GPU driver to share page-locked memory. In this case, the code to transfer data from the memory of GPU1 to GPU2 on a separate node would look exactly the same as it would without GPUDirect. Specifically, a cudaMemcpy() from GPU to host would be done on the sending host, then the MPI_Send()/MPI_Recv(), and finally a cudaMemcpy() from host to GPU on the receiving host. However, to take advantage of GPUDv1, the host memory is allocated and page-locked using cudaHostAlloc(), cudaMallocHost(), or cudaHostRegister(). Doing this speeds up the host-GPU memory transfers, and therefore the entire transaction takes less time.

/* ping pong between two GPUs with explicit CPU<->GPU transfers;
   size is the number of MPI_INT elements in the message */
cudaMallocHost( (void **) &a_host, size * sizeof(int));
cudaMallocHost( (void **) &b_host, size * sizeof(int));
if (rank == 0) {
	cudaMemcpy(a_host, a_dev, size * sizeof(int), cudaMemcpyDeviceToHost);
	MPI_Send(a_host, size, MPI_INT, 1, 0, MPI_COMM_WORLD);
	MPI_Recv(b_host, size, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
	cudaMemcpy(b_dev, b_host, size * sizeof(int), cudaMemcpyHostToDevice);
} else {
	MPI_Recv(b_host, size, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
	cudaMemcpy(b_dev, b_host, size * sizeof(int), cudaMemcpyHostToDevice);
	/* do stuff on GPU2 */
	cudaMemcpy(b_host, b_dev, size * sizeof(int), cudaMemcpyDeviceToHost);
	MPI_Send(b_host, size, MPI_INT, 0, 0, MPI_COMM_WORLD);
}

The second way to use GPUDv1 is with a CUDA-aware MPI implementation (at least mvapich2 v1.8 or openmpi 1.7). In this scenario, the device buffers are passed directly to the MPI communication calls, and no explicit cudaMemcpy() appears in the application code for the transfer from GPU1 to GPU2 on a separate node.

Some configuration needs to be done to use MPI with GPUs. For the mvapich2 case, we have these hints from an NVIDIA GPUDirect technical note:


This support can be enabled by configuring MVAPICH2 with --enable-cuda and setting the environment variable MV2_USE_CUDA to 1 during runtime.

To minimize communication overhead, MVAPICH2 divides copies between device and host into chunks. This can be better tuned with a runtime environment variable MV2_CUDA_BLOCK_SIZE. The default chunk size is 64K (65536). However, higher values of this parameter, such as 256K (262144) and 512K (524288), might deliver better performance if the MPI application uses large messages. The optimal value for this parameter depends on several other factors such as InfiniBand network/adapter speed, GPU adapter characteristics, platform characteristics (processor and memory speed) and amount of memory to be dedicated to the MPI library with GPU support. For different platforms and MPI applications, the users are encouraged to try out different values for this parameter to get best performance.

MVAPICH2 uses network loopback for intra node communication when MV2_USE_CUDA is enabled. (More precisely, the loopback interface is used on machines with InfiniBand adaptors if MV2_USE_CUDA=1 and MV2_CUDA_IPC=0. A future release of MVAPICH2 will have a more streamlined approach to the various scenarios.) In order to use MVAPICH2 NVIDIA GPU features on stand alone multi-core GPU nodes that are not equipped with InfiniBand adapters, users need to enable shared memory through the following run time environment variables: MV2_USE_SHARED_MEM=1 MV2_SMP_SEND_BUF_SIZE=262144. The parameter MV2_SMP_SEND_BUF_SIZE controls the size of copies used in large message shared memory communication. Users can better tune based on the requirement. Some suggested values are 65536, 131072 or 524288.

GPU Affinity: When multiple GPUs are present on a node, users might want to set the MPI process affinity to a particular GPU using cuda calls like cudaSetDevice(). This can be done after MPI_Init based on MPI rank of the process. But MVAPICH2 performs some cuda operations like buffer registration and others during MPI_Init which result in default context creation. Hence, setting GPU Affinity after MPI_Init could create issues due to the context switch. To avoid this, MVAPICH2 provides an environment variable called MV2_COMM_WORLD_LOCAL_RANK to get the local rank of a process on its node before MPI_Init is called. This local rank information can be used to set GPU affinity before MPI_Init is called as given in the following code example


....
int local_rank = atoi(getenv("MV2_COMM_WORLD_LOCAL_RANK"));
cudaSetDevice(local_rank % num_devices);
...
MPI_Init()
...


This local rank information can also be used in wrapper scripts to set cpu and memory affinity on nodes with NUMA.


And for openmpi usage, this FAQ explains GPU support:


Open MPI recently added support for sending and receiving CUDA device memory directly. Prior to this support, the programmer would first have to stage the data in host memory prior to making the MPI calls. Now, the Open MPI library will automatically detect that the pointer being passed in is a CUDA device memory pointer and do the right thing.

The use of device pointers is supported in all of the send and receive APIs as well as most of the collective APIs. Neither the collective reduction APIs nor the one-sided APIs are supported. Here is the list of APIs that currently support sending and receiving CUDA device memory.

MPI_Send, MPI_Bsend, MPI_Ssend, MPI_Rsend, MPI_Isend, MPI_Ibsend, MPI_Issend, MPI_Irsend, MPI_Send_init, MPI_Bsend_init, MPI_Ssend_init, MPI_Rsend_init, MPI_Recv, MPI_Irecv, MPI_Recv_init, MPI_Sendrecv, MPI_Bcast, MPI_Gather, MPI_Gatherv, MPI_Allgather, MPI_Allgatherv, MPI_Alltoall, MPI_Alltoallv, MPI_Scatter, MPI_Scatterv

Open MPI depends on various new features of CUDA 4.0, so one needs to have the CUDA 4.0 driver and toolkit. The new features of interest are the Unified Virtual Addressing (UVA) so that all pointers within a program have unique addresses. In addition, there is a new API that allows one to determine if a pointer is a CUDA device pointer or host memory pointer. This API is used by the library to decide what needs to be done with each buffer. In addition, CUDA 4.1 also provides the ability to register host memory with the CUDA driver which can improve performance. Until CUDA 4.1 is released, users may see a warning about trying to register memory and failing. That is just a warning and can be ignored as things will still work.

If utilizing the driver API, the application needs to ensure that it has called cuInit() and cuCtxCreate() prior to calling MPI_Init. With the CUDA runtime API, one needs to make sure that the runtime has been initialized so that the MPI library has a valid CUDA context.

The Open MPI implementation essentially substitutes cuMemcpy calls for memcpy calls in the library when device memory is detected. This means there are some performance effects that should be noted. First, in order to utilize the cuMemcpy, the library automatically switches to protocols that first stage the data in host memory. Therefore, for larger messages, there can be some performance degradation as the large message RDMA protocols cannot be used for sending device memory directly. Secondly, there is a latency hit on each cuMemcpy call of around 10 usecs. This means that one might see an additional 20 usecs overhead (copy in and copy out) on top of the transport latency.

Derived datatypes, both contiguous and non-contiguous, are supported. However, the non-contiguous datatypes currently have high overhead because of the many calls to cuMemcpy to copy all the pieces of the buffer into the intermediate buffer.

All of these issues are currently being investigated, with the hope of improving them.

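The FAQ's note about initialization is worth making concrete. Below is a minimal sketch, assuming Open MPI's mpirun exports OMPI_COMM_WORLD_LOCAL_RANK (the device-selection policy is only an illustration), of making sure the CUDA runtime has a context on the chosen device before MPI_Init() is called:

#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);

    /* local rank on this node; OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI's mpirun */
    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = (lr != NULL) ? atoi(lr) : 0;

    /* choose a device and force context creation before MPI_Init() */
    cudaSetDevice(local_rank % num_devices);
    cudaFree(0);

    MPI_Init(&argc, &argv);
    /* ... CUDA-aware MPI communication using device buffers ... */
    MPI_Finalize();
    return 0;
}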

Here is an example of what the transfer would look like in an application using MPI to transfer GPU buffers:

/* ping pong between two GPUs with MPI-managed CPU<->GPU transfers;
   size is the number of MPI_INT elements, and the device buffers are
   passed directly to MPI (no host staging buffers in the application) */
cudaMalloc( (void **) &a_dev, size * sizeof(int));
cudaMalloc( (void **) &b_dev, size * sizeof(int));
if (rank == 0) {
	MPI_Send(a_dev, size, MPI_INT, 1, 0, MPI_COMM_WORLD);
	MPI_Recv(b_dev, size, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
} else {
	MPI_Recv(b_dev, size, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
	/* do stuff on GPU2 */
	MPI_Send(b_dev, size, MPI_INT, 0, 0, MPI_COMM_WORLD);
}

A specific point should be addressed here. Even though in the code it looks like the data is going directly from GPU1 across the network to GPU2 (as there is no cudaMemcpy() being done), it still has to travel through the host memories on both sides; MPI simply hides this detail from us. GPUDv3 will allow RDMA transfers directly between GPUs on separate nodes across an Infiniband network.


Using GPUDv2

GPUDirect v2 was first released with CUDA 4.0. Again, a CUDA-aware MPI implementation (at least mvapich2 v1.8 or openmpi 1.7) can handle this scenario if it notices that the transfer is between GPUs on the same PCIe bus. It will then select the GPUDv2 peer-to-peer protocol for the transfer.
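
As a minimal sketch of that scenario (count and dev_buf are illustrative names, not from the Keeneland examples), each rank on the node selects its own GPU and hands the device buffer straight to MPI; the CUDA-aware library is then expected to pick the peer-to-peer path when both buffers live on GPUs behind the same PCIe root complex.

int rank, num_gpus;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
cudaGetDeviceCount(&num_gpus);
cudaSetDevice(rank % num_gpus);   /* map each rank on the node to its own GPU */

int *dev_buf;
cudaMalloc((void **) &dev_buf, count * sizeof(int));

if (rank == 0)
    MPI_Send(dev_buf, count, MPI_INT, 1, 0, MPI_COMM_WORLD);
else if (rank == 1)
    MPI_Recv(dev_buf, count, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);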

In order to use GPUDv2 with explicit transfers (without MPI), there are a few checks that should be performed in the code. When running code on the Keeneland systems, the GPUs available and layout of the hardware are known, but these checks should be done for portability.

/* Check for multiple GPUs (to use GPUDv2) */
int num_gpus;
cudaGetDeviceCount(&num_gpus);

/* Check capability of the GPU 
    (should be done for each card to be used)
*/
cudaDeviceProp prop1, prop2;

// second argument is gpu number
cudaGetDeviceProperties(&prop1, 1);
cudaGetDeviceProperties(&prop2, 2);

// check results
is_fermi = false;
if(prop1.major >= 2 && prop2.major >= 2)  is_fermi = true; // must be Fermi based

Another check is needed to make sure that the two GPUs are located on the same PCIe root complex. This is because the GPUDv2 optimization uses PCIe bus mastering. So, on Keeneland, GPUDv2 is possible only between GPUs 1 and 2. GPU 0 is on a different I/O hub, and since a GPU can’t bus master across QPI, the host must be involved for communication between GPU 0 and GPU 1/2. This is explained in an Intel whitepaper.

“The IOH does not support non-contiguous byte enables from PCI Express for remote peer-to-peer MMIO transactions. This is an additional restriction over the PCI Express standard requirements to prevent incompatibility with Intel QuickPath Interconnect.”

However, cudaMemcpy() automatically falls back to a device-to-host-to-device copy when P2P is unavailable.

int access2from1, access1from2;

/* cudaDeviceCanAccessPeer(&result, device, peerDevice) reports whether
   'device' can directly access the memory of 'peerDevice'.
   On KIDS, any check involving GPU 0 would return 0 (false). */
cudaDeviceCanAccessPeer(&access2from1, 1, 2);  // access to GPU 2 from GPU 1
cudaDeviceCanAccessPeer(&access1from2, 2, 1);  // access to GPU 1 from GPU 2

same_complex = false;
if(access2from1==1 && access1from2==1) same_complex = true;

If all checks have passed, then peer to peer access is enabled and the rest of the program can follow. (It does not actually matter where peer access is enabled in the program; it can be before or after the memory allocation, but must be before the p2p transfer is to take place.)

/* if all checks passed, enable peer to peer access;
   compatible_driver is assumed to come from a CUDA driver/runtime version
   check (e.g. cudaDriverGetVersion()), which is not shown here */
if(is_fermi && compatible_driver && same_complex)
{
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(2, 0); // second argument is flags (must be 0)
    cudaSetDevice(2);
    cudaDeviceEnablePeerAccess(1, 0); // second argument is flags (must be 0)
}

/* allocate data on the 2 gpus */
cudaSetDevice(1);
cudaMalloc(&gpu1data, nbytes);
cudaSetDevice(2);
cudaMalloc(&gpu2data, nbytes);

As mentioned, there are two ways to perform the peer to peer data transfer between the GPUs. The first is to copy the data from one GPU's memory to the other explicitly, using a cudaMemcpy() call.

/* allocate data on the 2 gpus */
cudaSetDevice(1);
cudaMalloc(&gpu1data, nbytes);
cudaSetDevice(2);
cudaMalloc(&gpu2data, nbytes);

/* copy data to GPU 1 from host */
cudaMemcpy(gpu1data, hostdata, nbytes, cudaMemcpyHostToDevice);

some_more_code();

/* copy the data from GPU 1 to GPU 2 */
cudaMemcpy(gpu2data, gpu1data, nbytes, cudaMemcpyDefault);

/* execute kernel on GPU 2 using acquired data */
cudaSetDevice(2);
gpu2_kernel<<<grid, block>>>(..., gpu2data, ...);

The second way to share data between GPUs is to let a kernel running on one GPU directly access memory that resides on the other GPU.

/* allocate data on the 2 gpus */
cudaSetDevice(1);
cudaMalloc(&gpu1data, nbytes);
cudaSetDevice(2);
cudaMalloc(&gpu2data, nbytes);

/* copy data to GPU 1 from host */
cudaMemcpy(gpu1data, hostdata, nbytes, cudaMemcpyHostToDevice);

some_more_code();

/* execute kernel on GPU 2 accessing memory on GPU 1 */
cudaSetDevice(2);
gpu2_kernel<<<grid, block>>>(..., gpu1data, ...);

Here, gpu2_kernel can both read from and write to gpu1data, the memory on GPU 1. Be aware that synchronization issues must be considered so that the data being read is in fact ready for use by another kernel on another device.
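
One way to enforce that ordering is with CUDA events. The following is a minimal sketch (gpu1_kernel, grid, and block are illustrative names, not part of the original example) in which GPU 2 waits for GPU 1 to finish producing gpu1data before reading it.

cudaEvent_t gpu1_done;

cudaSetDevice(1);
cudaEventCreateWithFlags(&gpu1_done, cudaEventDisableTiming);
gpu1_kernel<<<grid, block>>>(gpu1data);       /* produces gpu1data on GPU 1 */
cudaEventRecord(gpu1_done, 0);                /* marks completion in GPU 1's stream */

cudaSetDevice(2);
cudaStreamWaitEvent(0, gpu1_done, 0);         /* GPU 2's stream waits on the event */
gpu2_kernel<<<grid, block>>>(..., gpu1data, ...);  /* now safe to read gpu1data */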


Using GPUDv3

GPUDirect v3 is not yet available on the Keeneland systems. It requires CUDA 5.0 and an Infiniband driver with GPUDirect RDMA support. These are not yet installed in the production environment, but it is expected that they will be available in the future.


Unified Virtual Addressing

There is a CUDA feature called “Unified Virtual Addressing” (UVA) which has been assumed in the above examples. This means that there is one address space shared between the host and all GPUs in the system. Any memory that is allocated with cudaMalloc() or cudaHostAlloc() will reside in this shared space. For our examples, this has a couple of implications:

  1. The cudaMemcpyKind parameter to cudaMemcpy() becomes unnecessary. cudaMemcpyDefault can be used instead, and the CUDA runtime will figure out the proper operation based on the source and destination pointers that are used with cudaMemcpy().
  2. UVA must be used to let a kernel running on a GPU directly access memory on another GPU.

Since the system malloc() is used in the provided examples for allocating host memory, cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are still used in cudaMemcpy() calls that involve the host. This could be avoided by using cudaHostAlloc() in place of the system malloc(), in which case cudaMemcpyDefault could be used instead.
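
A minimal sketch of that variation follows, reusing the gpu1data/gpu2data/nbytes names from the examples above (the hostdata allocation here is an assumption, not from the original code):

/* pinned host memory participates in the unified address space,
   so cudaMemcpyDefault works for every direction */
int *hostdata;
cudaHostAlloc((void **) &hostdata, nbytes, cudaHostAllocDefault);

cudaMemcpy(gpu1data, hostdata, nbytes, cudaMemcpyDefault);  /* host  -> GPU 1 */
cudaMemcpy(gpu2data, gpu1data, nbytes, cudaMemcpyDefault);  /* GPU 1 -> GPU 2 */
cudaMemcpy(hostdata, gpu2data, nbytes, cudaMemcpyDefault);  /* GPU 2 -> host  */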

UVA is enabled by default on any devices with CUDA compute capability of 2.0 or higher, but it can be checked for explicitly using the unifiedAddressing device property.

/* check for Unified Virtual Addressing */
cudaDeviceProp prop1;

// second argument is gpu number
cudaGetDeviceProperties(&prop1, 1);

has_uva = false;
if(prop1.unifiedAddressing == 1) has_uva = true;

GPUDirect Performance on Keeneland (KIDS and KFS)

Below is a graph for the bandwidths that various data transfer scenarios yield on KFS. In all cases, the bandwidth was measured as the data size divided by half the total round-trip time.
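
As a minimal sketch of that measurement (NITER, count, and buf are illustrative, and this is not the benchmark code used to generate the plots), a ping-pong loop is timed and the bandwidth computed from half the average round-trip time:

const int NITER = 100;                      /* illustrative iteration count */
double t0 = MPI_Wtime();
for (int i = 0; i < NITER; i++) {
    if (rank == 0) {
        MPI_Send(buf, count, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, count, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(buf, count, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, count, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
}
double t_roundtrip = (MPI_Wtime() - t0) / NITER;                  /* seconds per round trip */
double bandwidth   = (count * sizeof(int)) / (t_roundtrip / 2.0); /* bytes per second */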

In this plot, we show for reference the bandwidth for the following:

  1. A non-GPU transfer between two separate compute nodes (green diamonds)
  2. Host memory to GPU memory copies on a single node using non-page-locked memory (black circles)
  3. Host memory to GPU memory copies on a single node using page-locked memory (red squares)

Next we show the bandwidth to transfer data from GPU to GPU on separate nodes. In all cases, the data must travel through the host CPUs, both for the outgoing and incoming messages (GPU1 → CPU1 → (IB network) → CPU2 → GPU2).

  1. The “gpu-gpu” curve (blue triangles) is for the non-GPUDv1 case, in which the data is moved explicitly between the GPU and CPU memory buffers and then across the network.
  2. A performance gain is realized when the GPU and IB drivers share a page-locked buffer using GPUDv1 (purple crosses).
  3. Finally, if the MPI implementation is capable of handling the transfer for us, an even better speedup is gained by passing the device buffers directly to the MPI communication calls (brown stars).

As explained above, in this last case even though in the code it looks like the data is going directly from GPU1 across the network to GPU2 (as there is no cudaMemcpy() being done), it still has to travel through the host memories on both sides; MPI hides this detail from us. GPUDv3 will allow RDMA transfers directly between GPUs on separate nodes across an Infiniband network.

And here are the bandwidth measurements for GPUDv2 on KIDS. These curves are fairly self-explanatory.