OpenACC
OpenACC is an open parallel programming standard that lets programmers give the compiler simple hints, known as “directives,” identifying which regions of code to accelerate, without requiring them to modify or adapt the underlying code itself. By exposing parallelism to the compiler, directives allow the compiler to do the detailed work of mapping the computation onto the accelerator.
OpenACC was developed by PGI, Cray, and NVIDIA with support from CAPS. Companies with products committed to the OpenACC standard are:
- Portland Group (PGI): Accelerator Compiler
- CAPS: HMPP Workbench
- Cray Inc.: Cray Compilation Environment
The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs and accelerators.
The directives and programming model allow programmers to create high-level host+accelerator programs without the need to explicitly initialize the accelerator, manage data or program transfers between the host and accelerator, or initiate accelerator startup and shutdown. All of these details are implicit in the programming model and are managed by the OpenACC API-enabled compilers and runtimes.
The programming model allows the programmer to augment information available to the compilers, including specification of data local to an accelerator, guidance on mapping of loops onto an accelerator, and similar performance-related details.
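For example, a single kernels directive is often all that is needed to offload a simple loop. The following toy saxpy program is a minimal sketch of the directive style; it is not part of the Keeneland materials, and the names in it are purely illustrative:

#include <stdio.h>

/* Sketch: the "kernels" directive asks the compiler to offload the loop;
 * data movement for x and y is left entirely to the compiler. */
void saxpy(int n, float a, float * restrict x, float * restrict y)
{
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    static float x[1 << 20], y[1 << 20];

    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(n, 3.0f, x, y);

    printf("y[0] = %f\n", y[0]);   /* expect 5.0 */
    return 0;
}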
OpenACC on Keeneland
On Keeneland, we have the PGI Accelerator compiler; the version number is a moving target. Load the compiler like so:
module unload PE-intel
module load PE-pgi
You can display detailed information about the GPU devices on a compute node by executing the following command while logged onto the node:
pgaccelinfo
You can specify which GPU to run on like so:
export ACC_DEVICE_NUM=0
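The same choice can also be made from inside a program through the OpenACC runtime API. The following is a sketch only, with error handling omitted:

#include <openacc.h>

int main(void)
{
    /* Select GPU 0 of the NVIDIA device type before any accelerator
     * region runs; roughly equivalent to setting ACC_DEVICE_NUM=0. */
    acc_set_device_num(0, acc_device_nvidia);

    /* ... offloaded code would follow here ... */
    return 0;
}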
A typical compile command will look like the following:
pgcc -acc -ta=nvidia,time -Minfo=accel laplace2d.c -o laplace2d_acc
In the above command, -acc enables OpenACC directive processing; -ta=nvidia,time targets an NVIDIA GPU and enables collection of timing information at runtime; -Minfo=accel prints compile-time information about the translation from C to CUDA. The following command will run the OpenACC executable created above:
qsub -A <account> laplace2d_acc.job
where laplace2d_acc.job contains the following:
#!/bin/bash
#PBS -l walltime=00:05:00
#PBS -j oe

cd $PBS_O_WORKDIR
./laplace2d_acc
OpenACC Example
The following (laplace2d.c) is the canonical OpenACC-friendly code example:
/*
 * Copyright 2012 NVIDIA Corporation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include <math.h>
#include <string.h>
#include <stdio.h>      /* needed for printf */
#include <openacc.h>
#include <stdlib.h>
#include <sys/time.h>

#define NN 4096
#define NM 4096

double A[NN][NM];
double Anew[NN][NM];

struct timeval timerStart;

void StartTimer()
{
    gettimeofday(&timerStart, NULL);
}

// time elapsed in ms
double GetTimer()
{
    struct timeval timerStop, timerElapsed;
    gettimeofday(&timerStop, NULL);
    timersub(&timerStop, &timerStart, &timerElapsed);
    return timerElapsed.tv_sec*1000.0 + timerElapsed.tv_usec/1000.0;
}

int main(int argc, char** argv)
{
    const int n = NN;
    const int m = NM;
    const int iter_max = 1000;

    const double tol = 1.0e-6;
    double error = 1.0;

    memset(A, 0, n * m * sizeof(double));
    memset(Anew, 0, n * m * sizeof(double));

    for (int j = 0; j < n; j++)
    {
        A[j][0]    = 1.0;
        Anew[j][0] = 1.0;
    }

    printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);

    StartTimer();
    int iter = 0;

    while ( error > tol && iter < iter_max )
    {
        error = 0.0;

        #pragma acc kernels
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }

        #pragma acc kernels
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                A[j][i] = Anew[j][i];
            }
        }

        if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);

        iter++;
    }

    double runtime = GetTimer();

    printf(" total: %f s\n", runtime / 1000);
}
In the above code example, we have asked the OpenACC compiler to offload two areas of code onto the GPU. The following information comes from the compiler at compile time:
pgcc -acc -ta=nvidia,time -Minfo=accel -o laplace2d_acc laplace2d.c
main:
     56, Generating copyin(A[:][:])
         Generating copyout(Anew[1:4094][1:4094])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     57, Loop is parallelizable
     59, Loop is parallelizable
         Accelerator kernel generated
         57, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
         59, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
             CC 1.3 : 16 registers; 576 shared, 32 constant, 0 local memory bytes; 50% occupancy
             CC 2.0 : 19 registers; 520 shared, 80 constant, 0 local memory bytes; 33% occupancy
         63, Max reduction generated for error
     68, Generating copyout(A[1:4094][1:4094])
         Generating copyin(Anew[1:4094][1:4094])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     69, Loop is parallelizable
     71, Loop is parallelizable
         Accelerator kernel generated
         69, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
         71, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
             CC 1.3 : 6 registers; 48 shared, 8 constant, 0 local memory bytes; 50% occupancy
             CC 2.0 : 10 registers; 8 shared, 56 constant, 0 local memory bytes; 33% occupancy
The compiler generated a kernel from each of the two directives. It decided what data to copy between the host and the device, and it automatically generated a max-reduction kernel for error. Because the pragma directives left the grid and block sizes up to the compiler, it chose a 512 x 512 grid of 8 x 8 blocks. When the resulting executable runs, it takes 160.47 seconds, much longer than the 38.14 seconds required for an OpenMP version that uses several CPU cores.
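The OpenMP code behind that CPU timing is not shown in this document, but a sketch of what such a version of the iteration loop might look like follows; the pragmas here are an assumption for illustration, not the exact code that produced the 38.14-second figure:

/* Hypothetical OpenMP variant of the two loop nests in laplace2d.c above,
 * shown for comparison only. reduction(max:error) requires OpenMP 3.1+. */
while ( error > tol && iter < iter_max )
{
    error = 0.0;

    #pragma omp parallel for reduction(max:error)
    for( int j = 1; j < n-1; j++)
    {
        for( int i = 1; i < m-1; i++ )
        {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmax( error, fabs(Anew[j][i] - A[j][i]));
        }
    }

    #pragma omp parallel for
    for( int j = 1; j < n-1; j++)
    {
        for( int i = 1; i < m-1; i++ )
        {
            A[j][i] = Anew[j][i];
        }
    }

    if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);

    iter++;
}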
The accelerator timing data collected at runtime makes the reason for the excessive GPU time easy to see:
./laplace2d_acc

Accelerator Kernel Timing data
/nics/b/home/mhorton/training/solutions/001-laplace2D-kernels/laplace2d.c
  main
    68: region entered 1000 times
        time(us): total=77809959 init=234 region=77809725
                  kernels=4481589 data=72381331
                  w/o init: total=77809725 max=372055 min=76309 avg=77809
        71: kernel launched 1000 times
            grid: [512x512]  block: [8x8]
            time(us): total=4481589 max=4491 min=4474 avg=4481
/nics/b/home/mhorton/training/solutions/001-laplace2D-kernels/laplace2d.c
  main
    56: region entered 1000 times
        time(us): total=83634002 init=172860 region=83461142
                  kernels=9781822 data=70638021
                  w/o init: total=83461142 max=371159 min=82113 avg=83461
        59: kernel launched 1000 times
            grid: [512x512]  block: [8x8]
            time(us): total=9324028 max=9645 min=9297 avg=9324
        63: kernel launched 1000 times
            grid: [1]  block: [256]
            time(us): total=457794 max=463 min=456 avg=457
Jacobi relaxation Calculation: 4096 x 4096 mesh
The majority of the run time is spent copying data between the host and the device, because the arrays are copied to the device and back each of the 1000 times a kernels region is entered. They only need to be copied to the device once before the iteration loop and back once after it. This is accomplished with a data region, like so:
#pragma acc data copy(A), create(Anew)
while ( error > tol && iter < iter_max )
{
    error = 0.0;

    #pragma acc kernels
    for( int j = 1; j < n-1; j++)
    {
        for( int i = 1; i < m-1; i++ )
        {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmax( error, fabs(Anew[j][i] - A[j][i]));
        }
    }

    #pragma acc kernels
    for( int j = 1; j < n-1; j++)
    {
        for( int i = 1; i < m-1; i++ )
        {
            A[j][i] = Anew[j][i];
        }
    }

    if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);

    iter++;
}
Now the GPU run time is 14.69 seconds, well below the 38.14 seconds of the CPU OpenMP version.
We can do one more relatively simple thing to eke out a little more performance: give the compiler information about the grid and block sizes through gang and vector clauses on the loops. The following does just that:
#pragma acc data copy(A), create(Anew)
while ( error > tol && iter < iter_max )
{
    error = 0.0;

    #pragma acc kernels loop gang(32), vector(16)
    for( int j = 1; j < n-1; j++)
    {
        #pragma acc loop gang(16), vector(32)
        for( int i = 1; i < m-1; i++ )
        {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmax( error, fabs(Anew[j][i] - A[j][i]));
        }
    }

    #pragma acc kernels
    for( int j = 1; j < n-1; j++)
    {
        #pragma acc loop gang(16), vector(32)
        for( int i = 1; i < m-1; i++ )
        {
            A[j][i] = Anew[j][i];
        }
    }

    if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);

    iter++;
}
Now the run time is 8.09 seconds.