OpenACC
OpenACC is an open parallel programming standard that allows programmers to provide simple hints, known as “directives,” to the compiler, identifying which regions of code to accelerate without requiring them to modify or adapt the underlying code itself. By exposing parallelism to the compiler, directives let the compiler do the detailed work of mapping the computation onto the accelerator.
OpenACC was developed by PGI, Cray, and NVIDIA with support from CAPS. Companies with products committed to the OpenACC standard are:
- Portland Group (PGI): Accelerator Compiler
- CAPS: HMPP Workbench
- Cray: Cray Compilation Environment
The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs and accelerators.
The directives and programming model allow programmers to create high-level host+accelerator programs without the need to explicitly initialize the accelerator, manage data or program transfers between the host and accelerator, or initiate accelerator startup and shutdown. All of these details are implicit in the programming model and are managed by the OpenACC API-enabled compilers and runtimes.
The programming model allows the programmer to augment information available to the compilers, including specification of data local to an accelerator, guidance on mapping of loops onto an accelerator, and similar performance-related details.
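As a small illustration (this snippet is not part of the Keeneland example that follows), a single directive on a SAXPY loop is enough to ask the compiler to offload it; the optional data clauses show the kind of extra information a programmer can supply:

#include <stdio.h>
#include <stdlib.h>

/* The directive exposes the loop's parallelism; the compiler maps the
   iterations onto the accelerator and handles the kernel launch.
   The copyin/copy clauses are optional hints about data movement. */
void saxpy(int n, float a, float *restrict x, float *restrict y)
{
    #pragma acc kernels loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    int n = 1 << 20;
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(n, 3.0f, x, y);
    printf("y[0] = %f\n", y[0]);   /* expect 5.0 */
    free(x);
    free(y);
    return 0;
}

Compiled with the pgcc -acc flags shown later in this page, -Minfo=accel reports how the loop was mapped onto the GPU.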
OpenACC on Keeneland
On Keeneland, we have PGI Accelerator; the version number is a moving target. Load the compiler like so:
module unload PE-intel
module load PE-pgi
You can find detailed information about the GPU devices by executing the following command while logged onto a compute node:
pgaccelinfo
You can specify which GPU to run on like so:
export ACC_DEVICE_NUM=0
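The same selection can be made from inside a program through the OpenACC runtime API declared in openacc.h. The sketch below is illustrative; acc_device_nvidia is the implementation-defined device type supplied by the PGI/NVIDIA runtime, and device numbering is assumed to start at 0, matching ACC_DEVICE_NUM above:

#include <stdio.h>
#include <openacc.h>

int main(void)
{
    /* Count the NVIDIA devices visible to the OpenACC runtime. */
    int ndev = acc_get_num_devices(acc_device_nvidia);
    printf("NVIDIA devices visible: %d\n", ndev);

    /* Select the first device (assumes 0-based numbering, as with
       ACC_DEVICE_NUM=0 under the PGI runtime). */
    if (ndev > 0)
        acc_set_device_num(0, acc_device_nvidia);

    return 0;
}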
A typical compile command will look like the following:
pgcc -acc -ta=nvidia,time -Minfo=accel laplace2d.c -o laplace2d_acc
In the above command, -acc enables processing of the OpenACC directives; -ta=nvidia,time targets an NVIDIA GPU and prints timing information at runtime; -Minfo=accel outputs compile-time information about the translation from C to CUDA. The following command submits a job that runs the OpenACC executable created above:
qsub -A <account> laplace2d_acc.job
where <account> is your project account and laplace2d_acc.job contains:
#!/bin/bash
#PBS -l walltime=00:05:00
#PBS -j oe

cd $PBS_O_WORKDIR
./laplace2d_acc
OpenACC Example
The following (laplace2d.c) is the canonical OpenACC-friendly code example:
/*
 * Copyright 2012 NVIDIA Corporation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include <math.h>
#include <string.h>
#include <stdio.h>      /* for printf */
#include <openacc.h>
#include <stdlib.h>
#include <sys/time.h>

#define NN 4096
#define NM 4096

double A[NN][NM];
double Anew[NN][NM];

struct timeval timerStart;

void StartTimer()
{
    gettimeofday(&timerStart, NULL);
}

// time elapsed in ms
double GetTimer()
{
    struct timeval timerStop, timerElapsed;
    gettimeofday(&timerStop, NULL);
    timersub(&timerStop, &timerStart, &timerElapsed);
    return timerElapsed.tv_sec*1000.0 + timerElapsed.tv_usec/1000.0;
}

int main(int argc, char** argv)
{
    const int n = NN;
    const int m = NM;
    const int iter_max = 1000;

    const double tol = 1.0e-6;
    double error = 1.0;

    memset(A, 0, n * m * sizeof(double));
    memset(Anew, 0, n * m * sizeof(double));

    for (int j = 0; j < n; j++)
    {
        A[j][0]    = 1.0;
        Anew[j][0] = 1.0;
    }

    printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);

    StartTimer();
    int iter = 0;

    while ( error > tol && iter < iter_max )
    {
        error = 0.0;

#pragma acc kernels
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }

#pragma acc kernels
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                A[j][i] = Anew[j][i];
            }
        }

        if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);

        iter++;
    }

    double runtime = GetTimer();

    printf(" total: %f s\n", runtime / 1000);
}
In the above code example, we have asked the OpenACC compiler to offload two areas of code onto the GPU. The following information comes from the compiler at compile time:
pgcc -acc -ta=nvidia,time -Minfo=accel -o laplace2d_acc laplace2d.c
main:
56, Generating copyin(A[:][:])
Generating copyout(Anew[1:4094][1:4094])
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
57, Loop is parallelizable
59, Loop is parallelizable
Accelerator kernel generated
57, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
59, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
CC 1.3 : 16 registers; 576 shared, 32 constant, 0 local memory bytes; 50% occupancy
CC 2.0 : 19 registers; 520 shared, 80 constant, 0 local memory bytes; 33% occupancy
63, Max reduction generated for error
68, Generating copyout(A[1:4094][1:4094])
Generating copyin(Anew[1:4094][1:4094])
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
69, Loop is parallelizable
71, Loop is parallelizable
Accelerator kernel generated
69, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
71, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
CC 1.3 : 6 registers; 48 shared, 8 constant, 0 local memory bytes; 50% occupancy
CC 2.0 : 10 registers; 8 shared, 56 constant, 0 local memory bytes; 33% occupancy
It can be seen that the compiler generated a kernel for each of the directives. It decided what data to copy between the host and the device, and it automatically created a max reduction kernel for error. Because the pragma directives left grid and block sizes up to the compiler, it chose a 512 x 512 grid of 8 x 8 blocks. When the executable created from the above compile runs, it takes 160.47 seconds, much longer than the 38.14 seconds required for an OpenMP version that uses several CPU cores.
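For reference only (the timings below use the unmodified code), the decisions the compiler made for the first loop nest could also be spelled out explicitly in the source. In this sketch the data clauses mirror the copies reported above and the reduction clause names the max reduction the compiler generated for error:

/* Illustrative rewrite of the first loop nest in laplace2d.c;
   n, m, A, Anew, and error are as declared in main(). */
#pragma acc kernels copyin(A) copyout(Anew[1:n-2][1:m-2])
{
    #pragma acc loop reduction(max:error)
    for( int j = 1; j < n-1; j++)
    {
        for( int i = 1; i < m-1; i++ )
        {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmax( error, fabs(Anew[j][i] - A[j][i]));
        }
    }
}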
At runtime, the reason for the excessive time for the GPU version can be easily seen:
./laplace2d_acc
Accelerator Kernel Timing data
/nics/b/home/mhorton/training/solutions/001-laplace2D-kernels/laplace2d.c
main
68: region entered 1000 times
time(us): total=77809959 init=234 region=77809725
kernels=4481589 data=72381331
w/o init: total=77809725 max=372055 min=76309 avg=77809
71: kernel launched 1000 times
grid: [512x512] block: [8x8]
time(us): total=4481589 max=4491 min=4474 avg=4481
/nics/b/home/mhorton/training/solutions/001-laplace2D-kernels/laplace2d.c
main
56: region entered 1000 times
time(us): total=83634002 init=172860 region=83461142
kernels=9781822 data=70638021
w/o init: total=83461142 max=371159 min=82113 avg=83461
59: kernel launched 1000 times
grid: [512x512] block: [8x8]
time(us): total=9324028 max=9645 min=9297 avg=9324
63: kernel launched 1000 times
grid: [1] block: [256]
time(us): total=457794 max=463 min=456 avg=457
Jacobi relaxation Calculation: 4096 x 4096 mesh
The majority of the run time is spent copying data between the host and the device. This is because for each of the 1,000 kernel launches the entire array is copied twice, when it only needs to be copied to the device once at the beginning and back to the host once at the end. Adding a data region accomplishes this:
/* Copy A to the GPU once and back once; Anew is created on the GPU only */
#pragma acc data copy(A), create(Anew)
while ( error > tol && iter < iter_max )
{
    error = 0.0;

#pragma acc kernels
    for( int j = 1; j < n-1; j++)
    {
        for( int i = 1; i < m-1; i++ )
        {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmax( error, fabs(Anew[j][i] - A[j][i]));
        }
    }

#pragma acc kernels
    for( int j = 1; j < n-1; j++)
    {
        for( int i = 1; i < m-1; i++ )
        {
            A[j][i] = Anew[j][i];
        }
    }

    if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);

    iter++;
}
Now the GPU run time is 14.69 seconds, much lower than the CPU OpenMP version.
We can do one more relatively simple thing to eke out a little more performance. We can give the compiler information about the grid size and the block size. The following does just that:
#pragma acc data copy(A), create(Anew)
while ( error > tol && iter < iter_max )
{
    error = 0.0;

/* gang() and vector() tell the compiler the grid and block dimensions to use */
#pragma acc kernels loop gang(32), vector(16)
    for( int j = 1; j < n-1; j++)
    {
#pragma acc loop gang(16), vector(32)
        for( int i = 1; i < m-1; i++ )
        {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmax( error, fabs(Anew[j][i] - A[j][i]));
        }
    }

#pragma acc kernels
    for( int j = 1; j < n-1; j++)
    {
#pragma acc loop gang(16), vector(32)
        for( int i = 1; i < m-1; i++ )
        {
            A[j][i] = Anew[j][i];
        }
    }

    if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);

    iter++;
}
Now the run time is 8.09 seconds.



