OpenACC

OpenACC is an open parallel programming standard that allows programmers to provide simple hints, known as “directives,” to the compiler, identifying which regions of code to accelerate without requiring them to modify or adapt the underlying code itself. By exposing parallelism to the compiler, directives allow the compiler to do the detailed work of mapping the computation onto the accelerator.

OpenACC was developed by PGI, Cray, and NVIDIA with support from CAPS. Companies with products committed to the OpenACC standard are:

  • Portland Group (PGI): Accelerator Compiler
  • CAPS: HMPP Workbench
  • Cray Inc.: Cray Compilation Environment

The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs and accelerators.

The directives and programming model allow programmers to create high-level host+accelerator programs without the need to explicitly initialize the accelerator, manage data or program transfers between the host and accelerator, or initiate accelerator startup and shutdown. All of these details are implicit in the programming model and are managed by the OpenACC API-enabled compilers and runtimes.

The programming model allows the programmer to augment information available to the compilers, including specification of data local to an accelerator, guidance on mapping of loops onto an accelerator, and similar performance-related details.
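
As a quick illustration of the directive approach (a minimal sketch, not taken from the Keeneland materials), a single directive is often all that is needed; the compiler generates the kernel and manages the data movement:

// Hypothetical example: offload a SAXPY loop with one OpenACC directive.
// The compiler copies x and y to the device, runs the loop there, and
// copies y back when the region ends.
void saxpy(int n, float a, float *restrict x, float *restrict y)
{
#pragma acc kernels
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}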

OpenACC on Keeneland

On Keeneland, the PGI Accelerator compiler is available; the installed version changes over time. Load the compiler like so:

module unload PE-intel
module load PE-pgi
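
Because the installed version changes, you can check which PGI compiler the module actually loaded; for example, the -V flag prints the compiler version:

pgcc -V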

You can find detailed information about the GPU devices on a compute node by executing the following command while logged onto one:

pgaccelinfo

You can specify which GPU to run on like so:

export ACC_DEVICE_NUM=0
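
Alternatively, the device can be selected from within the program using the standard OpenACC runtime API; a minimal sketch (not Keeneland-specific):

#include <openacc.h>

/* Select GPU 0 programmatically; this has the same effect as setting
   ACC_DEVICE_NUM=0 in the environment. */
acc_set_device_num(0, acc_device_nvidia);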

A typical compile command will look like the following:

pgcc -acc -ta=nvidia,time -Minfo=accel laplace2d.c -o laplace2d_acc

In the above command, -acc enables compilation of the OpenACC directives; -ta=nvidia,time targets an NVIDIA GPU and enables timing information at runtime; -Minfo=accel outputs compile-time information about the translation from C to CUDA. The following command will run the OpenACC executable created above:

qsub -A <account> laplace2d_acc.job

where <account> is your project allocation and laplace2d_acc.job contains the following:

#!/bin/bash
#PBS -l walltime=00:05:00
#PBS -j oe

cd $PBS_O_WORKDIR

./laplace2d_acc

OpenACC Example

The following code (laplace2d.c) is the canonical OpenACC-friendly example: a Jacobi relaxation on a 4096 x 4096 mesh.

/*
 *  Copyright 2012 NVIDIA Corporation
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 */

#include <math.h>
#include <string.h>
#include <stdio.h>
#include <openacc.h>

#include <stdlib.h>
#include <sys/time.h>

#define NN 4096
#define NM 4096

double A[NN][NM];
double Anew[NN][NM];

struct timeval timerStart;

void StartTimer()
{
    gettimeofday(&timerStart, NULL);
}

// time elapsed in ms
double GetTimer()
{
    struct timeval timerStop, timerElapsed;
    gettimeofday(&timerStop, NULL);
    timersub(&timerStop, &timerStart, &timerElapsed);
    return timerElapsed.tv_sec*1000.0+timerElapsed.tv_usec/1000.0;
}

int main(int argc, char** argv)
{
    const int n = NN;
    const int m = NM;
    const int iter_max = 1000;

    const double tol = 1.0e-6;
    double error     = 1.0;

    memset(A, 0, n * m * sizeof(double));
    memset(Anew, 0, n * m * sizeof(double));

    for (int j = 0; j < n; j++)
    {
        A[j][0]    = 1.0;
        Anew[j][0] = 1.0;
    }

    printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);

    StartTimer();
    int iter = 0;

    while ( error > tol && iter < iter_max )
    {
        error = 0.0;

        // Offload the stencil update to the accelerator; the compiler also
        // generates a max reduction for 'error'
#pragma acc kernels
        for( int j = 1; j < n-1; j++)
        {   
            for( int i = 1; i < m-1; i++ )
            {   
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }
        
        // Copy the updated values back into A for the next iteration
#pragma acc kernels
        for( int j = 1; j < n-1; j++)
        {   
            for( int i = 1; i < m-1; i++ )
            {   
                A[j][i] = Anew[j][i];
            }
        }
        
        if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        
        iter++;
    }
    
    double runtime = GetTimer();
    
    printf(" total: %f s\n", runtime / 1000);
}

In the above code example, we have asked the OpenACC compiler to offload two areas of code onto the GPU. The following information comes from the compiler at compile time:

pgcc  -acc -ta=nvidia,time -Minfo=accel -o laplace2d_acc laplace2d.c
main:
     56, Generating copyin(A[:][:])
         Generating copyout(Anew[1:4094][1:4094])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     57, Loop is parallelizable
     59, Loop is parallelizable
         Accelerator kernel generated
         57, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
         59, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
             CC 1.3 : 16 registers; 576 shared, 32 constant, 0 local memory bytes; 50% occupancy
             CC 2.0 : 19 registers; 520 shared, 80 constant, 0 local memory bytes; 33% occupancy
         63, Max reduction generated for error
     68, Generating copyout(A[1:4094][1:4094])
         Generating copyin(Anew[1:4094][1:4094])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     69, Loop is parallelizable
     71, Loop is parallelizable
         Accelerator kernel generated
         69, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
         71, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
             CC 1.3 : 6 registers; 48 shared, 8 constant, 0 local memory bytes; 50% occupancy
             CC 2.0 : 10 registers; 8 shared, 56 constant, 0 local memory bytes; 33% occupancy

It can be seen that the compiler generated a kernel from each of the directives, decided what data to copy between the host and the device, and automatically created a max reduction kernel for error. Because the pragmas left the grid and block sizes unspecified, the compiler chose a 512 x 512 grid of 8 x 8 blocks. The resulting executable takes 160.47 seconds to run, much longer than the 38.14 seconds required for an OpenMP version that uses several CPU cores.
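
For reference, a minimal sketch of how the main loop nest might be parallelized in that OpenMP version, compiled with -mp instead of -acc (assuming OpenMP 3.1 or later for the max reduction; this is not necessarily the exact code behind the 38.14-second measurement):

#pragma omp parallel for reduction(max:error)
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }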

At runtime, the reason for the excessive GPU time is easy to see in the timing output:

./laplace2d_acc

Accelerator Kernel Timing data
/nics/b/home/mhorton/training/solutions/001-laplace2D-kernels/laplace2d.c
  main
    68: region entered 1000 times
        time(us): total=77809959 init=234 region=77809725
                  kernels=4481589 data=72381331
        w/o init: total=77809725 max=372055 min=76309 avg=77809
        71: kernel launched 1000 times
            grid: [512x512]  block: [8x8]
            time(us): total=4481589 max=4491 min=4474 avg=4481
/nics/b/home/mhorton/training/solutions/001-laplace2D-kernels/laplace2d.c
  main
    56: region entered 1000 times
        time(us): total=83634002 init=172860 region=83461142
                  kernels=9781822 data=70638021
        w/o init: total=83461142 max=371159 min=82113 avg=83461
        59: kernel launched 1000 times
            grid: [512x512]  block: [8x8]
            time(us): total=9324028 max=9645 min=9297 avg=9324
        63: kernel launched 1000 times
            grid: [1]  block: [256]
            time(us): total=457794 max=463 min=456 avg=457
Jacobi relaxation Calculation: 4096 x 4096 mesh

The majority of the run time is spent copying data between the host and the device: for each of the 1000 iterations, the entire arrays are copied to the device and back again. The data only needs to be copied to the device once at the beginning and back to the host once at the end. This is accomplished with a data region, where copy(A) transfers A to the device on entry and back to the host on exit, and create(Anew) allocates Anew on the device without any transfers:

#pragma acc data copy(A), create(Anew)
    while ( error > tol && iter < iter_max )
    {
        error = 0.0;
#pragma acc kernels
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }
#pragma acc kernels
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                A[j][i] = Anew[j][i];
            }
        }
        if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        iter++;
    }

Now the GPU run time is 14.69 seconds, well below the 38.14 seconds of the CPU OpenMP version.

We can do one more relatively simple thing to eke out a little more performance: give the compiler explicit information about the grid and block sizes through gang and vector clauses. The following does just that:

#pragma acc data copy(A), create(Anew)
    while ( error > tol && iter < iter_max )
    {
        error = 0.0;
#pragma acc kernels loop gang(32), vector(16)
        for( int j = 1; j < n-1; j++)
        {
#pragma acc loop gang(16), vector(32)
            for( int i = 1; i < m-1; i++ )
            {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }
#pragma acc kernels loop gang(32), vector(16)
        for( int j = 1; j < n-1; j++)
        {
#pragma acc loop gang(16), vector(32)
            for( int i = 1; i < m-1; i++ )
            {
                A[j][i] = Anew[j][i];
            }
        }
        if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        iter++;
    }

Now the run time is 8.09 seconds.