Running Amber on Keeneland

Job Configuration

When running Amber with GPU acceleration, best results are achieved by launching a single MPI process per GPU on each node. For example, if a job is to use 6 total GPUs across 2 nodes (3 GPUs per node), then exactly 3 MPI processes should run on each node. The GPUs must be requested in "shared" mode so that Amber can access them properly. All of this can be done with the following line added at the beginning of the job script:

#PBS -l nodes=2:ppn=3:gpus=3:shared
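
The same pattern scales to larger jobs. For example, a 12-GPU job across 4 nodes (keeping 3 GPUs per node, as in the examples here) would be requested with the following line; this variant is illustrative rather than taken from the Keeneland documentation:

#PBS -l nodes=4:ppn=3:gpus=3:shared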

Note that the "gpus" specification above is gpus-per-node, rather than total gpus for the job. Information about gpus in use is printed at the beginning of normal execution output when using a gpu-enabled version of Amber, indicated by the heading "GPU DEVICE INFO." More information about interpreting this output can be found on the Amber website under the subsection "Multi GPU."

Modules

To use Amber12 on Keeneland, simply load the provided module. CUDA version 4.2 is recommended. These commands can be placed in the job script at any point before Amber is executed.

$ module swap cuda/4.1 cuda/4.2
$ module load amber/12
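
To confirm that the environment is set up correctly, list the loaded modules and check that amber/12 and cuda/4.2 appear:

$ module list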

Execution

To run Amber serially with GPU acceleration, use pmemd.cuda. To run Amber with GPU acceleration on two or more GPUs, use pmemd.cuda.MPI.

An example execution might look like the following (given the job configuration shown above):

$ mpiexec -n 6 pmemd.cuda.MPI -O -i mdin -o mdout -p prmtop -c inpcrd
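
For comparison, a serial single-GPU run of the same calculation would use pmemd.cuda directly, with no MPI launcher:

$ pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd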

Performance

Amber runs quite well when GPUs are used for acceleration. Using the standard benchmark input files described on the Amber website, the following timings were observed on Keeneland and Kraken.

These numbers are taken from the highest point of the scaling curve on each system, meaning that adding more resources to a calculation will most likely not increase performance. Note that in most cases Keeneland holds a 3-4x performance advantage, coupled with an 8-32x decrease in SUs consumed.

System                                                 Keeneland (Amber12)               Keeneland (Amber11)              Kraken
DHFR NVE (23,558 atoms, explicit solvent, PME)         75.12 ns/day (3 nodes, 9 GPUs)    55.60 ns/day (3 nodes, 9 GPUs)   19.03 ns/day (6 nodes, 72 cores)
Cellulose NVE (408,609 atoms, explicit solvent, PME)   5.23 ns/day (3 nodes, 9 GPUs)     4.53 ns/day (3 nodes, 9 GPUs)    4.02 ns/day (24 nodes, 288 cores)
Myoglobin (2,492 atoms, implicit solvent, GB)          147.44 ns/day (5 nodes, 10 GPUs)  78.12 ns/day (4 nodes, 12 GPUs)  28.65 ns/day (16 nodes, 192 cores)
Nucleosome (25,095 atoms, implicit solvent, GB)        9.56 ns/day (5 nodes, 15 GPUs)    4.04 ns/day (4 nodes, 12 GPUs)   0.51 ns/day (16 nodes, 192 cores)

Compared with the numbers published on the Amber website, the Keeneland figures may appear slightly lower. Possible reasons for the difference include:

  • ECC is enabled on Keeneland to ensure stability in a wide variety of calculations for all users.
  • Amber on Keeneland is currently built with the NVCC v4.1 compiler, and the Amber developers have noticed a performance increase when moving to NVCC v4.2 and above.

Additional Resources

The Amber website has good recommendations for running on GPU resources. In particular, the sections "Running GPU Accelerated Simulations" and "Maximizing GPU Performance" detail steps that can be taken to get the best possible performance from Amber calculations.

Example Job Script

Here is a minimal job script that runs the "JAC_production_NVE" calculation from the Amber benchmark suite, which can be downloaded from the Amber website.

#!/bin/bash
#PBS -A MY_ACCOUNT000
#PBS -l nodes=2:ppn=3:gpus=3:shared
#PBS -l walltime=01:00:00

module swap cuda/4.1 cuda/4.2
module load amber/12

cd $PBS_O_WORKDIR

mpiexec -n 6 pmemd.cuda.MPI -O -i mdin -o mdout -p prmtop -c inpcrd
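
Assuming the script is saved as amber.pbs (a filename chosen here for illustration), submit it from the directory containing the input files (mdin, prmtop, and inpcrd), since the script changes into $PBS_O_WORKDIR before running:

$ qsub amber.pbs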