All non-interactive jobs on Keeneland must be submitted through a job script using the qsub command. Job scripts generally start with a series of #PBS declarations that describe the job's requirements to the batch system/scheduler. The rest is a shell script that sets up and runs the executable; the mpirun command is used to launch one or more parallel executables on the compute nodes.
The following options may all be supplied in the script with the #PBS directive, or as command-line flags:
$ qsub <options> batch.pbs
The recommendation is to use directives unless you wish to run the same script with different PBS options (for example, if you wish to set up job dependencies).
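As an illustration of the directive form, a minimal script header might look like the sketch below (the account name and limits are illustrative; note that #PBS lines are ordinary comments to the shell, so the script runs as plain bash):

```shell
#!/bin/bash
# PBS reads the #PBS lines as directives; to bash they are just comments.
#PBS -A TG-EXAMPLE01
#PBS -l nodes=1:ppn=12,walltime=00:30:00

# The remainder of the script is the job body.
MSG="job body starts here"
echo "$MSG"
```

The equivalent command-line form would be: $ qsub -A TG-EXAMPLE01 -l nodes=1:ppn=12,walltime=00:30:00 batch.pbs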
The most common PBS options are given below. For a complete list, see the qsub man page (man qsub).
-A <account>
- Account to charge the job to
-l nodes=<nodes>:ppn=<procs per node>:gpus=<gpus per node>
- Specify job size parameters. Note that procs per node must be an integer from 1 to 12.
-l walltime=<hh:mm:ss>
- Time allocated to the job in hours, minutes, and seconds.
-m a|b|e
- Send an email notification when the job aborts, begins, or ends, respectively. You may specify all three options if desired (-m abe).
-M <email address>
- Email address to send the above notifications to
-j oe
- Joins the error and output streams into a single output file (or into the error file if -j eo is given)
- Your job will be killed when its walltime is reached, so the requested walltime should be an over-estimate. It is generally beneficial to make it as accurate as possible, however, as this information helps the scheduler determine when your job can run; long-running jobs will generally never backfill.
- The scheduler uses the exit status of the last command in the script as the exit status for the job. If you wish to do cleanup after the main CUDA/MPI program, it may be helpful to check the exit status explicitly. See the example script below.
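As a sketch of that pattern, the status of the main program can be saved before any cleanup runs (here false is a stand-in for the real mpirun invocation, so the snippet always takes the failure branch):

```shell
#!/bin/bash
# Stand-in for the main CUDA/MPI program; `false` always exits nonzero.
false
PRGM_EXIT=$?

# Cleanup commands could run here without disturbing the saved status.
if [ "$PRGM_EXIT" -eq 0 ]; then
    echo "main program succeeded"
else
    echo "main program failed with status $PRGM_EXIT"
fi
# A real job script would finish with: exit $PRGM_EXIT
```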
Jobs start in the user's home directory. If you wish to use the directory from which the job was submitted, you may want to use cd $PBS_O_WORKDIR.
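A minimal sketch of that first step (outside PBS the variable is unset, so the snippet falls back to the current directory in order to run standalone):

```shell
#!/bin/bash
# PBS sets $PBS_O_WORKDIR to the submission directory; fall back for testing.
PBS_O_WORKDIR="${PBS_O_WORKDIR:-$(pwd)}"
cd "$PBS_O_WORKDIR" || exit 1
echo "working directory: $(pwd)"
```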
The use of #PBS -V to import all environment variables at the time of submission is discouraged, as this can cause issues with the batch system, particularly when there are large and/or many variables. Best practice is to set up the environment explicitly within the batch script; this also makes the batch script more self-contained and therefore easier to debug. If necessary, #PBS -v <VAR> may be used to import specific environment variables.
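For example, rather than relying on -V, the variables a job needs can be set directly in the script body (the names and values below are purely illustrative):

```shell
#!/bin/bash
# Set up the environment explicitly instead of importing it with #PBS -V.
export OMP_NUM_THREADS=12           # illustrative value
export SCRATCH_DIR="$HOME/scratch"  # illustrative path
echo "threads=$OMP_NUM_THREADS dir=$SCRATCH_DIR"
```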
The following example shows a simple job script that executes ./prgm.x on 48 cores (4 nodes with 12 processes per node) and 12 GPUs (3 GPUs per node), charged to the fictitious account TG-EXAMPLE01 with a wall-clock limit of one hour and 35 minutes. It will email firstname.lastname@example.org when it begins.
#!/bin/bash
#PBS -A TG-EXAMPLE01
#PBS -l nodes=4:ppn=12:gpus=3:shared,walltime=01:35:00
#PBS -m b
#PBS -S /bin/bash
#PBS -M firstname.lastname@example.org

cd $PBS_O_WORKDIR
mpirun ./prgm.x
PRGM_EXIT=$?
if [ "$PRGM_EXIT" == "0" ]; then
    echo "success"
else
    echo "failure"
fi
exit $PRGM_EXIT
If you wished to submit it to run only after job 1234 exits successfully, you can submit it as follows, assuming the job script is called job.pbs:
$ qsub -W depend=afterok:1234 job.pbs