Quick Start Guide

  1. System Overview
  2. Getting a NICS Account
  3. Getting Help
  4. Logging In
  5. Configuring your Environment
    1. Modules
    2. Notes
  6. File Systems and Storage
  7. Software Development
    1. Compilers
    2. CUDA
    3. OpenCL
    4. MPI
    5. Version Control Systems
  8. Running Jobs
    1. Batch Jobs
    2. Notes on Batch Scripts
    3. NUMA
    4. Launching Jobs
    5. Queues
    6. Output
  9. Known Problems

System Overview

The Keeneland Initial Delivery (KID) system was delivered in October 2010. It is composed of an HP SL-390 (Ariston) cluster with Intel Westmere hex-core CPUs, NVIDIA 6GB Fermi GPUs, and a QLogic QDR InfiniBand interconnect. Each node has two hex-core CPUs and 3 GPUs, for a total of 120 nodes, 240 CPUs, and 360 GPUs.

Jobs are charged like so:

1 node-hr = 16 (KFS) CPU-hrs = 3 GPU-hrs = 3 SUs.
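
For example, a job that uses 2 nodes for 4 hours consumes 2 x 4 = 8 node-hrs and is charged 8 x 3 = 24 SUs.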

Getting a NICS Account

Please see Getting Access to KIDS for details on getting an account.

Once you have an account, you will be added to the Keeneland Users mailing list. System-wide announcements will be broadcast to this list.

Getting Help

Please direct any questions to help@xsede.org. To ensure your question gets routed correctly, please include "Keeneland" in the subject line.

Logging In

To log in to the KID system, SSH to kids.gatech.xsede.org using your NICS account as your username and your 'PIN+token code' as your PASSCODE.
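
For example (replace 'username' with your NICS username; you will be prompted for the PASSCODE):

ssh username@kids.gatech.xsede.org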

Configuring your Environment

Modules

On Keeneland, modules are used to manage the environment, for example, changing PATH or LD_LIBRARY_PATH to use different applications or libraries. Of particular note are the PE- modules, which select the compiler vendor; modules for libraries often check the loaded PE- module to determine which version of the library to use.
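
For example, some common module operations (a minimal sketch; the PE- and MPI module names shown here are taken from examples elsewhere in this guide, so check 'module avail' for what is actually installed):

module list                      # show currently loaded modules
module avail                     # list all available modules
module swap PE-intel PE-gnu      # change the compiler vendor via the PE- modules
module load openmpi/1.5.1-intel  # load an application or library module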

For more information, including a list of commands, see Modules.

File Systems and Storage

Home Directories

Each user is provided with a home directory to store frequently used items such as source code, binaries, and scripts. Home directories are shared among all NICS resources; for more information, see NICS Home Directories.

Groups may also request NFS project directories. These are intended for sharing files among a group; see NICS Project Directories.

Scratch Directories

Scratch directories are on a parallel file system, intended to provide high-performance access to temporary input and output files. There is no quota; however, files that have not been accessed in 30 days may be purged. Scratch directories on Keeneland are provided at:

/lustre/medusa/<username>

For more information, see Lustre.
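
For example, a typical pattern is to stage input into scratch, run the job from there, and copy results you want to keep back to your home or project directory before the purge window (a sketch; the paths under the home directory are illustrative, and $USER expands to your username):

cd /lustre/medusa/$USER      # your scratch directory
cp ~/project/input.dat .     # stage input from home
# ... run the job from this directory ...
cp results.out ~/project/    # copy results back; scratch may be purged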

Software Development

As with most other software on the Keeneland system, the software development toolchain packages are managed using modules. See the modules section for more information.

Compilers

There are several compilers available on the Keeneland ID system: Intel, GNU, and PGI.

The GNU compilers are installed in system default locations, and thus are always in the user's PATH, though the PE-gnu module is required in order for mpicc to use gcc.
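
For example, a minimal sketch of switching to the GNU programming environment and compiling (this assumes PE-intel is the currently loaded PE- module; check 'module list', and note that the source file names are illustrative):

module swap PE-intel PE-gnu          # select the GNU compiler vendor
gcc -O2 -o hello hello.c             # GNU compilers are always in PATH
mpicc -O2 -o hello_mpi hello_mpi.c   # with PE-gnu loaded, mpicc wraps gcc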

Note that only certain versions of the PGI compilers support PGI accelerator directives and CUDA Fortran.
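
For example, a hedged sketch of building with the PGI accelerator directives and CUDA Fortran, assuming a suitably recent PGI version is loaded (the PE-pgi module name follows the PE- convention but is an assumption, as are the source file names):

module swap PE-intel PE-pgi                        # assumed PE- module name for PGI
pgcc -ta=nvidia -Minfo=accel -o accel accel.c      # PGI accelerator directives
pgfortran -Mcuda -o cufort cufort.cuf              # CUDA Fortran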

CUDA

We have CUDA and the NVIDIA GPU Computing SDK on the system.
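
For example, a minimal sketch of building a CUDA program for the system's Fermi GPUs (the 'cuda' module name and the source file are assumptions; check 'module avail' for the exact name):

module load cuda                          # assumed module name
nvcc -O2 -arch=sm_20 -o saxpy saxpy.cu    # sm_20 targets Fermi-class GPUs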

MPI

There are a few MPI implementations available on the Keeneland system: OpenMPI, MVAPICH2, and MPICH2.

Select one of these MPI implementations using a command like module load openmpi/1.5.1-intel.

Note that there are also MPI implementations installed as part of the Open Fabrics Enterprise Distribution (OFED) software stack on the Keeneland ID login nodes in directories under /usr/mpi. These installations will not work correctly on the Keeneland ID system, because they have not been built to be integrated with the resource management software used on the Keeneland system (i.e., Torque).
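
For example, a minimal sketch of selecting OpenMPI and verifying that the Torque-integrated installation, rather than the OFED one under /usr/mpi, will be used (the source file name is illustrative):

module load openmpi/1.5.1-intel
which mpirun                          # should point under /sw/keeneland, not /usr/mpi
mpicc -O2 -o hello_mpi hello_mpi.c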

See Running Jobs for information about launching MPI-based programs from batch jobs.

Version Control Systems

We have Subversion and Git on the system.

Running Jobs

Batch Jobs

Keeneland uses Torque (an open source PBS derivative) as its batch queue software, with the Moab scheduler, similar to other systems at NICS. There are some important differences, as described below. Here's an example batch queue script (see the notes afterward for some explanation). This assumes that you have set up the modules in your .bash_profile as described in the Modules section of this document.

#!/bin/sh
#PBS -N kiat-imb
#PBS -j oe
#PBS -A UT-TENN0000

### Unused PBS options ###
## If left commented, must be specified when the job is submitted:
## 'qsub -l walltime=hh:mm:ss,nodes=2:ppn=12:gpus=3:shared '
##
##PBS -l walltime=00:30:00
##PBS -l nodes=2:ppn=12:gpus=3:shared

### End of PBS options ###

date
cd $PBS_O_WORKDIR

echo "nodefile="
cat $PBS_NODEFILE
echo "=end nodefile"

# run the program

which mpirun
mpirun --mca mpi_paffinity_alone 1 /bin/hostname

date

# eof

Notes on Batch Scripts

  • The scheduler is set up to give exclusive access to nodes, so there should be no need to add a flag (like "-l naccesspolicy=singletask") to ensure each job gets its node to itself.
  • A -S parameter to PBS is required if you want to use a shell other than bash (see the example after this list). Adding something like #!/bin/ksh in the first line is not enough to choose a different shell.
    • If you write batch scripts for a shell other than bash, you must be sure that the module setup has been done as described in Modules.
    • If you share your script with anyone else, you must be sure that everyone who uses it has done this setup. Since this is a burden and error-prone, you may want to do the module setup explicitly in the batch script if you are using a non-bash shell.
  • The account number is required. It is the same number as the project(s) to which your NICS account is tied.
  • If you have your environment set up correctly and are using the OpenMPI from /sw/keeneland/openmpi/1.5.1-intel (check the output of 'which mpirun' when running this script), you should not need to pass either '-np 2' or '-hostfile $PBS_NODEFILE' to the mpirun command. If your mpirun commands don't work, your environment may be picking up the wrong mpirun, one that was not built with Torque integration.
  • This job script does not hard-code the number of nodes or processes per node, so these need to be specified on the command line. If you wish to specify this in the batch script, add:
    #PBS -l walltime=hh:mm:ss,nodes=2:ppn=12:gpus=3:shared
  • See NUMA for information on the mpi_paffinity_alone parameter.
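
For example, to run the example script under ksh instead of bash, pass -S to qsub (a minimal sketch; /bin/ksh is the usual system location for ksh):

qsub -S /bin/ksh -l walltime=00:30:00,nodes=2:ppn=12:gpus=3:shared kiat-imb.ksh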

NUMA

OpenMPI has optional NUMA support. It has to be built into OpenMPI at compile time and is not enabled by default. If it isn't built in, we have been using mpirun to start shell scripts that attempt to use numactl to control process and memory placement (see the sketch at the end of this section). If it is built in, the mechanism is much simpler: pass '--mca mpi_paffinity_alone 1' to mpirun when you start your program, and don't use a separate script.

Check if the OpenMPI you are using has NUMA built in with:

ompi_info | grep affinity

If it is built into the OpenMPI you are using, there will be lines like:

           MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4.3)
           MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4.3)
           MCA maffinity: libnuma (MCA v2.0, API v2.0, Component v1.4.3)
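
If NUMA support is not built in, a wrapper along the following lines can be started by mpirun in place of the application. This is only a hypothetical sketch: the script and program names are made up, the local-rank environment variable is set by OpenMPI, and the two-socket mapping simply reflects the two CPUs per node.

#!/bin/sh
# numa_wrap.sh (hypothetical name): bind each local MPI rank to one socket
# and to that socket's memory, then exec the real program.
socket=$(( OMPI_COMM_WORLD_LOCAL_RANK % 2 ))
exec numactl --cpunodebind=$socket --membind=$socket "$@"

It would then be launched as something like 'mpirun ./numa_wrap.sh ./my_program' so that each rank runs under numactl.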

Launching Jobs

Submit jobs with the qsub command. This example submits a job which uses 12 processes per node on 2 nodes, with all 3 GPUs per node in shared mode, for 30 minutes:

qsub -l walltime=00:30:00,nodes=2:ppn=12:gpus=3:shared kiat-imb.ksh

Queues

There are several queues defined for the ID system. The output of qstat -q shows the queues, their restrictions, and their state (e.g., 'enabled', 'running'):

kidlogin1.nics.utk.edu$ qstat -q
server: kidserv1.nics.utk.edu
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
serial             --      --    48:00:00     1   0   0 --   E R
hpss               --      --    48:00:00   --    0   0 --   E R
capability         --      --    48:00:00   110   0   0 --   E S
parallel           --      --    48:00:00    60   0   0 --   E R
dmover             --      --    48:00:00   --    0   0 --   E R
batch              --      --       --      --    0   0 --   E R
                                               ----- -----
                                                   0     1
  • Use 'qstat -a' (or simply 'qstat' if you prefer the default output) to see jobs in the batch queues.
  • Use 'qstat -f' to see full information about all jobs, or 'qstat -f id' to see full information about the job with id 'id'.
  • The Moab command 'showq' shows the scheduler's view of the queues. 'qstat -f' and 'showq' complement each other in explaining why your job isn't running.
  • The Moab command 'checkjob id' can also help troubleshoot problems with job 'id'.
  • There is no need to specify a queue in the current configuration. Jobs are placed in a specific queue based on their size.
    • Specifically, the 'batch' queue is a gateway queue. Submit to the batch queue, and Torque figures out which queue the job belongs in. Empirically, a job that requests 1 node ends up in the 'serial' queue, a job that requests more than 1 but at most 72 nodes ends up in the 'parallel' queue, and a job that requests more than 72 nodes ends up in the 'capability' queue. The 'hpss' and 'dmover' queues are special-purpose, and you should never choose them.

Output

  • With the PBS options in the example batch script above (specifically -N and -j oe), the output of a job with id <id> will go into a single file named kiat-imb.o<id> after the run completes.
  • As the queue software is currently configured, the temporary output of a job is available in a file named something like <id>.kidserv1.nics.utk.edu.OU in the directory from which the qsub was done. If you want to keep an eye on a job as it runs, you can tail this file (see the example after this list).
    • If you are running from an NFS file system, when tail complains about a stale NFS handle, the job is done and the same output will be available in the .o<id> file described above.
    • If you are running from a GPFS file system (recommended), the system produces no such warning message when the job completes, though the tail program terminates.
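
For example, from the directory where the qsub was done (replace <id> with the numeric job id reported by qsub):

tail -f <id>.kidserv1.nics.utk.edu.OU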

Known Problems

  • When you compile, you may see the following warning from the Intel compiler:
    /opt/intel/Compiler/11.1/073/lib/intel64/libimf.so: warning: 
    warning: feupdateenv is not implemented and will always fail.
    This warning seems to be benign unless you are using fenv functions from C99. See the Intel forums for more discussion about this issue. (Note: adding "-shared-intel" avoids this warning, but causes your executable to use the shared object versions of the Intel libraries.)