Quick Start Guide
- System Overview
- Getting a NICS Account
- Getting Help
- Logging In
- Configuring your Environment
- File Systems and Storage
- Software Development
- Running Jobs
- Known Problems
System Overview
The Keeneland Initial Delivery (KID) system was delivered in October 2010. It is an HP SL-390 (Ariston) cluster with Intel Westmere hex-core CPUs, NVIDIA 6GB Fermi GPUs, and a QLogic QDR InfiniBand interconnect. Each node has two hex-core CPUs and 3 GPUs, for a total of 120 nodes, 240 CPUs, and 360 GPUs.
Jobs are charged like so:
1 node-hr = 16 (KFS) CPU-hrs = 3 GPU-hrs = 3 SUs.
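For example, at 3 SUs per node-hour, a job that uses 4 nodes for 2 hours is charged 4 × 2 × 3 = 24 SUs.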
Getting a NICS Account
Please see Getting Access to KIDS for details on getting an account.
Once you have an account, you will be added to the Keeneland Users mailing list. System-wide announcements are broadcast to this list.
Getting Help
Please direct any questions to help@xsede.org. To ensure your question gets routed correctly, please include "Keeneland" in the subject line.
Logging In
To log in to the KID system, SSH to kids.gatech.xsede.org using your NICS username and, when prompted for a PASSCODE, enter your PIN immediately followed by your token code.
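For example, assuming your NICS username is jdoe (substitute your own), a login session starts like this:

ssh jdoe@kids.gatech.xsede.org
# At the PASSCODE prompt, enter your PIN immediately followed by the
# current code shown on your token.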
Configuring your Environment
Modules
On Keeneland, modules are used to manage the environment, for example, changing PATH or LD_LIBRARY_PATH to use different applications or libraries. Of particular note are the PE- modules, which select the compiler vendor; modules for libraries often check the loaded PE- module to determine which build of the library to use.
For more information, including a list of commands, see Modules.
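As a minimal sketch (module and version names other than PE-gnu are assumptions; check 'module avail' for what is actually installed), typical module commands look like:

module list                   # show currently loaded modules
module avail                  # list all modules available on the system
module swap PE-intel PE-gnu   # switch the programming environment to the GNU compilers
module load cuda              # load an additional toolkit or library module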
File Systems and Storage
Home Directories
Each user is provided with a home directory to store frequently used items such as source code, binaries, and scripts. Home directories are shared among all NICS resources; for more information, see NICS Home Directories.
Groups may also request NFS project directories. These are intended for sharing files within a group; see NICS Project Directories.
Scratch Directories
Scratch directories are on a parallel file system and are intended to provide high-performance access to temporary input and output files. There is no quota; however, files that have not been accessed in 30 days may be purged. Scratch directories on Keeneland are provided at:
/lustre/medusa/<username>
For more information, see Lustre.
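For example, a common pattern is to stage a run in your scratch directory rather than in your home directory (the directory and file names below are placeholders):

cd /lustre/medusa/$USER          # your scratch directory
mkdir -p my_run                  # hypothetical run directory
cp ~/project/input.dat my_run/   # copy input files from your home directory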
Software Development
As with most other software on the Keeneland system, the software development toolchain packages are managed using modules. See the modules section for more information.
Compilers
There are several compilers available on the Keeneland ID system: Intel, GNU, and PGI.
The GNU compilers are installed in system default locations and thus are always in the user's PATH, though the PE-gnu module is required in order for mpicc to use gcc.
Note that only certain versions of the PGI compilers support PGI accelerator directives and CUDA Fortran.
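As an illustration (the PE-intel module name and the file names are assumptions based on the PE-gnu naming above), switching programming environments changes which compiler mpicc wraps:

module swap PE-intel PE-gnu   # make mpicc use gcc
mpicc -O2 -o hello hello.c    # now compiles with the GNU toolchain
mpicc -show                   # MVAPICH2/MPICH2: print the underlying compiler command
# (for OpenMPI, the equivalent option is 'mpicc -showme')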
CUDA
The CUDA toolkit and the NVIDIA GPU Computing SDK are installed on the system.
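A minimal sketch of compiling a CUDA source file, assuming a cuda module provides nvcc (the module and file names are assumptions):

module load cuda
nvcc -O2 -arch=sm_20 -o saxpy saxpy.cu   # sm_20 targets the Fermi GPUs on KID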
MPI
There are a few MPI implementations available on the Keeneland system: OpenMPI, MVAPICH2, and MPICH2.
Select one of these MPI implementations using a command like module load openmpi/1.5.1-intel.
Note that there are also MPI implementations installed under /usr/mpi on the Keeneland ID login nodes as part of the OpenFabrics Enterprise Distribution (OFED) software stack. These installations will not work correctly on the Keeneland ID system because they have not been built to integrate with the resource management software used on Keeneland (i.e., Torque).
See Running Jobs for information about launching MPI-based programs from batch jobs.
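For example, after loading one of the provided MPI modules you can confirm that the Torque-integrated build, not the OFED copy under /usr/mpi, is first in your PATH:

module load openmpi/1.5.1-intel
which mpirun   # should point under /sw/keeneland/openmpi/1.5.1-intel, not /usr/mpi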
Version Control Systems
Subversion and Git are available on the system.
Running Jobs
Batch Jobs
Keeneland uses Torque (an open-source PBS derivative) as its batch queue software, with the Moab scheduler, similar to other systems at NICS. There are some important differences, as described below. Here is an example batch queue script (see the notes afterward for some explanation). This assumes that you have set up the modules in your .bash_profile as described in the Modules section of this document.
#!/bin/sh
#PBS -N kiat-imb
#PBS -j oe
#PBS -A UT-TENN0000

### Unused PBS options ###
## If left commented, must be specified when the job is submitted:
## 'qsub -l walltime=hh:mm:ss,nodes=2:ppn=12:gpus=3:shared '
##
##PBS -l walltime=00:30:00
##PBS -l nodes=2:ppn=12:gpus=3:shared
### End of PBS options ###

date
cd $PBS_O_WORKDIR

echo "nodefile="
cat $PBS_NODEFILE
echo "=end nodefile"

# run the program
which mpirun
mpirun --mca mpi_paffinity_alone 1 /bin/hostname

date
# eof
Notes on Batch Scripts
- The scheduler is set up to give exclusive access to nodes, so there should be no need to add a flag (like "-l naccesspolicy=singletask") to ensure each job gets its node to itself.
- A -S parameter to PBS is required if you want to use a shell other than bash; adding something like #!/bin/ksh as the first line is not enough to choose a different shell (see the sketch after this list).
- If you write batch scripts for a shell other than bash, you must be sure that the module setup has been done as described in Modules.
- If you are sharing your script with anyone else, you must be sure that everyone who uses your script has done this setup. Since this is burdensome and error-prone, you may want to do the module setup explicitly in the batch script if you are using a non-bash shell for your batch scripts.
- The account number is required. It is the same number as the project(s) to which your NICS account is tied.
- If your environment is set up correctly and you are using the OpenMPI from /sw/keeneland/openmpi/1.5.1-intel (check the output of 'which mpirun' from running this script), you should not need to pass either '-np 2' or '-hostfile $PBS_NODEFILE' to the mpirun command. If your mpirun does not work, your environment may be picking up a different mpirun that was not built with Torque integration.
- This job script does not hard-code the number of nodes or processes per node, so these need to be specified on the command line. If you wish to specify them in the batch script, add:
  #PBS -l walltime=hh:mm:ss,nodes=2:ppn=12:gpus=3:shared
- See NUMA for information on the mpi_paffinity_alone parameter.
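As referenced in the notes above, a sketch of selecting a non-bash shell with the -S option at submission time (the resource values match the example script):

qsub -S /bin/ksh -l walltime=00:30:00,nodes=2:ppn=12:gpus=3:shared kiat-imb.ksh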
NUMA
OpenMPI has optional NUMA support. It has to be built into OpenMPI at compile time and is not enabled by default. If it is not built in, we have been using mpirun to start shell scripts that attempt to use numactl to control process and memory placement. If it is built in, the mechanism is much simpler: pass '--mca mpi_paffinity_alone 1' to mpirun when you start your program, and do not use a separate script.
Check if the OpenMPI you are using has NUMA built in with:
ompi_info | grep affinity
If it is built into the OpenMPI you are using, there will be lines like:
MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4.3)
MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4.3)
MCA maffinity: libnuma (MCA v2.0, API v2.0, Component v1.4.3)
Launching Jobs
Submit jobs with the qsub command. This example submits a job that uses 12 processes per node on 2 nodes for 30 minutes:
qsub -l walltime=00:30:00,nodes=2:ppn=12:gpus=3:shared kiat-imb.ksh
Queues
There are several queues defined for the ID system. The output of qstat -q shows the queues, their restrictions, and their state (e.g., 'enabled', 'running'):
kidlogin1.nics.utk.edu$ qstat -q

server: kidserv1.nics.utk.edu

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
serial             --      --    48:00:00    1    0   0 --   E R
hpss               --      --    48:00:00   --    0   0 --   E R
capability         --      --    48:00:00  110    0   0 --   E S
parallel           --      --    48:00:00   60    0   0 --   E R
dmover             --      --    48:00:00   --    0   0 --   E R
batch              --      --       --     --    0   0 --   E R
                                                ----- -----
                                                    0     1
- Use 'qstat -a' (or simply 'qstat' if you prefer the default output) to see jobs in the batch queues.
- Use 'qstat -f' to see full information about all jobs, or 'qstat -f <id>' to see full information about the job with id <id>.
- The Moab command 'showq' shows the scheduler's view of the queues. 'qstat -f' and 'showq' complement each other in explaining why your job isn't running.
- The Moab command 'checkjob <id>' can also help troubleshoot problems with job <id> (see the sketch after this list).
- There is currently no need to specify a queue; jobs are placed in a specific queue based on their size.
- Specifically, the 'batch' queue is a gateway queue: submit to it, and Torque determines which other queue the job belongs in. Empirically, a job that requests a single node ends up in the 'serial' queue, a job that requests 2 to 72 nodes ends up in the 'parallel' queue, and a job that requests more than 72 nodes ends up in the 'capability' queue. The 'hpss' and 'dmover' queues are special-purpose; you should never choose them.
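As a sketch of the monitoring commands above (the job id 12345 is hypothetical):

qstat -a         # one line per job in the batch queues
qstat -f 12345   # full details for job 12345
showq            # Moab's view of the queues
checkjob 12345   # Moab diagnostics for job 12345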
Output
- With the PBS options in the example batch script above (specifically -N and -j oe), the output of the job with id <id> will go into a single file named kiat-imb.o<id> after the run completes.
- As the queue software is currently configured, the temporary output of a job is available in a file named something like <id>.kidserv1.nics.utk.edu.OU in the directory from which the qsub was done. If you want to keep an eye on a job as it runs, you can tail this file (see the example after this list).
  - If you are running from an NFS file system, tail will complain about a stale NFS handle when the job is done; the same output is then available in the .o<id> file described above.
  - If you are running from a GPFS file system (recommended), no such warning is produced when the job completes, though the tail program terminates.
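For example, to watch a running job's output from the directory where you ran qsub (the job id 12345 is hypothetical):

tail -f 12345.kidserv1.nics.utk.edu.OU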
Known Problems
- When you compile, you may see the following warning from the Intel compiler:
  /opt/intel/Compiler/11.1/073/lib/intel64/libimf.so: warning: warning: feupdateenv is not implemented and will always fail.
  This warning seems to be benign unless you are using fenv functions from C99. See the Intel forums for more discussion of this issue. (Note: adding "-shared-intel" avoids this warning, but causes your executable to use the shared object versions of the Intel libraries.)