Parallel Jobs

To submit parallel jobs, you must request a parallel environment (pe) at submission time through the -pe [environment] [options] flag of qsub. This can be done either directly on the command line, or as a line in the job script: #$ -pe [environment] [options]. The specific environment and options depend on the kind of parallel job you want to run.
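As a sketch (the script name, environment, and slot count are illustrative):

```shell
# Request the environment on the command line at submission time...
qsub -pe openmp 8 my_job.sh

# ...or, equivalently, as a line inside my_job.sh itself:
#$ -pe openmp 8
```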

OpenMP: Dynamic threading, shared memory jobs

OpenMP jobs are launched as a single process that spreads into /threads/ during particular stages of the code, e.g. loops, to speed them up. All threads share the same memory as the master process.

To run an OpenMP job that can make use of up to n threads, use the openmp parallel environment:

-pe openmp [n]

preceded by #$ if added in the job script.

This will reserve n slots on a single node and launch your job. You may ssh into that node (use qstat [job_id] to identify it) and use the top program to check that your job is actually threading as expected.
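A possible way to do this check (the job id and node name are illustrative):

```shell
# List your jobs; for running jobs, the "queue" column shows queue@node
qstat | grep 12345

# On that node, watch your processes; top's -H flag shows individual
# threads (you can also press 'H' inside an already-running top)
ssh node042
top -H -u "$USER"
```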

Resource requests made with the -l flag are per slot; e.g.

#$ -pe openmp 4
#$ -l m_mem_free=4G

reserves 4 slots and a total of 16 GB of memory.
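Putting the pieces together, a minimal OpenMP job script might look like the following sketch (the job name and binary path are assumptions; $NSLOTS is the variable the cluster engine sets to the number of reserved slots):

```shell
#!/bin/bash
#$ -N omp_test            # job name (illustrative)
#$ -pe openmp 4           # 4 slots on a single node
#$ -l m_mem_free=4G       # per slot: 4 x 4 GB = 16 GB in total
#$ -cwd                   # run from the submission directory

export OMP_NUM_THREADS=$NSLOTS   # match thread count to reserved slots
./my_openmp_program              # binary name is illustrative
```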

MPI: Separate-memory, communicating processes

If your code is MPI-capable, you may take advantage of the cluster engine to spread your jobs across multiple nodes automatically, i.e. without having to create an MPI /hosts file/ yourself.

First, you need to use an MPI-capable parallel environment (typically, one with mpi as part of its name). To list the parallel environments available in a particular queue, run

qconf -sq [queue.q]

and look for the pe_list field (if you use grep, make sure to use the -A [X] flag to print some lines after the match, as pe_list may cover multiple lines). E.g. for mps.q, at the time of writing, we find MPI-capable environments available for the two common MPI implementations: OpenMPI (openmpi, openmpi_ib, openmpi_mixed_[X]) and MPICH (mpich, mvapich).
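For example (queue name taken from the text above; the -A value is arbitrary):

```shell
# Print the pe_list field plus the two following lines, in case the
# list of parallel environments wraps over multiple lines
qconf -sq mps.q | grep -A 2 pe_list
```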

If you want to run X MPI processes using OpenMPI, you would use the command line option

-pe openmpi [X]

and run your job using the mpirun script with the -np [X] command line option (or -n [X], depending on the MPI implementation) and the -rmk sge option to get the MPI /hosts file/ automatically from the cluster engine:

mpirun -np [X] -rmk sge /path/to/your/

This would distribute X processes across as many nodes as necessary.

As with OpenMP, resource requests are per slot; e.g. -pe openmpi 4 -l m_mem_free=4G reserves 4 slots and a total of 16 GB of memory, distributed across nodes in the same way your slots are.
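A minimal MPI job script, following the flags described above, might look like this sketch (the job name, slot count, and binary path are assumptions):

```shell
#!/bin/bash
#$ -N mpi_test            # job name (illustrative)
#$ -pe openmpi 16         # 16 MPI slots, possibly spread across nodes
#$ -l m_mem_free=2G       # per slot: 16 x 2 GB = 32 GB in total
#$ -cwd                   # run from the submission directory

# -rmk sge fetches the hosts file from the cluster engine automatically
mpirun -np 16 -rmk sge ./my_mpi_program   # binary name is illustrative
```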

Hybrid jobs: MPI+OpenMP

Commonly in complex codes, both approaches above are combined: largely independent processes are launched using MPI parallelisation, and each of them in turn uses OpenMP threading to accelerate some computations.

To make use of this hybrid approach, we need to /under-subscribe/ MPI slots. To understand this, let’s take a look at the technical description of a parallel environment pe using qconf -sp [pe]; in particular, pay attention to the allocation_rule property.

The pe used above, openmpi, has a /fill-up/ allocation rule: it will sequentially send MPI processes into a node until all its slots are full, and then start requesting slots from the next node with available slots.

In contrast, the environments listed as openmpi_mixed_[X] have an allocation rule with value X; this means that if you wanted to launch Y MPI processes using

#$ -pe openmpi_mixed_[X] [n*X]
mpirun -np [Y] /path/to/your/

where n*X is the smallest multiple of X greater than or equal to Y, those Y processes would be allocated in chunks of size X, and each of those chunks would automatically be sent to as many nodes as necessary.

The interesting bit is that if n*X is even larger, the Y processes are spread over as many chunks as possible; e.g.

#$ -pe openmpi_mixed_4 20
mpirun -np 5 /path/to/your/

would allocate five (= 20/4) 4-slot chunks, on each of which a single MPI process runs. If your job uses OpenMP, each MPI process will have 4 cores available (the one used by the master process plus 3 free ones).

If you are sending more than one MPI process per chunk, you may want to limit the maximum number of OpenMP threads, so that the competing MPI processes within one chunk make balanced use of the resources. To set that, add export OMP_NUM_THREADS=[Z] before calling your job in the script. Ideally, since processes are not threaded all the time, Z should be somewhat larger than the number of slots available to each MPI process within its chunk.
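For instance (all numbers are illustrative): with openmpi_mixed_4 and two MPI processes per 4-slot chunk, each process nominally gets 2 slots, so one might slightly over-subscribe the threads:

```shell
#$ -pe openmpi_mixed_4 8            # two 4-slot chunks

export OMP_NUM_THREADS=3            # somewhat above 8/4 = 2 slots/process
mpirun -np 4 ./my_hybrid_program    # binary name is illustrative
```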

As always, resource requests are per slot; e.g.

#$ -pe openmpi_mixed_4 20
#$ -l m_mem_free=1G
mpirun -np 5 /path/to/your/

would reserve 20 × 1 GB in total, i.e. 4 GB for each of the 5 chunks.
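The per-slot accounting above can be sketched numerically (values taken from the example; the variable names are just for illustration):

```shell
SLOTS=20          # -pe openmpi_mixed_4 20
PER_SLOT_GB=1     # -l m_mem_free=1G
CHUNK_SIZE=4      # allocation_rule of openmpi_mixed_4

TOTAL_GB=$((SLOTS * PER_SLOT_GB))     # total memory reserved
CHUNKS=$((SLOTS / CHUNK_SIZE))        # number of chunks
PER_CHUNK_GB=$((TOTAL_GB / CHUNKS))   # memory per chunk
echo "$TOTAL_GB $CHUNKS $PER_CHUNK_GB"   # prints: 20 5 4
```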