Single core jobs
By default, if you do not specify a particular number of slots for your job, you will be allocated just 1 slot and your job will be automatically bound to it:
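The examples below invoke a worker.sh script with a runtime in seconds and a thread count as arguments. That script is not shown here; the following is a minimal sketch of what it might look like (purely illustrative, not the site's actual script):

```shell
#!/bin/bash
# Hypothetical sketch of the worker.sh used in these examples:
# arg 1 = runtime in seconds, arg 2 = number of worker "threads" to spawn.
duration=${1:-1}
nthreads=${2:-1}
for i in $(seq "$nthreads"); do
  # each worker is a busy loop that timeout terminates after $duration seconds
  timeout "$duration" sh -c 'while :; do :; done' &
done
wait
echo "ran ${nthreads} workers for ${duration}s"
```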
```
[login_node] $ qsub worker.sh 120 1
Your job 2615 ("worker.sh") has been submitted
[login_node] $ ssh node052
[node052] $ top -u myuser
  PID USER    %CPU COMMAND
18768 myuser   100 work
```
Beware that even if your job is multi-threaded, it will still be restricted to just 1 slot. For example, if you run 2 threads but forget to request 2 slots:
```
[login_node] $ qsub worker.sh 120 2
Your job 2614 ("worker.sh") has been submitted
[login_node] $ qstat -u myuser
job-ID  name       user    state  queue                            slots
2614    worker.sh  myuser  r      firstname.lastname@example.org   1
[execution_node] $ top -u myuser
  PID USER    %CPU
18769 myuser  50.0
18770 myuser  50.0
```
Only 1 slot is allocated, so both threads are restricted to run on the same slot.
Array of single-threaded tasks
Analogously, if you submit an array job, you will be allocated just 1 slot per task:
```
[login_node] $ qsub -t 1-2 -l h=node052 worker.sh 120 1
Your job 2614 ("worker.sh") has been submitted
[login_node] $ qstat -u myuser
job-ID  name       user    state  queue                            slots  ja-task-ID
2614    worker.sh  myuser  r      email@example.com                1      1
2614    worker.sh  myuser  r      firstname.lastname@example.org   1      2
[login_node] $ ssh node052
[node052] $ top -u myuser
  PID USER    %CPU COMMAND
18767 myuser   100 work
18768 myuser   100 work
```
Each task will be bound to 1 slot. If a task tries to run 2 or more threads, all of that task's threads will be bound to its single slot.
To find out which slots your tasks are bound to:
```
[login_node] $ qstat -j 2614 | grep binding
binding:      set linear:1
binding 1:    node052.cm.cluster=0,0
binding 2:    node052.cm.cluster=1,0
```
which reads: task 1 runs on node052, socket 0, CPU 0, and so on.
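Inside an array job, each task can find its own index in the SGE_TASK_ID environment variable, which the scheduler sets to the value from the range given to qsub -t. A minimal sketch of a per-task script (the input-file naming scheme is purely illustrative):

```shell
#!/bin/bash
# Sketch of an array-task script: the scheduler exports SGE_TASK_ID
# with this task's index. The input naming below is an assumption
# for illustration only.
TASK=${SGE_TASK_ID:-1}
INPUT="input_${TASK}.dat"
echo "task ${TASK} would process ${INPUT}"
```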
Multi-threaded jobs
In order to actually be allocated multiple slots, you should request them using a parallel environment (pe).
If your job is multi-threaded, or if it is a shared-memory parallel application, you can make use of the openmp parallel environment to request multiple slots:
```
qsub -pe openmp 4 myjob.sh
```
where 4 here is the number of slots you need for your job.
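Within a job granted a parallel environment, the scheduler exports the number of granted slots in the NSLOTS environment variable, so a common pattern is to size the OpenMP thread pool from it. A sketch of a submission script (the program name is a placeholder):

```shell
#!/bin/bash
#$ -pe openmp 4        # request 4 slots in the openmp parallel environment
#$ -cwd                # run from the submission directory

# NSLOTS is set by the scheduler to the number of slots actually granted,
# so the OpenMP runtime spawns exactly as many threads as we were given.
export OMP_NUM_THREADS=${NSLOTS:-1}
echo "running with ${OMP_NUM_THREADS} threads"
# ./my_openmp_program   # placeholder for the real multi-threaded binary
```

Sizing the thread count from NSLOTS (rather than hard-coding it) keeps the job from oversubscribing its binding if the slot request ever changes.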
To request 2 slots for a multi-threaded job, you do:
```
[login_node] $ qsub -pe openmp 2 -l h=node052 worker.sh 1200 2
[login_node] $ qstat -u myuser
job-ID  name       user    state  queue              slots
2618    worker.sh  myuser  r      email@example.com  2
[login_node] $ qstat -j 2618 | grep binding
binding:      set linear:2
binding 1:    node052.cm.cluster=0,0:0,1
```
The job is now bound to 2 slots, as requested.
If the job tries to run 3 or more threads, all of them will be bound to the 2 slots already granted.
Array of multi-threaded tasks
Sometimes you might want to submit an array of parallel tasks. In that case, the requested slots are allocated on a per-task basis too.
Suppose you want to request 2 slots per task on a 2-tasks array job:
```
[login_node] $ qsub -t 1-2 -pe openmp 2 -l h=node052 worker.sh 1200 2
[login_node] $ qstat -u myuser
job-ID  name       user    state  queue                            slots  ja-task-ID
2619    worker.sh  myuser  r      firstname.lastname@example.org   2      1
2619    worker.sh  myuser  r      email@example.com                2      2
[login_node] $ qstat -j 2619 | grep binding
binding:      set linear:2
binding 1:    node052.cm.cluster=1,0:1,1
binding 2:    node052.cm.cluster=0,2:0,3
```
The first task is running on node052, socket 1, CPUs 0 and 1, whereas the second task is running on node052, socket 0, CPUs 2 and 3.
Memory requests
Memory requests can be made at 2 levels:
1. Main memory level: by using the m_mem_free resource request, which controls usage of the physical memory
2. Virtual memory level: by using the h_vmem resource request, which controls the total amount of memory usage (m_mem_free + swap space)
By default, i.e. if not requested otherwise, each slot is allocated 2GB of main memory and 2.5GB of virtual memory. If the process running on 1 slot goes beyond 2GB, it will be swapped out, and if it goes beyond 2.5GB it will simply be killed and the job or task will fail.
In other words, virtual memory (h_vmem) is always a hard limit: jobs will be killed when they try to use more than the amount granted.
Requesting more memory
You can request more memory by explicitly defining m_mem_free, either within your submission script or on the submission line:
```
qsub -l m_mem_free=4G myjob.sh
```
In this case virtual memory h_vmem will be automatically limited to h_vmem=1.25*m_mem_free=5G. If the process exceeds 4G then it will be swapped out, whereas if the process exceeds 5G it will be killed.
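The 25% rule can be reproduced with a little shell arithmetic; this sketch just computes h_vmem = 1.25 * m_mem_free for a request given in GB, working in MiB to stay in integers:

```shell
# Reproduce the default limit h_vmem = 1.25 * m_mem_free.
# Integer arithmetic in MiB avoids floating point in the shell.
m_mem_free_gb=4
h_vmem_mb=$(( m_mem_free_gb * 1024 * 125 / 100 ))
echo "m_mem_free=${m_mem_free_gb}G -> h_vmem=${h_vmem_mb}M"   # 4G -> 5120M, i.e. 5G
```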
Always use units of memory in your request: you can use [K,M,G,k,m,g] for kB, MB, GB, kiB, MiB or GiB, respectively. If no units are given, the job will be rejected.
If you specify a particular amount of main memory m_mem_free as explained above, the system will always allocate 25% extra grace space in swap, regardless of the virtual memory requested.
Using swap has performance implications, though: using hard disk as RAM is so slow that the CPUs will spend most of their time waiting for I/O. If you want to avoid this behaviour, you can request virtual memory exclusively:
```
qsub -l h_vmem=4G myjob.sh
```
That way, the system will automatically limit the main memory to m_mem_free=h_vmem; the job won't be swapped out, but it will be killed as soon as its memory usage goes beyond 4GB.
In general, memory requests are applied per slot for multi-threaded jobs and per task for array jobs.
If you want to request 25G of main memory per task on a 3-task array job then you can do the following:
```
qsub -t 1-3 -l m_mem_free=25G myjob.sh
```
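The same request can live inside the submission script itself as #$ directives; a sketch combining the array and memory options (the script body is illustrative):

```shell
#!/bin/bash
#$ -t 1-3                 # 3-task array job
#$ -l m_mem_free=25G      # 25G of main memory per task
#$ -cwd

# SGE_TASK_ID is set by the scheduler; the default here is only so the
# sketch also runs outside the cluster.
TASK=${SGE_TASK_ID:-1}
echo "task ${TASK} has a 25G main-memory limit"
# ./my_memory_hungry_program   # placeholder for the real workload
```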
All the rules described above for single-threaded jobs also apply here with respect to memory limits, but on a per-task basis.
Again, memory requests are applied on a per-slot basis.
For example, say you want to submit a 2-slot parallel job and you know each thread will need at least 8GB of RAM. You then request -l m_mem_free=8G but, inadvertently, each thread uses 9GB of RAM:
```
[login_node] $ qsub -l m_mem_free=8G -pe openmp 2 -l h=node052 mem_pe.sh 9g
[login_node] $ ssh node052
[node052] $ top -u myuser
  PID USER    VIRT     RES     %CPU  %MEM  COMMAND
16365 myuser  9445732  8.232g   2.7  13.1  memhog
16364 myuser  9445732  8.089g   2.3  12.9  memhog
```
This job will be bound to 2 slots, and each slot will have a physical memory limit of 8GB. By default, though, the virtual memory limit will be 10GB per slot (remember that h_vmem=1.25*m_mem_free by default). The job will then try to fill up 9GB of RAM per slot: it will use 8GB of main memory (see RES) and 1GB will be swapped out per thread.
RES in the top output stands for resident set size and accounts for main-memory usage. VIRT, on the other hand, accounts for the amount of memory the process has reserved but not necessarily used (not to be confused with h_vmem, which is the total amount of memory allocated by the scheduler).
The following job, however, would fail:
```
[login_node] $ qsub -l h_vmem=8G -pe openmp 2 -l h=node052 mem_pe.sh 9g
Your job 2627 ("mem_pe.sh") has been submitted
[login_node] $ qacct -j 2627 | grep exit
exit_status  137
```
Here the hard limit equals the main-memory limit (m_mem_free=h_vmem=8G) and the job tries to fill up 9GB per thread, so the scheduler kills it by sending signal number 9 (SIGKILL), which is reported as exit status 128 + signal number = 137.
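The reported code follows the usual Unix convention for processes killed by a signal; this one-liner reproduces it for SIGKILL:

```shell
# exit_status = 128 + signal number; SIGKILL is signal 9, hence 137
sig=9
status=$(( 128 + sig ))
echo "exit_status ${status}"   # -> exit_status 137
```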
Job classes
Job classes are created by the cluster admins and contain preset default parameters for your job, in exactly the same form you would pass to qsub on the command line. For example, a job class might specify a default queue (-q serial.q), a runtime (-l h_rt=1:00:00), and so on.
By submitting your job by specifying a job class, your job inherits all of the job class’ default parameters. If you specify any of the same parameters on the command line as well, you will override the job class’ defaults.
```
# This job will use the mps.short job class
qsub -jc mps.short -q mps.q ...
```
Checking resource usage of past jobs
To determine how much memory or time you need to request, it can help to look at a past job you have submitted and see what resources it actually used on the cluster. You can do this with the cluster's accounting tool, qacct.
```
# Here the tool is called giving the job ID as an argument
qacct -j 6444841
==============================================================
qname        admin.q
hostname     node108.cm.cluster
group        root
owner        root
project      NONE
department   defaultdepartment
jobname      hepspec
jobnumber    6444841
taskid       undefined
account      sge
priority     0
cwd          /lustre/scratch/sysadmin/hepspec
submit_host  feynman.cm.cluster
submit_cmd   qsub -q admin.q@node108 hepspec.job
qsub_time    08/13/2015 13:20:56.645
start_time   08/13/2015 13:20:57.074
end_time     08/13/2015 19:17:04.598
granted_pe   NONE
slots        1
failed       0
deleted_by   NONE
exit_status  255
ru_wallclock 21367.524
ru_utime     1265488.522
ru_stime     4358.245
ru_maxrss    494720
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    543962721
ru_majflt    6458
ru_nswap     0
ru_inblock   38280
ru_oublock   122482440
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     5256562
ru_nivcsw    14159484
cpu          1269846.767
mem          153438.316
io           824.185
iow          0.000
maxvmem      42.640G
maxrss       24.257G
maxpss       23.980G
arid         undefined
jc_name      NONE
```
So we can see that the job ran for nearly 6 hours, or 21367.524 seconds as measured by ru_wallclock, and it used 24.257G of memory at its peak (maxrss).
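The qacct times are raw seconds; converting ru_wallclock to hours, as done above, is just a division:

```shell
# Convert ru_wallclock (seconds) from the qacct output above into hours.
awk 'BEGIN { printf "%.2f hours\n", 21367.524 / 3600 }'   # -> 5.94 hours
```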