The more accurately you can specify the required resources (memory, running time, local scratch needs, …) of your jobs, the better the job scheduler can serve your needs and often your jobs will be processed sooner. For instance, if you have a good sense of the amount of memory and run time your job needs, then you can specify these via SGE resource options
h_rt. If you don’t specify them, your job will use the default settings.
If you don’t know how much resources your job consumes, you can add
qstat -j $JOB_ID to the end of your job script. This will output a summary of your job to the job output log. Here is an example of a job that runs R, draws 100 million random numbers and calculates their summary statistics. We also call the garbage collector 1,000 times to emulate some CPU processing to give the scheduler enough time to snapshot the job. At the end, we output the job summary.
#!/bin/env bash #$ -cwd ## use current working directory #$ -j yes ## merge stdout and stderr ## Summarize 100 million random numbers in R, which occupies ## 100e6 * 8 bytes = 0.80 GB of RAM. ## We also run the garbage collector 1,000 times to ## emulate some CPU processing time Rscript -e "x <- rnorm(100e6); for (i in 1:1e3) gc(); summary(x)" ## End-of-job summary, if running as a job [[ -n "$JOB_ID" ]] && qstat -j "$JOB_ID"
As a first guess, we can assume that this script takes at most 5 minutes to run, but let’s assume we don’t have a good sense on how much memory it will consume, so we submit it as:
$ qsub -l h_rt=00:05:00 job_summary.sge Your job 2854740 ("job_summary.sge") has been submitted
When the job completes, we can find the resources as part of the output file:
$ grep "usage" job_summary.sge.o2854740 usage 1: cpu=00:00:14, mem=6.82412 GB s, io=0.00903 GB, vmem=810.203M, maxvmem=810.203M
The full details are available at the end.
With this information, we can narrow down that the total processing time was 14 seconds (
cpu=00:00:14) and that the maximum amount of virtual memory used was ~810 MB (
maxvmem=810.203M). With the help of
maxvmem from previous runs, we can re-submit this job script with more relevant resource specifications;
$ qsub -l h_rt=00:01:00 -l mem=1G job_summary.sge
Comment: Note that the
mem value has unit
GB s (GB * seconds), which is because it is the “accumulated memory usage of the job in Gbytes seconds”.
Here is an example of the job summary outputted from the above job script:
$ cat job_summary.sge.o2854740 Min. 1st Qu. Median Mean 3rd Qu. Max. -5.379004 -0.674312 0.000100 0.000069 0.674541 6.314641 ============================================================== job_number: 2854740 exec_file: job_scripts/2854740 submission_time: Fri Nov 2 22:32:57 2018 owner: alice uid: 59999 group: boblab gid: 34001 sge_o_home: /wynton/home/alice sge_o_log_name: alice sge_o_path: /opt/sge/bin:/opt/sge/bin/lx-amd64:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/wynton/home/alice/.local/bin:/wynton/home/alice/bin sge_o_shell: /bin/bash sge_o_workdir: /wynton/home/alice/ex sge_o_host: dev1 account: sge cwd: /wynton/home/alice/ex merge: y hard resource_list: h_rt=300 mail_list: alice@dev1 notify: FALSE job_name: job_summary.sge jobshare: 0 hard_queue_list: !gpu.q env_list: TERM=NONE script_file: job_summary.sge project: boblab binding: NONE job_type: NONE usage 1: cpu=00:00:14, mem=6.82412 GB s, io=0.00903 GB, vmem=810.203M, maxvmem=810.203M binding 1: NONE scheduling info: queue instance "long.q@msg-test3" dropped because it is temporarily not available queue instance "long.q@msg-iogpu14" dropped because it is temporarily not available ... queue instance "long.q@qb3-test3" dropped because it is disabled queue instance "long.q@qb3-test7" dropped because it is disabled ... queue instance "gpu.q@msg-iogpu12" dropped because it is full queue instance "gpu.q@msg-ihgpu2" dropped because it is full queue instance "gpu.q@msg-iogpu13" dropped because it is full
Sometimes your job “just dies”. There is often a simply explanation to this but finding out why can be complicated at first, especially if there are no clues in the job log files.
One common reason for jobs terminating early is that it ran out of the requested runtime (
-l h_rt=<runtime>). When that happens, the log files appear to have ended abruptly (which is not too far from the truth). It is tempting to try us
qstat -j <jobid> to find out more;
$ qstat -j 191442 Following jobs do not exist: 191442
Unfortunately, that does not work because the job no longer exist. Instead, we have to go look into the SGE logs. More specifically, we can scan the SGE “accounting” file, which records summaries of all jobs, for information on our terminated job. The SGE accounting file is huge so we don’t want to scan all of it. Instead, we search only the end of it but making a best guess of what “end of it” means. Below, we start at the 100,000 last rows and scan for our job. If no output is produced, try to increase the number of lines scanned.
$ tail -100000 /opt/sge/wynton/common/accounting | qacct -f - -j 191442 ============================================================== qname long.q hostname qb3-id95 group boblab owner alice project boblab department defaultdepartment jobname run.sge jobnumber 191442 taskid undefined account sge priority 19 qsub_time Wed Feb 12 20:59:18 2020 start_time Wed Feb 12 21:01:50 2020 end_time Wed Feb 12 22:01:50 2020 granted_pe smp slots 10 failed 37 : qmaster enforced h_rt, h_cpu, or h_vmem limit exit_status 137 (Killed) ru_wallclock 3600s ru_utime 1.373s ru_stime 0.883s ru_maxrss 58.508KB ru_ixrss 0.000B ru_ismrss 0.000B ru_idrss 0.000B ru_isrss 0.000B ru_minflt 194382 ru_majflt 44 ru_nswap 0 ru_inblock 26278 ru_oublock 563 ru_msgsnd 0 ru_msgrcv 0 ru_nsignals 0 ru_nvcsw 10787 ru_nivcsw 316 cpu 12195.120s mem 4.030TBs io 379.956GB iow 0.000s maxvmem 20.610GB arid undefined ar_sub_time undefined category -u alice -q !gpu.q -l h_rt=3600,mem_free=25G -pe smp 10
When looking at this output, we first focus on the lines:
failed 37 : qmaster enforced h_rt, h_cpu, or h_vmem limit exit_status 137 (Killed)
First of all, the
exit_status line is not zero (
0); any software with an exit code other than zero indicates that something went wrong. It could be due to an error (typically
exit_status = 1), or as here
137 with suggests that the job was “killed”. If we look at
failed, we se that some rules were enforced, which in our case suggests that the rule for resource
h_rt was enforced.
Next, if we look at:
ru_wallclock 3600s ru_utime 1.373s ru_stime 0.883s
We see that the job maxed out at a
ru_wallclock runtime at 3600 seconds, i.e. 1 hour. This is indeed the maximum runtime requested, which we can infer from:
category -u alice -q !gpu.q -l h_rt=3600,mem_free=25G -pe smp 10
Conclusion: Job 191442 was killed by the job scheduler because it ran out of its requested runtime.
You can read about all
qacct output fields and what they mean in
man sge_status and