Job statistics¶
In addition to the standard Univa Grid Engine command-line utilities, we have an extensive stats site for graphical reporting on all aspects of the various nodes and queues. You can also see summaries of your previous jobs, to help you check that your configuration options are well suited to your jobs.
Personal job history page¶
You can check your personal job history (including resources requested vs. resources used) on the Personal Job History page. Values highlighted in red indicate that the job exceeded one of its requested resource limits and was killed as a result.
The key things to check here are:
- requested walltime vs. walltime used
- requested memory vs. memory used
- the exit status of the job
Exit Status
A non-zero exit status means that your job produced an error. It is important to check that your jobs exit with a status of zero.
Exit Status
Sometimes you can see an exit status of 0 even though your job failed. The two main causes of this are: a subsequent command exiting successfully (for example a final exit or echo "finished"), or your job containing sub-processes where the main process is not alerted when a sub-process fails.
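The first cause can be reproduced in a plain shell, outside the cluster; the failing command here (false) stands in for any real step in a job script.

```shell
#!/bin/bash
# A failing step followed by a successful command: the script as a whole
# exits 0, so the scheduler reports success even though a step failed.
false                       # this step fails (exit status 1)
echo "finished"             # this succeeds, so the overall status is 0

# To make a job script stop at the first failure instead, add near the top:
#   set -e           # exit immediately if any command fails
#   set -o pipefail  # a pipeline fails if any stage fails
```

With set -e in place, the script above would stop at the false line and exit with status 1, so the failure would be visible in your job history.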
Personal job history utility¶
You can check your personal job history (including resources requested vs. resources used) using the jobstats utility. By default it will show a list of your last 25 completed jobs. Usage information can be seen with the -h flag:
$ jobstats -h
USAGE: jobstats [ -a ] [ -b BEGIN_DATE ] [ -c ] [ -e END_DATE ] [ -f | -s ] [ -g | -m | -p "NODE PATTERN" ]
[ -h ] [ -i csv|tsv|ssv ] [ -j JOB_NUMBER[.TID[-TID] ] | -u USER ] [ -l ] [ -n JOBS ]
OPTIONS:
-a Show all jobs (no limit to output jobs)
-b Show jobs started after DATE 00:00:00 (DATE format is "DD/MM/YY")
-c Strip colours from output
-e Show jobs started before DATE 23:59:59 (DATE format is "DD/MM/YY")
-f Show only failed jobs. Can not be used together with -s option
-g Show only GPU jobs. Can not be used together with -m and -p options
-h Displays this help prompt and exits
-i Prepare list of jobs for import to CSV (comma separated), TSV (tab separated) or SSV (semicolon separated) format
-j Show JOB_NUMBER job with optional array task ID (TID) or array task range in numerical order
-l Show less fields (to fit screen when using large fonts: SUBMITTED and STARTED fields omitted)
-m Show only High Memory nodes jobs. Can not be used together with -g and -p options
-n Show last JOBS jobs. (Default: 25)
-p Show only nodes that matching pattern (wildcard allowed). Can not be used together with -g and -m options
-s Show only successful jobs. Can not be used together with -f option
-u Username to show jobs for. (Default: USERNAME)
From the output you can find out how resources were used during job execution.
$ jobstats
LAST 25 JOBS FOR USER abc123
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| JOB ID [TASK] | NAME | SUBMITTED | STARTED | ENDED | TIME REQ | DURATION | MEM R | MEM U | CORES | GPU | QUEUE | HOST | STATUS | EFF |
+---------------+------------+-------------------+-------------------+-------------------+-----------+-----------+-------+---------+-------+-----+---------+--------+--------+------+
| 222222 | QLOGIN | 28/10/21 09:30:16 | 28/10/21 09:30:17 | 28/10/21 10:30:18 | 01:00:00 | 1:00:01 | 1G | 0.01G | 1 | - | short.q | ddy102 | 137 | 26% |
| 222223 | QLOGIN | 28/10/21 10:38:34 | 28/10/21 10:38:35 | 28/10/21 11:09:18 | 01:00:00 | 0:30:43 | 1G | 0.01G | 1 | - | short.q | ddy40 | 0 | 75% |
| 222224.1 | trim_job_1 | 28/10/21 11:53:56 | 28/10/21 14:27:58 | 03/11/21 15:46:09 | 120:00:00 | 1:18:11 | 60G | 8.03G | 2 | - | all.q | smf2 | 0 | 77% |
| 222224.2 | trim_job_1 | 28/10/21 11:53:56 | 28/10/21 14:27:58 | 03/11/21 15:46:09 | 120:00:00 | 1:18:11 | 60G | 8.03G | 2 | - | all.q | smf2 | 0 | 77% |
| 222224.3 | trim_job_1 | 28/10/21 11:53:56 | 28/10/21 14:27:58 | 03/11/21 15:46:09 | 120:00:00 | 1:18:11 | 60G | 8.03G | 2 | - | all.q | smf2 | 0 | 77% |
| 222225 | net_job_te | 28/10/21 06:14:50 | 28/10/21 06:14:51 | 28/10/21 09:27:53 | 72:00:00 | 3:13:02 | 320G | 0.22G | 5 | - | all.q | srm2 | 0 | 20% |
| 222226 | runscript | 28/10/21 17:59:41 | 28/10/21 17:59:42 | 18/11/21 17:59:43 | 240:00:00 | 240:00:01 | EXCL | 0.09G | P 384 | - | all.q | ddy74 | 137 | 99% |
| 222227 | research_j | 28/10/21 13:47:06 | 28/10/21 20:51:53 | 28/10/21 21:46:17 | 01:00:00 | 0:54:24 | 32G | 0.55G | 32 | - | short.q | ddy36 | 0 | 96% |
| 222228 | evil_thing | 28/10/21 11:15:08 | 28/10/21 00:17:12 | 28/10/21 01:06:21 | 0:49:09 | 02:00:00 | 60G | 4.71G | 8 | 1 | all.q | sbg1 | 0 | ~12% |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
If your terminal is set to display output in a larger font, you can use the -l option to reduce the width of the output table (the SUBMITTED and STARTED fields are omitted) so that it fits the screen.
$ jobstats -l -n 5
LAST 5 JOBS FOR USER abc123
---------------------------------------------------------------------------------------------------------------------------------------------
| JOB ID [TASK] | NAME | ENDED | TIME REQ | DURATION | MEM R | MEM U | CORES | GPU | QUEUE | HOST | STATUS | EFF |
+---------------+------------+-------------------+-----------+-----------+-------+---------+-------+-----+---------+--------+--------+------+
| 222224.3 | trim_job_1 | 03/11/21 15:46:09 | 120:00:00 | 1:18:11 | 60G | 8.03G | 2 | - | all.q | smf2 | 0 | 77% |
| 222225 | net_job_te | 28/10/21 09:27:53 | 72:00:00 | 3:13:02 | 320G | 0.22G | 5 | - | all.q | srm2 | 0 | 20% |
| 222226 | runscript | 18/11/21 17:59:43 | 240:00:00 | 240:00:01 | EXCL | 0.09G | P 384 | - | all.q | ddy74 | 137 | 99% |
| 222227        | research_j | 28/10/21 21:46:17 | 01:00:00  | 0:54:24   | 32G   | 0.55G   | 32    | -   | short.q | ddy36  | 0      | 96%  |
| 222228 | evil_thing | 28/10/21 01:06:21 | 0:49:09 | 02:00:00 | 60G | 4.71G | 8 | 1 | all.q | sbg1 | 0 | ~12% |
---------------------------------------------------------------------------------------------------------------------------------------------
Short fields reference¶
JOB ID [TASK] - Job ID number with array task ID (array jobs only)
NAME - Job name
SUBMITTED - Date and time when the job was submitted
STARTED - Date and time when the job started execution
ENDED - Date and time when the job completed execution
TIME REQ - Job runtime requested
DURATION - Actual job runtime (wallclock)
MEM R - Total memory requested
MEM U - Maximum amount of memory used during execution
CORES - Number of cores requested
GPU - Number of GPUs requested (- is shown if no GPU was requested)
QUEUE - The queue which accepted the job
HOST - The node which executed the job (only the master node is shown for multi-node parallel jobs)
STATUS - Job exit status
EFF - How efficiently the cores were utilised. For GPU jobs this value cannot be used to assess efficiency, so a tilde (~) is shown in front of the value.
Checking the statistics of an individual job¶
The qacct command can give useful resource usage information on completed jobs. The qacct -j <jobid> command is the most useful for checking exit status, memory usage, queue time, submission command and walltime.
RAM usage in qacct
The ru_maxrss field in the qacct command output displays the actual memory usage in GiB. We also provide the job-ram-usage -j <jobid> command to quickly see the real memory usage of a completed job.
You can also query jobs over a given period, for example, to display detailed output of every job run by user abc123 in the last 7 days:
qacct -d 7 -o abc123 -j
Omitting the -j will give a summary of resources used:
$ qacct -d 7 -o abc123
OWNER WALLCLOCK UTIME STIME CPU MEMORY IO IOW
===================================================================================
abc123 8240 2844.621 49.198 2906.81 1352.935 2.356 0.380
Walltime¶
Walltime is the length of time the job takes to execute. This does not include the time spent waiting in the job queue. If the job runs over its requested walltime it will be killed by the scheduler.
Currently the maximum walltime allowed on the standard queues is 10 days. If you need more time than this you will need to implement checkpointing in your code: save the state of your job at regular intervals, so that the job can be restarted from the point at which it stopped.
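A minimal sketch of the checkpointing idea in shell, assuming the work can be divided into numbered steps; the checkpoint.txt file name and step count are illustrative, not a site convention:

```shell
#!/bin/bash
# Checkpoint/restart sketch: record the last completed step in a file,
# and on restart resume from the step after it.
CKPT=checkpoint.txt
TOTAL=10

last=0                                  # no checkpoint yet: start from step 1
[ -f "$CKPT" ] && last=$(cat "$CKPT")

for step in $(seq $((last + 1)) "$TOTAL"); do
    # ... do the real work for this step here ...
    echo "completed step $step"
    echo "$step" > "$CKPT"              # save progress after each step
done
```

If the job is killed mid-run, resubmitting the same script picks up from the last recorded step instead of starting over.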
Maximum runtime
The maximum walltime of a job is 10 days to allow for planned system maintenance and updates. This is a global setting, therefore exceptions for individual jobs cannot be made. National and Regional HPC clusters use much shorter walltimes, measured in hours.
Memory usage¶
Jobs running over their memory limit will be killed by the scheduler. The maximum limit is defined by the physical memory on a compute node.
If your job is killed for breaching its requested memory limit, it is important to understand why. If a job you have run before is now suddenly failing due to excessive memory usage, the cause is most likely a bug in the application. However, if it is a new job, it may require some tweaking to find the ideal amount of memory to request.
See the tuning page for assistance with finding the correct memory requirements for your job.
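As an illustration, a job script header that makes the memory request explicit; the h_vmem value, core count and parallel environment name (smp) below are placeholders, and whether h_vmem is interpreted per core is site-specific:

```shell
#!/bin/bash
#$ -l h_vmem=4G     # memory request (placeholder value; adjust for your job)
#$ -pe smp 2        # core count (placeholder; the PE name is an assumption)
#$ -l h_rt=1:0:0    # walltime request

# Grid Engine reads the #$ lines as directives; to the shell they are
# comments, so the script body below just runs the workload.
echo "job body runs here"
```

A sensible starting point is to request slightly more than the peak MEM U value reported by jobstats for a comparable previous job, then refine from there.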
Job exit status¶
Cluster jobs which ran successfully will exit with code 0. Non-zero exit codes indicate that there was a problem during execution and a command did not run successfully. A few common non-zero exit codes are listed below with the recommended action to take before resubmitting the job.
Code | Error Description | Recommended Action
---|---|---
1 | Application error | Miscellaneous errors, such as "divide by zero" and other impermissible operations. Check the job output file for errors, e.g. an invalid parameter
2 | Misuse of shell built-ins | Missing keyword or command, or a permission problem. Check the job output file for errors, e.g. module load issues
126 | Command invoked cannot execute | You are trying to execute a command that cannot be executed. Check the output file for errors
127 | Command not found | You are trying to execute a command that cannot be found. Check the output file for errors
135 | SIGBUS - Access to an undefined portion of a memory object | Increase the h_vmem value
137 | SIGKILL - Job was killed | Increase the h_vmem value and ensure the maximum runtime is requested
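Codes 126 and 127 can be reproduced directly in a plain shell, which can help when interpreting a job's output file; the command and file names below are made up for the demonstration:

```shell
#!/bin/bash
# Exit code 127: the command does not exist anywhere on PATH.
no_such_command_xyz 2>/dev/null
echo "status: $?"       # prints: status: 127

# Exit code 126: the file exists but has no execute permission.
tmpfile=$(mktemp)       # mktemp creates the file without execute permission
echo 'echo hello' > "$tmpfile"
"$tmpfile" 2>/dev/null
echo "status: $?"       # prints: status: 126
rm -f "$tmpfile"
```

Seeing one of these codes therefore usually points at a typo in a command name, a missing module load, or a missing chmod +x on a script, rather than at the scheduler.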
If you are unsure about an error or exit code, you may contact us for assistance.