Running Programs on Mercury
When connecting to Mercury, a user is directed to a login node. The login node is useful for viewing your home directory and for submitting computational tasks to Mercury’s compute nodes. The login node should never be used to run computational tasks directly.
Mercury uses the Slurm scheduler to manage jobs. This page serves as a general outline of how to run jobs using Slurm on Mercury. For more application-specific tips, make sure to also visit our recipes page.
Mercury Accounts
Before running any jobs on the Mercury cluster, all users must be associated with an account.
The scheduler uses the account information to prioritize each job and ensure there is fair share usage among all users.
To view your accounts, copy and paste the following line into your terminal:
sacctmgr show association where user=$USER -nP Format=account
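The -n and -P flags suppress the header and make the output parsable, so the command should print only the account name(s) associated with your user, one per line. For example (the account shown below is purely illustrative; yours will reflect your own sponsorship):
$ sacctmgr show association where user=$USER -nP Format=account
phd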
If you have not yet been assigned an account to use, please contact us at Research.Support@chicagobooth.edu. In the email, please let us know who is sponsoring your research on Mercury, whether a Booth professor or a Booth research center (e.g. the Kilts Center or the Fama-Miller Center).
Interactive Login
Requesting an interactive session on Mercury can be accomplished using Slurm. This interactive session will persist until you disconnect from the compute node, or until you reach the maximum allocated time.
The general steps for running a program interactively are the following:
1. Log in to Mercury (optionally with X11 forwarding).
2. Request an interactive session on a compute node:
   srun --account=<accountname> --pty bash --login
3. Load the modules providing the desired software.
4. Start your program.
5. When finished, quit the interactive session by typing:
   exit
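Putting these steps together, a typical interactive session might look like the following sketch (the account name phd, the module python/booth/3.10, and the script name myscript.py are placeholders; substitute your own account, module, and program):
# from a login node, request an interactive shell on a compute node
srun --account=phd --pty bash --login

# on the compute node: load your software and start your program
module load python/booth/3.10
python3 myscript.py

# return to the login node when finished
exit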
Note
You can add more parameters to your srun command if appropriate.
For example: srun --account=phd --mem=2G --partition=standard --pty bash --login
Note
To avoid resource allocation errors, make sure you do not nest your interactive sessions! You should normally only request interactive sessions from a login node (i.e. mfe01 or mfe02).
Submitting Batch Jobs
The sbatch command is the most commonly used command for requesting computing resources on Mercury. Rather than specifying all the options on the command line, users typically write a submission script that contains all the commands and parameters necessary to run the program on the cluster. In a submission script, all Slurm parameters are declared on lines beginning with #SBATCH, followed by the relevant option definitions.
Here is an example of a submission script:
 1  #!/bin/bash
 2
 3  #---------------------------------------------------------------------------------
 4  # Account information
 5
 6  #SBATCH --account=phd              # basic (default), phd, faculty, pi-<account>
 7
 8  #---------------------------------------------------------------------------------
 9  # Resources requested
10
11  #SBATCH --partition=standard       # standard (default), long, gpu, mpi, highmem
12  #SBATCH --cpus-per-task=1          # number of CPUs requested (for parallel tasks)
13  #SBATCH --mem=2G                   # requested memory
14  #SBATCH --time=0-04:00:00          # wall clock limit (d-hh:mm:ss)
15
16  #---------------------------------------------------------------------------------
17  # Job-specific name (helps organize and track progress of jobs)
18
19  #SBATCH --job-name=my_batch_job    # user-defined job name
20
21  #---------------------------------------------------------------------------------
22  # Print some useful variables
23
24  echo "Job ID: $SLURM_JOB_ID"
25  echo "Job User: $SLURM_JOB_USER"
26  echo "Num Cores: $SLURM_JOB_CPUS_PER_NODE"
27
28  #---------------------------------------------------------------------------------
29  # Load necessary modules for the job
30
31  module load <modulename>
32
33  #---------------------------------------------------------------------------------
34  # Commands to execute below...
35
36  <commands>
Typing sbatch submit.sh at the command line will submit the batch job to the scheduler.
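When the job is accepted, sbatch prints the ID assigned to the job (the job ID shown here is illustrative):
$ sbatch submit.sh
Submitted batch job 993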
Note
Although lines 2–27 are optional, their use is highly recommended. Omitting the #SBATCH parameters will cause your jobs to be scheduled with the lowest priority and to be allocated only limited resources.
If at any point you need to verify which accounts and QOS you have access to, you can view your association:
$ sacctmgr show association where user=johndoe
Cluster Account User QOS Def QOS
------- ---------- ---------- --------- ---------
mercury mba johndoe clay clay
Submitting an Array of Jobs
It is sometimes necessary to submit a collection of similar jobs.
This can be accomplished using job arrays with the --array option. In the example below, four separate jobs will be launched by the scheduler, each with a unique $SLURM_ARRAY_TASK_ID.
#!/bin/bash

#SBATCH --array=0-3

# Load the software module
module load python/booth/3.10

# Access the unique array task ID using the environment variable
echo "Array ID: $SLURM_ARRAY_TASK_ID"
srun python3 -c "import os; x=os.environ['SLURM_ARRAY_TASK_ID']; print(x)"
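A common use of job arrays is to give each task its own input. The sketch below, in which the input files data_0.csv through data_3.csv and the script process.py are hypothetical, uses the array task ID to select a per-task input file:
#!/bin/bash

#SBATCH --array=0-3

# hypothetical inputs: data_0.csv, data_1.csv, data_2.csv, data_3.csv
INPUT="data_${SLURM_ARRAY_TASK_ID}.csv"

module load python/booth/3.10
srun python3 process.py "$INPUT"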
Managing Jobs
The Slurm job scheduler provides several command-line tools for checking on the status of your jobs and for managing them. For a complete list of Slurm commands, see the Slurm man pages. The most common commands can be summarized as follows:
- srun: obtain a job allocation and execute an application
- sbatch: submit a batch script to the Slurm scheduler
- sacct: retrieve job history and information about past jobs
- scancel: cancel jobs you have submitted
- squeue: find out the status of queued jobs
- sinfo: view information about Slurm nodes and partitions
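Most of these commands are covered in more detail below. sinfo, which is not shown elsewhere on this page, can be run on its own for a quick overview of partitions and node states; the --summarize flag condenses the output to one line per partition:
$ sinfo
$ sinfo --summarize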
Use the squeue command to check on the status of your jobs and of other jobs running on Mercury. The simplest invocation lists all jobs that are currently running or waiting in the queue (“pending”), along with details about each job such as the job ID and the number of nodes requested:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
924 standard probe_A vargaslo R 0:35 1 mcn06
925 standard probe_B vargaslo R 0:34 1 mcn06
926 standard survey1 hchyde R 0:32 1 mcn06
927 standard probe_C vargaslo R 0:30 1 mcn06
928 standard survey2 hchyde PD 0:00 1 (Resources)
Any job with 0:00 under the TIME column is still waiting in the queue. In the case above, there are not enough resources to run all of the jobs, so job 928 is waiting in the queue. The reason (Resources) means that the job is waiting for resources to become available. Another common reason for a job to wait in the queue is that other jobs have higher priority; this is listed as (Priority).
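For pending jobs, you can also ask Slurm for its current estimate of when they will start (the estimate depends on the scheduler configuration and is not always available):
$ squeue --start --user=<username>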
To view only the jobs that you have submitted, use the --user flag:
$ squeue --user=hchyde
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
926 standard survey1 hchyde R 0:32 1 mcn06
928 standard survey2 hchyde PD 0:00 1 (Resources)
This command has many other useful options for querying the status of the queue and getting information about individual jobs. For example, to get information about all jobs that are waiting to run on the standard partition, enter:
$ squeue --state=PENDING --partition=standard
Alternatively, to get information about all your jobs that are running on the standard partition, type:
$ squeue --state=RUNNING --partition=standard
The last column of the output tells us which nodes are allocated for each job.
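You can also customize which columns squeue prints with the --format option. The example below is one possible format string (field codes are documented in the squeue man page), printing the job ID, name, state, elapsed time, and reason or node list:
$ squeue --user=<username> --format="%.10i %.20j %.8T %.10M %R"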
For more information, consult the command-line help by typing squeue --help, or visit the official online documentation.
To cancel a job you have submitted, use the scancel command. This requires you to specify the ID of the job you wish to cancel. For example, to cancel a job with ID 8885128, do the following:
$ scancel 8885128
If you are unsure of the ID of the job you would like to cancel, check the JOBID column in the output of squeue --user=<username>.
To cancel all jobs you have submitted that are either running or waiting in the queue, enter the following:
$ scancel --user=<username>
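scancel can also filter by job state. For example, to cancel only your jobs that are still waiting in the queue while leaving running jobs untouched:
$ scancel --user=<username> --state=PENDING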
Jobs that have completed can be viewed using sacct. This can be useful, for example, to see how much memory a job consumed.
# view job with jobid 993
$ sacct -j 993
# view amount of memory used during job execution
$ sacct -j <jobID> --format=User,MaxRss,MaxVMSize,Jobname,partition,start,end
# view all jobs started after a specific date for a specific user
$ sacct --starttime=2018-06-10 --user=<BoothID>
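The --format option accepts many other fields. For instance, the following sketch, built from standard sacct field names, reports the elapsed time, final state, and exit code of a job:
# view elapsed time, state, and exit code for a job
$ sacct -j <jobID> --format=JobID,JobName,Elapsed,State,ExitCode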