Running Programs on Mercury

When connecting to Mercury, a user is directed to a login node. The login node should never be used to run any computational tasks. Instead, jobs can be submitted from the login node to be run on Mercury’s compute nodes.

Mercury uses the Slurm scheduler to manage jobs. This page serves as a general outline of how to run jobs using Slurm on Mercury. For more application-specific tips, make sure to also visit our recipes page.

Interactive Login

Requesting an interactive session on Mercury can be accomplished using Slurm. This interactive session will persist until you disconnect from the compute node, or until you reach the maximum allocated time.

The general steps for running a program interactively are the following:

  1. Log in to Mercury (optionally with X11 forwarding)

  2. Request an interactive session on a compute node: srun --pty bash --login

  3. Load the modules with the desired software

  4. Start your program

  5. When finished, quit the interactive session by typing exit

Note

You can add more parameters to your srun command if appropriate. For example: srun --account=phd --mem=2G --partition=standard --pty bash --login

Note

To avoid resource allocation errors, make sure you do not nest your interactive sessions! You should normally only request interactive sessions from a login node (i.e. mfe01 or mfe02).
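One way to guard against accidental nesting is to check for the SLURM_JOB_ID environment variable, which Slurm sets inside the jobs it launches. The following is a minimal sketch; the function names in_slurm_job and interactive are hypothetical helpers, not part of Slurm or of Mercury's configuration.

```shell
# Succeeds (exit status 0) when this shell is already running inside a Slurm job.
# Slurm sets SLURM_JOB_ID in the environment of the jobs it launches.
in_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

# Hypothetical wrapper (not part of Slurm): start an interactive session only
# when we are not already inside one. Extra srun options are passed through.
interactive() {
    if in_slurm_job; then
        echo "Already inside job ${SLURM_JOB_ID}; refusing to nest sessions." >&2
        return 1
    fi
    srun "$@" --pty bash --login
}
```

With these definitions in your shell startup file, typing interactive --mem=2G on a login node would request a session, while the same command typed inside an existing session would print a warning and do nothing.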

Submitting Batch Jobs

The sbatch command is the most common way to request computing resources on Mercury. Rather than specifying all the options on the command line, users typically write a submission script that contains all the commands and parameters necessary to run the program on the cluster.

In a submission script, each Slurm parameter is declared on a line that begins with #SBATCH, followed by the option and its value.

Here is an example of a submission script:

submit.sh
 1#!/bin/bash
 2
 3#---------------------------------------------------------------------------------
 4# Account information
 5
 6#SBATCH --account=phd              # basic (default), staff, phd, faculty
 7
 8#---------------------------------------------------------------------------------
 9# Resources requested
10
11#SBATCH --partition=standard       # standard (default), long, gpu, mpi, highmem
12#SBATCH --cpus-per-task=1          # number of CPUs requested (for parallel tasks)
13#SBATCH --mem=2G                   # requested memory
14#SBATCH --time=0-04:00:00          # wall clock limit (d-hh:mm:ss)
15
16#---------------------------------------------------------------------------------
17# Job specific name (helps organize and track progress of jobs)
18
19#SBATCH --job-name=my_batch_job    # user-defined job name
20
21#---------------------------------------------------------------------------------
22# Print some useful variables
23
24echo "Job ID: $SLURM_JOB_ID"
25echo "Job User: $SLURM_JOB_USER"
26echo "Num Cores: $SLURM_JOB_CPUS_PER_NODE"
27
28#---------------------------------------------------------------------------------
29# Load necessary modules for the job
30
31module load <modulename>
32
33#---------------------------------------------------------------------------------
34# Commands to execute below...
35
36<commands>

Typing sbatch submit.sh at the command line will submit the batch job using the scheduler.
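When a job is accepted, sbatch prints a confirmation line of the form "Submitted batch job <id>". The sketch below shows one way to capture that id for later use with squeue or scancel. The helper name job_id_from_sbatch is hypothetical, and the parsing assumes the single-cluster message format, where the id is the last word.

```shell
# Extract the job id from sbatch's confirmation line,
# assumed here to have the form "Submitted batch job <id>".
job_id_from_sbatch() {
    # ${1##* } removes everything up to and including the last space.
    printf '%s\n' "${1##* }"
}

# Usage on the cluster (commented out because it requires Slurm):
# jobid=$(job_id_from_sbatch "$(sbatch submit.sh)")
# squeue --jobs="$jobid"
```

Slurm also offers sbatch --parsable, which prints the job id without the surrounding text; the helper above is only meant to illustrate what the default message contains.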

Note

Although lines 2–27 are optional, their use is highly recommended. Omitting the #SBATCH parameters will cause jobs to be scheduled with the lowest priority and with only limited resources allocated to them.
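The --time parameter in the script above uses the d-hh:mm:ss format. As a worked example of that format, the following sketch converts a duration in seconds into the corresponding string. The function name slurm_time is a hypothetical helper introduced only for illustration.

```shell
# Convert a number of seconds into Slurm's d-hh:mm:ss wall clock format.
slurm_time() {
    total=$1
    days=$(( total / 86400 ))              # 86400 seconds per day
    hours=$(( (total % 86400) / 3600 ))    # whole hours left after the days
    mins=$(( (total % 3600) / 60 ))        # whole minutes left after the hours
    secs=$(( total % 60 ))                 # remaining seconds
    printf '%d-%02d:%02d:%02d\n' "$days" "$hours" "$mins" "$secs"
}

slurm_time 14400    # prints 0-04:00:00, the four-hour limit used in submit.sh
```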

If at any moment you need to verify which accounts and QOS (quality of service) levels you have access to, you can view your associations:

$ sacctmgr show association where user=johndoe

Cluster    Account       User       QOS   Def QOS
------- ---------- ---------- --------- ---------
mercury        mba    johndoe      clay      clay

Submitting an Array of Jobs

It is sometimes necessary to submit a collection of similar jobs. This can be accomplished by passing array indices to the sbatch command with the --array option, for example: sbatch --array=0,1,4 submit.sh or sbatch --array=0-10 submit.sh. Each task in the array receives its own index in the SLURM_ARRAY_TASK_ID environment variable, which the submission script can pass on to a program such as a Python script:

submit.sh
#!/bin/bash

# Load the software module
module load python/booth/3.6/3.6.12

# Pass the array index to my program of choice
echo "Array ID: $SLURM_ARRAY_TASK_ID"
srun python3 myscript.py $SLURM_ARRAY_TASK_ID
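A common use of the array index is to select a different input for each task. The sketch below maps a zero-based index to one entry of a list. The helper input_for_task and the file names are hypothetical, and the index defaults to 0 so that the snippet can also be tried outside of Slurm.

```shell
# Print the candidate at the given zero-based index.
# $1 is the index; the remaining arguments are the candidates.
input_for_task() {
    idx=$1
    shift
    for candidate in "$@"; do
        if [ "$idx" -eq 0 ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
        idx=$(( idx - 1 ))
    done
    return 1    # index out of range
}

# Slurm sets SLURM_ARRAY_TASK_ID for array tasks; fall back to 0 elsewhere.
task_id=${SLURM_ARRAY_TASK_ID:-0}
input=$(input_for_task "$task_id" alpha.csv beta.csv gamma.csv)
echo "Task $task_id would process $input"
```

Submitted with sbatch --array=0-2, each of the three tasks would pick a different file from the list.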

Managing Jobs

The Slurm job scheduler provides several command-line tools for checking on the status of your jobs and for managing them. For a complete list of Slurm commands, see the Slurm man pages. The most commonly used commands are summarized below:

srun

Obtain a job allocation and execute an application.

sbatch

Submit a batch script to the Slurm scheduler.

sacct

Retrieve job history and information about past jobs.

scancel

Cancel jobs you have submitted.

squeue

Find out the status of queued jobs.

sinfo

View information about Slurm nodes and partitions.

Use the squeue command to check on the status of your jobs, and other jobs running on Mercury. The simplest invocation lists all jobs that are currently running or waiting in the job queue (“pending”), along with details about each job such as the job id and the number of nodes requested:

$ squeue

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  924  standard  probe_A vargaslo  R       0:35      1 mcn06
  925  standard  probe_B vargaslo  R       0:34      1 mcn06
  926  standard  survey1   hchyde  R       0:32      1 mcn06
  927  standard  probe_C vargaslo  R       0:30      1 mcn06
  928  standard  survey2   hchyde PD       0:00      1 (Resources)

Any job in the PD (pending) state, shown with 0:00 under the TIME column, is still waiting in the queue. In the above case, there are not enough resources to run all jobs, so job 928 is waiting. The reason (Resources) means that the job is waiting for resources to become available. Another common reason is that other jobs have higher priority; this is listed as (Priority).
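For a busy queue it can help to summarize the listing instead of reading it line by line. The sketch below counts jobs per state (the ST column) using awk. The helper name count_by_state is hypothetical, and the sample text is copied from the squeue listing above with the header line removed.

```shell
# Sample squeue output (header removed), copied from the listing above.
sample='  924  standard  probe_A vargaslo  R       0:35      1 mcn06
  925  standard  probe_B vargaslo  R       0:34      1 mcn06
  926  standard  survey1   hchyde  R       0:32      1 mcn06
  927  standard  probe_C vargaslo  R       0:30      1 mcn06
  928  standard  survey2   hchyde PD       0:00      1 (Resources)'

# Count jobs per state; the state is the fifth whitespace-separated field.
count_by_state() {
    awk '{ count[$5]++ } END { for (s in count) print s, count[s] }' | sort
}

printf '%s\n' "$sample" | count_by_state
# prints:
# PD 1
# R 4
```

On the cluster, the same function could be fed live data with squeue --noheader | count_by_state, assuming the default squeue column layout.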

To view only the jobs that you have submitted, use the --user flag:

$ squeue --user=hchyde

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  926  standard  survey1   hchyde  R       0:32      1 mcn06
  928  standard  survey2   hchyde PD       0:00      1 (Resources)

This command has many other useful options for querying the status of the queue and getting information about individual jobs. For example, to get information about all jobs that are waiting to run on the standard partition, enter:

$ squeue --state=PENDING --partition=standard

Similarly, to get information about all jobs that are running on the standard partition (add --user=<username> to restrict the list to your own jobs), type:

$ squeue --state=RUNNING --partition=standard

The last column of the output tells us which nodes are allocated for each job. For more information, consult the command-line help by typing squeue --help, or visit the official online documentation.

To cancel a job you have submitted, use the scancel command. This requires you to specify the id of the job you wish to cancel. For example, to cancel a job with id 8885128, do the following:

$ scancel 8885128

If you are unsure of the id of the job you would like to cancel, check the JOBID column in the output of squeue --user=<username>.
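Looking up an id in the JOBID column can also be scripted. The sketch below prints the ids of all jobs with a given name, reading squeue-style lines from standard input. The helper name job_ids_by_name is hypothetical, the sample lines are taken from the squeue --user=hchyde listing shown earlier, and the default column layout (job id first, name third) is assumed.

```shell
# Sample squeue output for one user (header removed).
sample='  926  standard  survey1   hchyde  R       0:32      1 mcn06
  928  standard  survey2   hchyde PD       0:00      1 (Resources)'

# Print the JOBID (field 1) of every line whose NAME (field 3) matches $1.
# Note: the default squeue layout truncates long job names.
job_ids_by_name() {
    awk -v name="$1" '$3 == name { print $1 }'
}

printf '%s\n' "$sample" | job_ids_by_name survey2    # prints 928

# Usage on the cluster (commented out because it requires Slurm):
# squeue --noheader --user=<username> | job_ids_by_name survey2
```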

To cancel all jobs you have submitted that are either running or waiting in the queue, enter the following:

$ scancel --user=<username>

Jobs that have completed can be viewed using sacct. This can be useful, for example, to see how much memory a job consumed.

# view job with jobid 993
$ sacct -j 993

# view amount of memory used during job execution
$ sacct -j <jobID> --format=User,MaxRss,MaxVMSize,Jobname,partition,start,end

# view all jobs started after a specific date for a specific user
$ sacct --starttime=2018-06-10 --user=<BoothID>
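Memory figures such as MaxRSS are typically reported with a unit suffix, for example 2048K. To compare them against the value passed to --mem, it can be convenient to convert them to a single unit. The sketch below converts such a value to mebibytes. The helper name to_mib is hypothetical, and it assumes that the suffixes K, M and G denote powers of 1024; values without one of these suffixes are rejected rather than guessed at.

```shell
# Convert a memory value with a K, M or G suffix into MiB (one decimal place).
# Assumption: the suffixes are 1024-based. Other inputs make the function fail.
to_mib() {
    awk -v v="$1" 'BEGIN {
        n = v + 0                      # leading numeric part, e.g. 2048 from "2048K"
        u = substr(v, length(v), 1)    # last character is the unit suffix
        if (u == "K") n = n / 1024
        else if (u == "G") n = n * 1024
        else if (u != "M") exit 1      # unknown or missing suffix
        printf "%.1f\n", n
    }'
}

to_mib 2048K    # prints 2.0
to_mib 1G       # prints 1024.0
```

sacct also accepts a --units option that asks it to report values in a chosen unit directly, which may make a conversion like this unnecessary.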