Mercury Architecture and Usage Limits

Partitions and Limits

Mercury is made up of compute nodes with a variety of architectures and configurations. A partition is a collection of compute nodes that all have the same, or similar, architecture and configuration. While the standard partition will meet most users’ needs, we also offer specialized partitions for specific purposes. The long partition accommodates jobs that require a longer wall clock time. The highmem partition is suitable for jobs requiring more than 32 GB per core. The gpu_h100 partition should be used only for jobs that run on the Nvidia H100 GPU cards. The interactive partition is reserved for short interactive sessions. Currently, Mercury is configured with the following partitions:

Partition     Nodes            Cores             Mem-per-CPU            Wall clock
-----------   --------------   ---------------   --------------------   -----------------
standard      Def: 1  Max: 1   Def: 1  Max: 64   Def: 2GB   Max: 32GB   Def: 4h  Max: 7d
long          Def: 1  Max: 1   Def: 1  Max: 24   Def: 2GB   Max: 32GB   Def: 1d  Max: 30d
highmem       Def: 1  Max: 1   Def: 1  Max: 32   Def: 32GB  Max: 512GB  Def: 4h  Max: 2d
gpu_h100      Def: 1  Max: 1   Def: 1  Max: 28   Def: 2GB   Max: 242GB  Def: 4h  Max: 2d
interactive   Def: 1  Max: 1   Def: 1  Max: 2    Def: 2GB   Max: 32GB   Def: 2h  Max: 4h

To see a list of available partitions, use the sinfo command:

$ sinfo
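
To request a specific partition, pass --partition when submitting a job. Below is a minimal sketch of a batch-script header targeting the long partition; the job name and resource values are illustrative and must stay within the limits in the table above.

#!/bin/bash

#SBATCH --job-name=myjob        # illustrative job name
#SBATCH --partition=long        # assign the job to the "long" partition
#SBATCH --time=14-00:00:00      # request 14 days (partition max is 30d)
#SBATCH --cpus-per-task=8       # request 8 cores (partition max is 24)
#SBATCH --mem-per-cpu=4G        # request 4 GB per core (partition max is 32GB)

# ... commands to run ...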

Each user association has a concurrent service unit (SU) limit based on the account used to submit the job. See the table column labeled ‘Job Billing Factor’ for how to estimate the number of SUs a given job consumes. Note that jobs submitted to non-standard partitions can incur a higher billing factor and are also subject to per-partition SU limits, both of which reduce the number of jobs that are allowed to run concurrently.

Affiliation                     Concurrent Limits
-----------------------------   -----------------
Default (--account=basic)       2,000 SU
PhD (--account=phd)             320,000 SU
Collaborator (--account=pi-*)   320,000 SU
Faculty (--account=faculty)     320,000 SU

Partition   Job Billing Factor                            Partition Limits
---------   -------------------------------------------   ----------------
standard    max{1000 x Ncpus; 125 x GBmem}                N/A
long        max{1000 x Ncpus; 125 x GBmem}                N/A
highmem     max{1000 x Ncpus; 32 x GBmem}                 48,000 SU
gpu_h100    max{364 x Ncpus; 32 x GBmem; 11666 x Ngpus}   35,000 SU
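
As an illustration, assuming Ncpus is the number of CPUs requested and GBmem the total memory requested in GB: a standard-partition job requesting 2 CPUs and 8 GB of memory is billed max{1000 x 2; 125 x 8} = max{2000; 1000} = 2,000 SU, which by itself consumes a basic account’s entire concurrent limit.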

Note

Concurrent and partition limits are subject to change based on cluster usage.

If at any point you need to verify which accounts and QOS levels you have access to, you can view your associations:

$ sacctmgr show association where user=<BoothID> format=cluster,account%24,user%24,qos

Cluster                  Account                     User                  QOS
------- ------------------------ ------------------------ --------------------
mercury                      phd                <BoothID>               bronze
mercury                    basic                <BoothID>                 clay
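
To charge a job to a particular account (and thus a particular SU limit), pass the --account flag at submission time; for example, with a hypothetical script myjob.sh:

$ sbatch --account=phd myjob.sh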

Scratch Space

Mercury has 6 TB of shared scratch space in /scratch. This is the recommended place to store temporary job files such as transformed data, temporary logs, parallel job metadata, or intermediate output. It must not be used to store files or data unrelated to running jobs. We recommend placing your scratch files in a personal folder organized by job number: /scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID}/. Here, ${SLURM_JOB_USER} and ${SLURM_JOB_ID} are environment variables that you can reference in your job script. Your code should automatically delete the job-specific files and folder upon successful completion.

Note

The scratch directory is only available on the compute nodes, not on the front end nodes.

Warning

Delete scratch files as soon as they are no longer needed. All scratch files are automatically deleted 35 days after creation without notice. If scratch space fills up, the oldest files are deleted first, also without notice.

Below is an example of creating a temporary directory in /scratch and deleting it once the job finishes.

#!/bin/bash

#SBATCH --job-name=mystatajob  # name of job
#SBATCH --partition=standard   # assign the job to the "standard" partition

# create a new scratch directory for this job
scratch_dir="/scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID}"
mkdir -p "$scratch_dir"

# point Stata's temporary files at the scratch directory
export STATATMP="$scratch_dir"

# run the do-file in batch mode
dofile="choosevars.do"
/apps/bin/stataMP -b do "$PWD/$dofile"

# remove the scratch directory when done
rm -r "$scratch_dir"
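
Note that the rm at the end only runs if the script reaches it; if the do-file aborts mid-run, the scratch directory is left behind until the automatic 35-day cleanup. One way to guarantee cleanup on any exit is a shell trap, a minimal sketch you could place right after creating the directory:

# delete the scratch directory whenever the script exits, even on error
trap 'rm -rf "$scratch_dir"' EXIT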