Running Programs on Mercury
===========================

When connecting to Mercury, a user is directed to a `login` node. The `login` node is useful for viewing your home directory or for submitting computational tasks to Mercury's `compute` nodes. The login node should never be used to run computational tasks directly.

.. _Slurm: http://www.slurm.schedmd.com/documentation.html

Mercury uses the `Slurm`_ scheduler to manage jobs. This page serves as a general outline of how to run jobs using Slurm on Mercury. For more application-specific tips, make sure to also visit our `recipes`_ page.

.. _recipes: https://hpc-docs.chicagobooth.edu/applications.html

Mercury Accounts
----------------

Before running any jobs on the Mercury cluster, all users must be associated with an `account`. The scheduler uses the account information to prioritize each job and to ensure fair-share usage among all users. If you do not know which account to use, please contact us at Research.Support@chicagobooth.edu. In the email, please let us know who is sponsoring your research on Mercury, whether that is a Booth professor or a Booth research center (e.g. the Kilts Center or the Fama-Miller Center).

Interactive Login
-----------------

Requesting an interactive session on Mercury can be accomplished using Slurm. This interactive session will persist until you disconnect from the compute node, or until you reach the maximum allocated time. The general steps for running a program interactively are the following (a complete example session is sketched after the notes below):

#. `Log in`_ to Mercury (optionally with `X11 forwarding`_)
#. Request an interactive session on a compute node: :code:`srun --account=<account_name> --pty bash --login`
#. `Load the modules`_ with the desired software
#. Start your program
#. When finished, quit the interactive session by typing :code:`exit`

.. _Log in: connecting.html#connecting-to-mercury
.. _X11 forwarding: connecting.html#x11-forwarding
.. _Load the modules: modules.html#available-software-on-mercury

.. note::
    You can add more parameters to your srun command if appropriate. For example:
    :code:`srun --account=phd --mem=2G --partition=standard --pty bash --login`

.. note::
    To avoid resource allocation errors, make sure you do not nest your interactive sessions! You should normally only request interactive sessions from a login node (i.e. mfe01 or mfe02).
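As a concrete sketch of these steps, the session below strings them together. It assumes the ``phd`` account and the ``python/booth/3.10`` module that appear elsewhere on this page, and a hypothetical script ``my_script.py``; substitute your own account, software, and program.

.. code-block:: console

    # on a login node (mfe01 or mfe02): request an interactive shell on a compute node
    $ srun --account=phd --pty bash --login

    # now on the compute node: load software and run your program
    $ module load python/booth/3.10
    $ python3 my_script.py      # my_script.py is a hypothetical placeholder for your own program

    # release the allocation when finished
    $ exit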
Submitting Batch Jobs
---------------------

The :code:`sbatch` command is the one most commonly used to request computing resources on Mercury. Rather than specifying all the options on the command line, users typically write a submission script that contains all the commands and parameters necessary to run the program on the cluster. In a submission script, all Slurm parameters are declared with :code:`#SBATCH`, followed by additional definitions.

Here is an example of a submission script:

.. code-block:: bash
    :caption: submit.sh
    :linenos:

    #!/bin/bash

    #---------------------------------------------------------------------------------
    # Account information
    #SBATCH --account=phd              # basic (default), phd, faculty, pi-<name>

    #---------------------------------------------------------------------------------
    # Resources requested
    #SBATCH --partition=standard       # standard (default), long, gpu, mpi, highmem
    #SBATCH --cpus-per-task=1          # number of CPUs requested (for parallel tasks)
    #SBATCH --mem=2G                   # requested memory
    #SBATCH --time=0-04:00:00          # wall clock limit (d-hh:mm:ss)

    #---------------------------------------------------------------------------------
    # Job specific name (helps organize and track progress of jobs)
    #SBATCH --job-name=my_batch_job    # user-defined job name

    #---------------------------------------------------------------------------------
    # Print some useful variables
    echo "Job ID: $SLURM_JOB_ID"
    echo "Job User: $SLURM_JOB_USER"
    echo "Num Cores: $SLURM_JOB_CPUS_PER_NODE"

    #---------------------------------------------------------------------------------
    # Load necessary modules for the job
    module load <module_name>

    #---------------------------------------------------------------------------------
    # Commands to execute below...

Typing :code:`sbatch submit.sh` at the command line will submit the batch job to the scheduler.

.. note::
    Although lines 2-27 are optional, their use is highly recommended. Omitting the SBATCH parameters will cause jobs to be scheduled with the lowest priority and will allocate limited resources to your jobs.

If at any moment you need to verify which accounts and QOS you have access to, you may view your *association*:

.. code-block:: console

    $ sacctmgr show association where user=johndoe
    Cluster    Account       User       QOS   Def QOS
    ------- ---------- ---------- --------- ---------
    mercury        mba    johndoe      clay      clay

Submitting an Array of Jobs
!!!!!!!!!!!!!!!!!!!!!!!!!!!

It is sometimes necessary to submit a collection of similar jobs. This can be accomplished using job arrays with the :code:`--array` option. In the example below, four separate jobs will be launched by the scheduler, each with a unique :code:`$SLURM_ARRAY_TASK_ID`.

.. code-block:: bash
    :caption: submit.sh
    :linenos:

    #!/bin/bash
    #SBATCH --array=0-3

    # Load the software module
    module load python/booth/3.10

    # Access the unique array task ID using the environment variable
    echo "Array ID: $SLURM_ARRAY_TASK_ID"
    srun python3 -c "import os; x=os.environ['SLURM_ARRAY_TASK_ID']; print(x)"
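A common extension of the example above is to use the task ID to give each array task its own input. The sketch below shows one way this might look; the input files (``input_0.txt`` through ``input_3.txt``), the ``process.py`` script, and the ``submit_array.sh`` file name are hypothetical placeholders for your own data and code.

.. code-block:: bash
    :caption: submit_array.sh

    #!/bin/bash
    #SBATCH --array=0-3
    #SBATCH --job-name=array_example   # user-defined job name

    # Load the software module (same module as in the example above)
    module load python/booth/3.10

    # Map this task's ID to one of four assumed input files: input_0.txt ... input_3.txt
    INPUT_FILE="input_${SLURM_ARRAY_TASK_ID}.txt"

    # process.py is a hypothetical user script; replace it with your own program
    srun python3 process.py "$INPUT_FILE"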
Managing Jobs
-------------

The Slurm job scheduler provides several command-line tools for checking on the status of your jobs and for managing them. For a complete list of Slurm commands, see the Slurm man pages. Here are a few commands that you may find particularly useful:

:`srun`_: Obtain a job allocation and execute an application
:`sbatch`_: Submit a batch script to the Slurm scheduler
:`sacct`_: Retrieve job history and information about past jobs
:`scancel`_: Cancel jobs you have submitted
:`squeue`_: Find out the status of queued jobs
:`sinfo`_: View information about Slurm nodes and partitions

.. In the next couple of sections we explain how to use :code:`squeue` to find out the status of your submitted jobs, and :code:`scancel` to cancel jobs in the queue.

.. _srun: https://slurm.schedmd.com/srun.html
.. _sbatch: https://slurm.schedmd.com/sbatch.html
.. _sacct: https://slurm.schedmd.com/sacct.html
.. _scancel: https://slurm.schedmd.com/scancel.html
.. _squeue: https://slurm.schedmd.com/squeue.html
.. _sinfo: https://slurm.schedmd.com/sinfo.html

Checking your jobs
~~~~~~~~~~~~~~~~~~

Use the :code:`squeue` command to check on the status of your jobs and other jobs running on Mercury. The simplest invocation lists all jobs that are currently running or waiting in the job queue ("pending"), along with details about each job such as the job ID and the number of nodes requested:

.. code-block:: console

    $ squeue
    JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
      924  standard  probe_A vargaslo  R   0:35      1 mcn06
      925  standard  probe_B vargaslo  R   0:34      1 mcn06
      926  standard  survey1   hchyde  R   0:32      1 mcn06
      927  standard  probe_C vargaslo  R   0:30      1 mcn06
      928  standard  survey2   hchyde PD   0:00      1 (Resources)

Any job with 0:00 under the TIME column is still waiting in the queue. In the case above, there are not enough resources to run all jobs, so job 928 is waiting; :code:`(Resources)` means that the job will start once resources free up. Another common reason for a job to wait in the queue is that other jobs have higher priority, which is listed as :code:`(Priority)`.

To view only the jobs that you have submitted, use the :code:`--user` flag:

.. code-block:: console

    $ squeue --user=hchyde
    JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
      926  standard  survey1   hchyde  R   0:32      1 mcn06
      928  standard  survey2   hchyde PD   0:00      1 (Resources)

This command has many other useful options for querying the status of the queue and getting information about individual jobs. For example, to get information about all jobs that are waiting to run on the standard partition, enter:

.. code-block:: console

    $ squeue --state=PENDING --partition=standard

Alternatively, to get information about all jobs that are running on the standard partition, type:

.. code-block:: console

    $ squeue --state=RUNNING --partition=standard

The last column of the output tells us which nodes are allocated to each job. For more information, consult the command-line help by typing :code:`squeue --help`, or visit the official online `documentation`_.

.. _documentation: http://www.slurm.schedmd.com/squeue.html

Canceling your jobs
~~~~~~~~~~~~~~~~~~~

To cancel a job you have submitted, use the :code:`scancel` command. This requires you to specify the ID of the job you wish to cancel. For example, to cancel a job with ID 8885128, do the following:

.. code-block:: console

    $ scancel 8885128

If you are unsure of the ID of the job you would like to cancel, see the JOBID column in the output of :code:`squeue --user=<username>`.

To cancel all jobs you have submitted that are either running or waiting in the queue, enter the following:

.. code-block:: console

    $ scancel --user=<username>

Viewing Past Jobs
~~~~~~~~~~~~~~~~~

Jobs that have completed can be viewed using :code:`sacct`. This can be useful, for example, to see how much memory a job consumed.

.. code-block:: console

    # view job with job ID 993
    $ sacct -j 993

    # view the amount of memory used during job execution
    $ sacct -j <jobid> --format=User,MaxRSS,MaxVMSize,JobName,Partition,Start,End

    # view all jobs started after a specific date for a specific user
    $ sacct --starttime=2018-06-10 --user=<username>
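As one more hedged example, the invocation below uses a few standard ``sacct`` fields (generic Slurm fields, not Mercury-specific settings) to compare the memory a job requested with the peak memory it actually used, which can help when tuning the ``--mem`` value in your submission script.

.. code-block:: console

    # compare requested memory (ReqMem) with peak memory actually used (MaxRSS)
    $ sacct -j <jobid> --format=JobID,JobName,Elapsed,ReqMem,MaxRSS,State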