R

R is a widely used programming language for statistical computing, with a large ecosystem of statistical packages. Users interested in the RStudio development environment may find the RStudio cheatsheets useful.

Serial R

Interactive Mode

  1. SSH to the Mercury login node: ssh <BoothID>@mercury.chicagobooth.edu

  2. Request an interactive session on a compute node: srun --pty bash --login (see the example after this list for requesting specific resources)

  3. Load the module with the desired version of R: module load R/4.3/4.3.2

  4. Start the R interpreter by typing: R
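By default, srun --pty bash --login starts an interactive session with the scheduler's default resources. If your interactive work needs more memory, time, or cores, the standard Slurm options can be added to the srun command. A sketch with illustrative values that you should adjust to your job (the account name matches the batch example below):

srun --account=phd --cpus-per-task=2 --mem=4G --time=0-02:00:00 --pty bash --login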

Batch Mode

To run an R job in batch mode, create a submission script similar to the one below and submit it by typing sbatch submit.sh.

submit.sh
#!/bin/bash

#SBATCH --account=phd
#SBATCH --mem=1G
#SBATCH --time=0-01:00:00
#SBATCH --job-name=example_job

# Load the module with the desired version of R
module load R/4.3/4.3.2

# run Rscript (output will be written to slurm-<jobid>.out)
srun Rscript myscript.R
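Here, myscript.R stands for whatever R script you want to run; a minimal sketch could look like this:

myscript.R
# Fit a simple linear model on a built-in dataset and print the summary
fit <- lm(mpg ~ wt, data = datasets::mtcars)
print(summary(fit))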

Parallel R

The following is an example project for running a parallel job in R on Mercury, using a foreach loop with the doParallel package as the “parallel backend”. It runs a simple calculation in both serial and parallel mode and then calculates the speedup relative to the number of cores. Follow this basic strategy to measure the efficiency of your parallel program and select an appropriate number of cores: more cores are not always better, because communication overhead and other factors cause the relative speedup to fall off as the core count grows, in line with Amdahl’s Law.
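In Amdahl’s Law terms, if a fraction p of the work can be parallelized, the best possible speedup on N cores is 1 / ((1 - p) + p/N); with p = 0.9, for example, the speedup can never exceed 10x no matter how many cores you request.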

Note

The use of parallel::detectCores() is discouraged when setting up a parallel job on Mercury: it reports every core on the node, regardless of how many your job was actually allocated. Use parallelly::availableCores() instead, which respects the scheduler’s allocation and ensures that your program does not oversubscribe the CPU cores.

# Main script that uses parallelization across multiple cores on a single node
# The number of slots (cores) allocated by the scheduler is passed to this script

# Example adapted from doParallel documentation:
# https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf

# Load libraries
library(doParallel)

# Create cluster of multiple cores
Nslots <- parallelly::availableCores()
print(sprintf("%d slots were allocated", Nslots))
cl <- parallel::makeCluster(Nslots)
doParallel::registerDoParallel(cl)

# Do parallel work
# Use %dopar% for parallel or %do% for serial processing of foreach loop
x <- datasets::iris[which(datasets::iris[,5] != "setosa"), c(1,5)]
trials <- 10000
ptime <- system.time({
  r <- foreach(icount(trials), .combine=cbind) %dopar% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- stats::glm(x[ind,2]~x[ind,1], family=binomial(logit))
    stats::coefficients(result1)
  }
})
ptime

# Run the foreach loop in serial (to test speedup)
x <- datasets::iris[which(datasets::iris[,5] != "setosa"), c(1,5)]
stime <- system.time({
  r <- foreach(icount(trials), .combine=cbind) %do% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- stats::glm(x[ind,2]~x[ind,1], family=binomial(logit))
    stats::coefficients(result1)
  }
})
stime

# Speedup analysis
speedup <- stime[3] / ptime[3]
print(sprintf("Speedup was %0.2f fold using %d slots", speedup, Nslots))
print(sprintf("Parallel efficiency was %0.1f percent", speedup/Nslots*100))

# Stop cluster
print("Finished.")
parallel::stopCluster(cl)
quit(save="no")
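To run this example as a batch job, save the code as an R script and submit it with a script along the lines of the sketch below. The core count, memory, and file names (submit_parallel.sh, parallel_example.R) are illustrative; parallelly::availableCores() picks up the number of cores granted by --cpus-per-task.

submit_parallel.sh
#!/bin/bash

#SBATCH --account=phd
#SBATCH --mem=4G
#SBATCH --time=0-01:00:00
#SBATCH --job-name=parallel_example
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Load the module with the desired version of R
module load R/4.3/4.3.2

# Run the parallel script, forwarding the per-task CPU count to the job step
srun --cpus-per-task=$SLURM_CPUS_PER_TASK Rscript parallel_example.R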