Slurm Workload Manager
Overview
The lab machines are managed by a workload manager called Slurm. Slurm is widely used in the HPC community and runs on many supercomputers around the world, including many of the world's top 500.
Slurm is necessary as we have many users (> 200!) sharing the same few machines. Slurm allows us to manage the resources on the machines and ensure that everyone gets a fair share of them. It also allows you to get accurate performance measurements for your programs, as we can ensure that no other users are using the machine your job is running on at the same time.
Slurm Quickstart
Logging in
To use Slurm, you need to log in to one of the lab machines. See Accessing the lab machines for more details.
Getting information about the cluster
To get information about the cluster, you can use the sinfo command. This will show you the state of the cluster, including the number of nodes, "partitions" (groups of nodes), and the state of each node.
You might see something like this:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up 3:00:00 16 idle soctf-pdc-[004-008,012-016,018-021,023-024]
xs-4114 up 3:00:00 5 idle soctf-pdc-[004-008]
i7-7700 up 3:00:00 5 idle soctf-pdc-[012-016]
dxs-4114 up 3:00:00 2 idle soctf-pdc-[018-019]
i7-9700 up 3:00:00 2 idle soctf-pdc-[020-021]
xw-2245 up 3:00:00 2 idle soctf-pdc-[023-024]
Submitting a job
There are a few ways to submit a job.
Using the srun command
The simplest way is to use the srun command. This allows you to run a command on a node.
For example, to run the hostname command on a node, you can do:
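$ srun hostname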
You can also run a command on multiple nodes. For example, to run the hostname command on 2 nodes, you can do:
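$ srun --nodes=2 hostname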
You can also run a command on a specific partition. For example, to run the hostname command on 2 nodes in the xs-4114 partition, you can do:
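$ srun --nodes=2 --partition=xs-4114 hostname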
You can also ask for more time to run your job. By default, a job is given at most 1 minute to run if no time limit is specified; if you do specify a time limit, the maximum is 3 hours. For example, to run the sleep command for 2 minutes, you can do:
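$ srun --time=00:03:00 sleep 120
(Here the requested time limit is a little longer than the sleep itself, so the job is not killed by its own limit.)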
However, this is not the most common way to use Slurm because it is interactive: you need to stay logged in while the command runs, just like running a command on your own machine. This is inconvenient for long-running jobs, as you need to keep your terminal open for the entire duration of the job.
Using the sbatch command
The most common way is to use the sbatch command. This allows you to submit a job from the command line.
Let's first define a Slurm job script. A Slurm job script is a shell script that contains the commands that you want to run. For example, let's create a job script called myjob.sh that contains the following:
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --time=00:00:30
#SBATCH --output=myjob_%j.log
#SBATCH --error=myjob_%j.log
##SBATCH --partition=i7-7700
echo "Running job!"
echo "We are running on $(hostname)"
echo "Job started at $(date)"
# Actual "job"
sleep 5;
# This is useful to know (in the logs) when the job ends
echo "Job ended at $(date)"
Components of this job script
- #!/bin/bash - This is the shebang line. It tells the shell that this is a bash script.
- #SBATCH --job-name=myjob - This is the name of the job. This is useful for identifying your job in the queue.
- #SBATCH --nodes=1 - This is the number of nodes that you want to run your job on. In this case, we are asking for 1 node. Note that we allocate the entire node to your job when it is running - no other jobs (i.e., from other users or even yourself) will be running on that node!
- #SBATCH --ntasks=1 - This is the total number of tasks (processes) that you want to run. In this case, we are asking for 1 task.
- #SBATCH --mem=1gb - This is the amount of memory that you want to reserve for your job. In this case, we are asking for 1 GB of memory.
- #SBATCH --time=00:00:30 - This is the amount of time that you want to reserve for your job. In this case, we are asking for 30 seconds.
- #SBATCH --output=myjob_%j.log - This is the name of the output file. In this case, we are asking for the output file to be called myjob_<jobid>.log. The <jobid> will be replaced by the job ID of the job. This is useful for matching log files to jobs.
- #SBATCH --error=myjob_%j.log - Same as above, but for the error file.
- #SBATCH --partition=i7-7700 - This is the partition you would like to run your job on. This setting is commented out in the script above; remove one of the #s to enable it.
- This job script will run the sleep command for 5 seconds.
Submitting the job
To submit the job, you can do:
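$ sbatch myjob.sh
Submitted batch job 123456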
This will submit the job to Slurm, and Slurm will run the job when it is ready. You can check the status of the job using the squeue command:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 all myjob cristina R 0:05 1 soctf-pdc-004
You can see that the job is running on node soctf-pdc-004.
When the job is done, you can see that the job is no longer in the queue:
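$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)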
You can also see that the job has finished by looking at the output file. Since our job script sets --output=myjob_%j.log, the output file is named myjob_<jobid>.log; in this case, it is myjob_123456.log. You can see the contents of the output file using the cat command:
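$ cat myjob_123456.log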
You can also check sacct to see the status of your job:
$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
123456 myjob all students 20 COMPLETED 0:0
Note
We allocate the entire node to you for each job. That is, there are guaranteed to be no other jobs running on that node as long as your job is on it. Therefore, there is no need to pass flags such as --exclusive to Slurm.
Slurm Foundations
This section contains more fundamental details about Slurm.
Slurm Partitions
The lab machines are divided into partitions. Each partition is a group of machines that have similar characteristics. For example, the xs-4114 partition contains machines that have Intel Xeon Silver 4114 CPUs.
A partition is like a job queue. Note that although a node can belong to multiple partitions, there is only a single global job queue for all nodes.
If a job is submitted without specifying a partition, the job will be executed on a node in the default partition.
The default partition is "all". If you run the sinfo command, the partition name marked with * is the default partition.
There is a time limit for each partition. If your job exceeds the partition’s time limit, it will be killed and the state of the job will be set to TIMEOUT. There is also a default time limit, applied to jobs that do not specify one; in our cluster, the default time limit is 1 minute.
Fair Share and Priority
Slurm uses a queue to schedule the submitted jobs. If two or more jobs require the same node, then the job that gets to be scheduled first depends on the scheduling policy of Slurm.
The scheduling policy of Slurm is configured to use a fair-share policy. The job of a student x who has a higher usage of the cluster resources will have a lower priority to be scheduled than the job of a student y that has a lower usage of the cluster resources, regardless of the submission time of the job, i.e., the jobs are not scheduled based on FIFO order.
Currently, the cluster resources used by a job are defined as allocated_cpus * seconds, where:
* allocated_cpus is the total number of CPU cores on all the requested nodes of a job. For example, if a job executes on two nodes where the first node has 8 cores and the second node has 20 cores, the value of allocated_cpus will be 28.
* seconds is the elapsed time of the job.
So if that 28-core job ran for 100 seconds, it would have used 28 * 100 = 2800 units of cluster resources.
The cluster resource usage of a student is the total cluster resources used by all the jobs submitted by the student. The cluster resource usage is then used to compute the fair share value of a student. The fair share value will be used in job scheduling.
Note that Slurm "forgets" your usage after a while. This is to ensure that you can make mistakes but still get back your normal priority on the cluster after a short "cooldown" period.
Slurm Recipes for Common Tasks
Advanced sacct options
- By default, sacct only shows jobs that are within the current day.
- Use the -S <datetime> option to view jobs from before today. Specifying the -S <datetime> option will show all the jobs that are after the given datetime. For example, run sacct -S 2022-02-01 to view jobs that are after 2022-02-01T00:00.
- You can use the -o option of sacct to specify custom fields to display for your job. Refer to the documentation of sacct to view all available fields and how to specify the output format.
  - For example, to view the job ID, job name, partition, user, state, and time elapsed, you can do sacct -o jobid,jobname,partition,user,state,elapsed.
Slurm constraints
You can use the --constraint option to run jobs on nodes that satisfy certain constraints. This allows for more complex workflows. For example, to submit a job to run on two nodes of type xs-4114 and one node of type i7-7700 in the default partition:
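$ sbatch --nodes=3 --constraint="[xs-4114*2&i7-7700*1]" myjob.sh
This assumes the node types are exposed as Slurm node features with the same names as the partitions; the exact feature names are cluster-specific, so check with the teaching team if this does not work as expected.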
Killing jobs
You can kill a job using the scancel command. For example, to kill job 123456:
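$ scancel 123456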
Viewing your current fair share
You can view your current fair share using the sshare command. For example, to view your current fair share:
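$ sshare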
A lower fair share means a lower priority. You can compare this with the fair share of your friends to see who is using the cluster the most!
Debugging programs
If a job fails and you cannot figure out why (e.g., stdout/err not appearing), you can open a terminal on the node that the job was running on. To do this, you can use the srun command with the --pty option. For example, to open a terminal on a specific node soctf-pdc-005:
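$ srun --nodelist=soctf-pdc-005 --pty bash
Note that the default time limit (1 minute) applies to this interactive job as well, so pass a --time limit if you need a longer debugging session.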
Make sure to exit the terminal when you are done debugging by using the exit command. Remember: all the time you leave your terminal job open on that node is counted towards your fair share.