Slurm Workload Manager
Overview
The lab machines are managed by a workload manager called Slurm. Slurm is widely used in the HPC community and runs on many supercomputers around the world, including many of the world's top 500.
Slurm is necessary as we have many users (> 200!) sharing the same few machines. Slurm allows us to manage the resources on the machines and ensure that everyone gets a fair share of them. It also allows you to get accurate performance measurements for your programs, as we can ensure that no other users are using the machine your job is running on at the same time.
Slurm Quickstart
Logging in
To use Slurm, you need to log in to one of the lab machines. See Accessing the lab machines for more details.
Getting information about the cluster
To get information about the cluster, you can use the sinfo command. This will show you the state of the cluster, including the number of nodes, "partitions" (groups of nodes), and the state of each node.
You might see something like this:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up 3:00:00 16 idle soctf-pdc-[004-008,012-016,018-021,023-024]
xs-4114 up 3:00:00 5 idle soctf-pdc-[004-008]
i7-7700 up 3:00:00 5 idle soctf-pdc-[012-016]
dxs-4114 up 3:00:00 2 idle soctf-pdc-[018-019]
i7-9700 up 3:00:00 2 idle soctf-pdc-[020-021]
xw-2245 up 3:00:00 2 idle soctf-pdc-[023-024]
Submitting a job
There are a few ways to submit a job.
Using the srun command
The simplest way is to use the srun command. This allows you to run a command on a node.
For example, to run the hostname command on a node, you can do:
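$ srun hostname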
You can also run a command on multiple nodes. For example, to run the hostname command on 2 nodes, you can do:
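$ srun --nodes=2 hostname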
You can also run a command on a specific partition. For example, to run the hostname command on 2 nodes in the xs-4114 partition, you can do:
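$ srun --nodes=2 --partition=xs-4114 hostname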
You can also ask for more time to run your job. By default, a job is given at most 1 minute to run if no time limit is specified; if you do specify a time limit, the maximum is 3 hours. For example, to run the sleep command for 2 minutes, you can do:
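$ srun --time=00:03:00 sleep 120
(Here the requested time limit is a little longer than the sleep itself, so the job is not killed by its own limit.)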
However, this is not the most common way to use Slurm because it is interactive: you need to stay logged in while the command runs, just like running a command on your own machine. This is inconvenient for long-running jobs, as you need to keep your terminal open for the entire duration of the job.
Using the sbatch command
The most common way is to use the sbatch command. This allows you to submit a job from the command line.
Let's first define a Slurm job script. A Slurm job script is a shell script that contains the commands that you want to run. For example, let's create a job script called myjob.sh that contains the following:
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --time=00:00:30
#SBATCH --output=myjob_%j.log
#SBATCH --error=myjob_%j.log
##SBATCH --partition=i7-7700
echo "Running job!"
echo "We are running on $(hostname)"
echo "Job started at $(date)"
# Actual "job"
sleep 5;
# This is useful to know (in the logs) when the job ends
echo "Job ended at $(date)"
Components of this job script
- #!/bin/bash - This is the shebang line. It tells the shell that this is a bash script.
- #SBATCH --job-name=myjob - This is the name of the job. This is useful for identifying your job in the queue.
- #SBATCH --nodes=1 - This is the number of nodes that you want to run your job on. In this case, we are asking for 1 node. Note that we allocate the entire node to your job when it is running - no other jobs (i.e., from other users or even yourself) will be running on that node!
- #SBATCH --ntasks=1 - This is the total number of tasks (processes) that you want to run. In this case, we are asking for 1 task.
- #SBATCH --mem=1gb - This is the amount of memory that you want to reserve for your job. In this case, we are asking for 1 GB of memory.
- #SBATCH --time=00:00:30 - This is the amount of time that you want to reserve for your job. In this case, we are asking for 30 seconds.
- #SBATCH --output=myjob_%j.log - This is the name of the output file. In this case, we are asking for the output file to be called myjob_<jobid>.log. The <jobid> will be replaced by the job ID of the job. This is useful for matching log files to jobs.
- #SBATCH --error=myjob_%j.log - Same as above, but for the error file.
- #SBATCH --partition=i7-7700 - This is the partition you would like to run your job on. This setting is commented out in the script above; remove one of the #s to enable it.
- This job script will run the sleep command for 5 seconds.
Submitting the job
To submit the job, you can do:
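$ sbatch myjob.sh
Submitted batch job 123456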
This will submit the job to Slurm, and Slurm will run the job when it is ready. You can check the status of the job using the squeue command:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 all myjob cristina R 0:05 1 soctf-pdc-004
You can see that the job is running on node soctf-pdc-004.
When the job is done, you can see that the job is no longer in the queue:
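$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)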
You can also see that the job has finished by looking at the output file. Since our job script sets --output=myjob_%j.log, the output file is named myjob_<jobid>.log; in this case, it is myjob_123456.log. You can see the contents of the output file using the cat command:
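$ cat myjob_123456.log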
You can also check sacct to see the status of your job:
$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
123456 myjob all students 20 COMPLETED 0:0
Note
We allocate the entire node to you for each job. That is, there are guaranteed to be no other jobs running on that node as long as your job is on it. Therefore, there is no need to pass flags such as --exclusive to Slurm.
Slurm Foundations
This section contains more fundamental details about Slurm.
Slurm Partitions
The lab machines are divided into partitions. Each partition is a group of machines that have similar characteristics. For example, the xs-4114 partition contains machines that have Intel Xeon Silver 4114 CPUs.
A partition is like a job queue. Note that although a node can belong to multiple partitions, there is only a single global job queue for all nodes.
If a job is submitted without specifying a partition, the job will be executed on a node in the default partition.
The default partition is "all". If you run the sinfo command, the partition name marked with * is the default partition.
There is a time limit for each partition. If your job exceeds the partition’s time limit, it will be killed and the state of the job will be set to TIMEOUT. There is also a default time limit, applied to jobs that do not specify one; in our cluster, the default time limit is 1 minute.
Fair Share and Priority
Slurm uses a queue to schedule the submitted jobs. If two or more jobs require the same node, then the job that gets to be scheduled first depends on the scheduling policy of Slurm.
The scheduling policy of Slurm is configured to use a fair-share policy. The job of a student x who has a higher usage of the cluster resources will have a lower priority to be scheduled than the job of a student y that has a lower usage of the cluster resources, regardless of the submission time of the job, i.e., the jobs are not scheduled based on FIFO order.
Currently, the cluster resources used by a job are defined as allocated_cpus * seconds, where:
* allocated_cpus is the total number of CPU cores on all the requested nodes of a job. For example, if a job executes on two nodes where the first node has 8 cores and the second node has 20 cores, the value of allocated_cpus will be 28.
* seconds is the elapsed time of the job.
So if that 28-core job ran for 100 seconds, it would have used 28 * 100 = 2800 units of cluster resources.
The cluster resource usage of a student is the total cluster resources used by all the jobs submitted by the student. The cluster resource usage is then used to compute the fair share value of a student. The fair share value will be used in job scheduling.
Note that Slurm "forgets" your usage after a while. This is to ensure that you can make mistakes but still get back your normal priority on the cluster after a short "cooldown" period.
Slurm Recipes for Common Tasks
Advanced sacct options
- By default, sacct only shows jobs that are within the current day.
- Use the -S <datetime> option to view jobs from before today. Specifying the -S <datetime> option will show all the jobs that are after the given datetime. For example, run sacct -S 2022-02-01 to view jobs that are after 2022-02-01T00:00.
- You can use the -o option of sacct to specify custom fields to display for your job. Refer to the documentation of sacct to view all available fields and how to specify the output format.
  - For example, to view the job ID, job name, partition, user, state, and time elapsed, you can do sacct -o jobid,jobname,partition,user,state,elapsed.
Slurm constraints
You can use the --constraint option to run jobs on nodes that satisfy certain constraints. This allows for more complex workflows. For example, to submit a job to run on two nodes of type xs-4114 and one node of type i7-7700 in the default partition:
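$ sbatch --nodes=3 --constraint="[xs-4114*2&i7-7700*1]" myjob.sh
This assumes the node types are exposed as Slurm node features with the same names as the partitions; the exact feature names are cluster-specific, so check with the teaching team if this does not work as expected.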
Killing jobs
You can kill a job using the scancel command. For example, to kill job 123456:
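$ scancel 123456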
Viewing your current fair share
You can view your current fair share using the sshare command. For example, to view your current fair share:
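$ sshare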
A lower fair share means a lower priority. You can compare this with the fair share of your friends to see who is using the cluster the most!
Debugging programs
If a job fails and you cannot figure out why (e.g., stdout/err not appearing), you can open a terminal on the node that the job was running on. To do this, you can use the srun command with the --pty option. For example, to open a terminal on a specific node soctf-pdc-005:
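$ srun --nodelist=soctf-pdc-005 --pty bash
Note that the default time limit (1 minute) applies to this interactive job as well, so pass a --time limit if you need a longer debugging session.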
Make sure to exit the terminal when you are done debugging by using the exit command. Remember: all the time you leave your terminal job open on that node is counted towards your fair share.