SoC Compute Cluster GPU Guide
This document describes how to access the SoC Compute Cluster's GPU nodes.
We will use this for the second part of this module (GPGPU programming) as our soctf
machines do not have GPUs of their own. To be extra clear: you cannot use soctf
machines for Lab 3 / Assignment 2 - you will use the SoC Compute Cluster instead.
Note: Pre-setup / troubleshooting before accessing the SoC Compute Cluster
- If you use VSCode, we recommend adding the following settings to your "User Settings (JSON)" file, just before the final closing `}`. To open it, press Ctrl-Shift-P, type "User Settings JSON", and press ENTER:

  ```
  "remote.SSH.useLocalServer": false,
  "remote.SSH.useExecServer": false,
  "remote.SSH.enableDynamicForwarding": false,
  "remote.SSH.showLoginTerminal": false,
  "remote.SSH.enableRemoteCommand": false
  ```
- If you have issues opening a terminal or a VSCode session after these setting changes, these are our recommendations (note that we don't have control over the SoC cluster):
  - Connect to a specific login node, e.g., `xlogin0`, `xlogin1`, or `xlogin2`, instead of the generic `xlog`/`xlogin`, which are load balancers.
  - If you have access to a terminal session, you can try to kill any of your existing processes; you can list them via `ps aux | grep <your username>`.
- `xlogin` nodes block outgoing connections to port 22 (SSH). If you want to synchronize your GitHub repository (i.e. `git clone`) via SSH, create the file `~/.ssh/config` if it does not exist, and add lines to it to bypass the block (see the sketch below).
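For reference, a minimal sketch of such a config, assuming you want to reach GitHub specifically: GitHub documents an SSH-over-HTTPS endpoint (`ssh.github.com`, port 443) that avoids the blocked port 22. Use the course-provided lines instead if they differ.

```
# ~/.ssh/config (sketch): route GitHub SSH traffic over port 443 instead of the
# blocked port 22, using GitHub's documented ssh.github.com endpoint.
Host github.com
    Hostname ssh.github.com
    Port 443
    User git
```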
Accessing the SoC Compute Cluster
- You will need an SoC UNIX ID to access these machines. If you don't have one, please go to https://mysoc.nus.edu.sg/~newacct to create one.
- This is entirely different from your CS3210 lab credentials - please forget about those for now.
- You need to enable your access to the SoC Compute Cluster at https://mysoc.nus.edu.sg/~myacct/services.cgi.
- You will either need to be within the SoC network or log in to the SoC VPN - same restrictions as our soctf machines.
- SSH to `your_soc_unix_id_here@xlogin.comp.nus.edu.sg`.
- You will be prompted for a password. Enter your SoC UNIX password.
- Now you have access to SoC's Slurm Compute Cluster. You can run `sinfo` to see the available machines, and `srun` to run a job on the cluster (see the example after this list). Please see this link for more details on how to use SoC Slurm, although the instructions from our labs should suffice.
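For example, a first session might look like the sketch below (replace the username placeholder with your own SoC UNIX ID; `srun --gpus=1 hostname` simply confirms that a job can be dispatched, since SoC Slurm requires a GPU request, as explained later):

```
# From a machine on the SoC network or VPN:
ssh your_soc_unix_id_here@xlogin.comp.nus.edu.sg

# On the login node: list the available partitions and nodes, then run a trivial job.
sinfo
srun --gpus=1 hostname
```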
Differences vs the soctf cluster
- SoC Slurm partitions are based on job length: the default partitions (`normal`, `gpu`) can host jobs up to 15 minutes long by default, though the limit can be increased up to 3 hours. We should only need the `normal`/`gpu` partitions for CS3210, unless the nodes you require are not available on them. Please see this link for more details on partitions.
- Targeting a certain hardware type: instead of using `--partition`, we specify the GPUs we want to run our Slurm job on with either `--constraint`, `--gpus`, or `--gres`. More on that later.
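As a concrete sketch, the job length is controlled with the standard Slurm `--time` option; `./my_program` below is a placeholder for your own executable:

```
# Request one GPU and a 1-hour time limit (within the 3-hour maximum mentioned above).
srun --gpus=1 --time=01:00:00 ./my_program
```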
Running jobs on GPU nodes
Now that you are on the login node of the SoC compute cluster, we can start running jobs on the GPU nodes.
Node list and details
The nodes in the cluster, and some useful details about them, are listed below:
| Node Name | Slurm GPU Name | NVIDIA GPU Name | Compute Capability | GPU Memory | Notes |
|---|---|---|---|---|---|
| xgpc[0-9] | nv | Tesla V100 | 7.0 | 16GB | |
| xgpd[0-9] | nv | Titan V | 7.0 | 12GB | |
| xgpe[0-11] | nv | Titan RTX | 7.5 | 24GB | |
| xgpf[0-10] | nv | Tesla T4 | 7.5 | 16GB | |
| xgpg[0-9] | a100-40 | A100 40GB | 8.0 | 40GB | |
| xgph[0-9] | a100-80 | A100 80GB | 8.0 | 80GB | |
| xgph[10-19] | a100-40 | A100 80GB | 8.0 | 40GB | Each A100 80GB GPU is split into 2x 40GB instances via NVIDIA Multi-Instance GPU (MIG) |
| xgpi[0-9] | h100-96 | H100 96GB | 9.0 | 96GB | Each node contains 2x H100 96GB |
| xgpi[10-19] | h100-47 | H100 96GB | 9.0 | 47GB | Each node's 2x H100 96GB GPUs are split into 4x 47GB instances via NVIDIA Multi-Instance GPU (MIG) |
One of the most important things to note is the GPU's Compute Capability. This indicates the features supported by the GPU, and is important for compiling your CUDA code.
You can find more details at SoC's Compute Cluster Hardware Documentation.
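For instance, `nvcc` targets a Compute Capability via its `-arch` flag; the file names below are placeholders, and the `sm_XX` value should match the node you intend to run on:

```
# Target Compute Capability 7.0 (e.g., the Tesla V100 / Titan V nodes).
nvcc -arch=sm_70 -o my_prog my_prog.cu

# Target Compute Capability 9.0 (e.g., the H100 nodes).
nvcc -arch=sm_90 -o my_prog my_prog.cu
```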
Requesting a GPU resource
By default, SoC Slurm will refuse to dispatch your job if you do not explicitly request a GPU. You do this by specifying a Slurm GPU resource: for `srun`, that means adding the option `--gpus=1` or `-G 1`; for `sbatch`, that means adding the line `#SBATCH --gpus=1` to the top of your script.
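For example, a minimal `sbatch` script might look like the sketch below (the job name, time limit, and final command are placeholders for your own job); submit it with `sbatch job.sh`, or run the same thing interactively with `srun --gpus=1 nvidia-smi`:

```
#!/bin/bash
#SBATCH --job-name=gpu-test    # placeholder job name
#SBATCH --gpus=1               # request one GPU of any type
#SBATCH --time=00:05:00        # stay within the partition's time limit

nvidia-smi                     # replace with your own program
```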
Requesting specific GPU resources
While `--gpus=1` will get you any machine with a GPU, you will probably want to request a specific type of node (e.g., to test under the same Compute Capability).
Requesting via node type
This method should be used for running jobs on older GPUs (i.e. not A100/H100). To request a specific node type (e.g., you only want the machines with Titan V), you can add the `--constraint` option to your `srun`/`sbatch` command, e.g., for requesting any `xgpd` node that has a Titan V:
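A sketch, assuming the constraint (feature) name matches the node prefix `xgpd`; you can check the actual feature names advertised by each node with `sinfo -o "%N %f"`:

```
# Assumed feature name "xgpd"; a GPU must still be requested explicitly.
srun --constraint=xgpd --gpus=1 nvidia-smi
```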
Requesting via GPU name
This method is required to run jobs on the A100 or H100 nodes. To request a specific GPU type (e.g., you only want an entire NVIDIA A100 80GB GPU), you can add the option `--gpus=<gpu_type>` or `-G <gpu_type>` to your `srun`/`sbatch` command, e.g., for requesting an entire NVIDIA A100 80GB GPU:
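A sketch using the Slurm GPU name from the table above:

```
# Request one entire NVIDIA A100 80GB GPU by its Slurm GPU name.
srun --gpus=a100-80 nvidia-smi
```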
Here is an example of allocating a GPU node, getting its hostname, and running `nvidia-smi` (a tool to monitor the GPU) on it:
```
$ srun -G h100-96 bash -c "hostname; nvidia-smi"
xgpi2
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02    Driver Version: 555.42.02    CUDA Version: 12.5                 |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name        Persistence-M          | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap          |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 NVL              Off   |   00000000:E3:00.0 Off |                    0 |
| N/A   29C    P0             62W / 400W  |       1MiB / 95830MiB  |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
---truncated---
```
Alternatively, you may use `--gres` to achieve the same allocation:
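A sketch of the equivalent request; `--gres` takes the form `gpu:<name>:<count>`, with the GPU name taken from the table above:

```
# Equivalent allocation of the same GPU type via --gres.
srun --gres=gpu:h100-96:1 bash -c "hostname; nvidia-smi"
```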
In this case, we made SoC allocate `xgpi2` (one of our most expensive GPU nodes) to us :)
You can see in the output that this is an NVIDIA H100 96GB GPU (worth around USD 30,000) running CUDA version 12.5.
Multi-Instance GPU (MIG) Nodes
Note that some of the `xgph` and `xgpi` machines are special!
Each H100 96GB GPU in the `xgpi[10-19]` nodes is virtually split into 2 "smaller GPUs" (via NVIDIA's Multi-Instance GPU system). This allows 2 different jobs to each be given an isolated slice of the `xgpi` node's GPU, so that more users can use H100 GPUs.
You can test this by running `srun --gpus=h100-47 nvidia-smi` and looking at the "MIG Devices" section.
This is the same for the A100 80GB GPUs in the `xgph[10-19]` nodes.
There is no performance benefit to using these MIG nodes (in fact, your code is likely to run slower), but they exist because the nodes with full A100 80GB or H100 96GB GPUs might be fully occupied; this way, more users can test their code on Compute Capability 8.0 or 9.0.