SoC Compute Cluster GPU Guide
This document describes how to access the SoC Compute Cluster's GPU nodes.
We will use this for the second part of this module (GPGPU programming) as our soctf
machines do not have GPUs of their own. To be extra clear: you cannot use soctf
machines for Lab 3 / Assignment 2 - you will use the SoC Compute Cluster instead.
Note: Pre-setup / troubleshooting before accessing the SoC Compute Cluster
- If you use VSCode, we recommend adding the following settings to your "User Settings (JSON)" file, just before the final closing `}`. To open it, press Ctrl-Shift-P, type "User Settings JSON", and press ENTER:

  ```
  "remote.SSH.useLocalServer": false,
  "remote.SSH.useExecServer": false,
  "remote.SSH.enableDynamicForwarding": false,
  "remote.SSH.showLoginTerminal": false,
  "remote.SSH.enableRemoteCommand": false
  ```
- If you have issues opening a terminal or a VSCode session after these setting changes, these are our recommendations (note that we don't have control over the SoC cluster):
  - Connect to a specific login node, e.g., `xlogin0`, `xlogin1`, or `xlogin2`, instead of the generic `xlog`/`xlogin`, which are load balancers.
  - If you have access to a terminal session, you can try to kill any of your existing processes; you can list them via `ps aux | grep <your username>`.
- `xlogin` nodes block outgoing connections to port 22 (SSH). If you want to synchronize your GitHub repository (i.e. `git clone`) via SSH, create the file `~/.ssh/config` if it does not exist, and add lines to it to bypass the block (see the sketch below).
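For reference, a minimal sketch of such a config, assuming you want to reach GitHub specifically: GitHub documents an SSH-over-HTTPS endpoint (`ssh.github.com`, port 443) that avoids the blocked port 22. Use the course-provided lines instead if they differ.

```
# ~/.ssh/config (sketch): route GitHub SSH traffic over port 443 instead of the
# blocked port 22, using GitHub's documented ssh.github.com endpoint.
Host github.com
    Hostname ssh.github.com
    Port 443
    User git
```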
Accessing the SoC Compute Cluster
- You will need an SoC UNIX ID to access these machines. If you don't have one, please go to https://mysoc.nus.edu.sg/~newacct to create one.
- This is entirely different from your CS3210 lab credentials - please forget about those for now.
- You need to enable your access to the SoC Compute Cluster at https://mysoc.nus.edu.sg/~myacct/services.cgi.
- You will either need to be within the SoC network or log in to the SoC VPN - same restrictions as our soctf machines.
- SSH to `your_soc_unix_id_here@xlogin.comp.nus.edu.sg`.
- You will be prompted for a password. Enter your SoC UNIX password.
- Now you have access to SoC's Slurm Compute Cluster. You can run `sinfo` to see the available machines, and `srun` to run a job on the cluster (see the example after this list). Please see this link for more details on how to use SoC Slurm, although the instructions from our labs should suffice.
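For example, a first session might look like the sketch below (replace the username placeholder with your own SoC UNIX ID; `srun --gpus=1 hostname` simply confirms that a job can be dispatched, since SoC Slurm requires a GPU request, as explained later):

```
# From a machine on the SoC network or VPN:
ssh your_soc_unix_id_here@xlogin.comp.nus.edu.sg

# On the login node: list the available partitions and nodes, then run a trivial job.
sinfo
srun --gpus=1 hostname
```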
Differences vs the soctf cluster
- SoC Slurm partitions are based on job length: the default partitions (`normal`, `gpu`) can host jobs up to 15 minutes long by default, though the limit can be increased up to 3 hours. We should only need the `normal`/`gpu` partitions for CS3210, unless the nodes you require are not available on them. Please see this link for more details on partitions.
- Targeting a certain hardware type: instead of using `--partition`, we specify the GPUs we want to run our Slurm job on with either `--constraint`, `--gpus`, or `--gres`. More on that later.
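As a concrete sketch, the job length is controlled with the standard Slurm `--time` option; `./my_program` below is a placeholder for your own executable:

```
# Request one GPU and a 1-hour time limit (within the 3-hour maximum mentioned above).
srun --gpus=1 --time=01:00:00 ./my_program
```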
Running jobs on GPU nodes
Now that you are on the login node of the SoC compute cluster, we can start running jobs on the GPU nodes.
Node list and details
The nodes in the cluster, and some useful details about them, are listed below:
| Node Name | Slurm GPU Name | NVIDIA GPU Name | Compute Capability | GPU Memory | Notes |
|---|---|---|---|---|---|
| xgpc[0-9] | nv | Tesla V100 | 7.0 | 16GB | |
| xgpd[0-9] | nv | Titan V | 7.0 | 12GB | |
| xgpe[0-11] | nv | Titan RTX | 7.5 | 24GB | |
| xgpf[0-10] | nv | Tesla T4 | 7.5 | 16GB | |
| xgpg[0-9] | a100-40 | A100 40GB | 8.0 | 40GB | |
| xgph[0-9] | a100-80 | A100 80GB | 8.0 | 80GB | |
| xgph[10-19] | a100-40 | A100 80GB | 8.0 | 40GB | Each A100 80GB GPU is split into 2x 40GB instances via NVIDIA Multi-Instance GPU (MIG) |
| xgpi[0-9] | h100-96 | H100 96GB | 9.0 | 96GB | Each node contains 2x H100 96GB |
| xgpi[10-19] | h100-47 | H100 96GB | 9.0 | 47GB | Each node's 2x H100 96GB GPUs are split into 4x 47GB instances via NVIDIA Multi-Instance GPU (MIG) |
One of the most important things to note is the GPU's Compute Capability. This indicates the features supported by the GPU, and is important for compiling your CUDA code.
You can find more details at SoC's Compute Cluster Hardware Documentation.
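For instance, `nvcc` targets a Compute Capability via its `-arch` flag; the file names below are placeholders, and the `sm_XX` value should match the node you intend to run on:

```
# Target Compute Capability 7.0 (e.g., the Tesla V100 / Titan V nodes).
nvcc -arch=sm_70 -o my_prog my_prog.cu

# Target Compute Capability 9.0 (e.g., the H100 nodes).
nvcc -arch=sm_90 -o my_prog my_prog.cu
```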
Requesting a GPU resource
By default, SoC Slurm will refuse to dispatch your job if you do not explicitly request a GPU. You do this by specifying a Slurm GPU resource: for `srun`, that means adding the option `--gpus=1` or `-G 1`; for `sbatch`, that means adding the line `#SBATCH --gpus=1` to the top of your script.
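For example, a minimal `sbatch` script might look like the sketch below (the job name, time limit, and final command are placeholders for your own job); submit it with `sbatch job.sh`, or run the same thing interactively with `srun --gpus=1 nvidia-smi`:

```
#!/bin/bash
#SBATCH --job-name=gpu-test    # placeholder job name
#SBATCH --gpus=1               # request one GPU of any type
#SBATCH --time=00:05:00        # stay within the partition's time limit

nvidia-smi                     # replace with your own program
```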
Requesting specific GPU resources
While `--gpus=1` will get you any machine with a GPU, you will probably want to request a specific type of node (e.g., to test under the same Compute Capability).
Requesting via node type
This method should be used for running jobs on older GPUs (i.e. not A100/H100). To request a specific node type (e.g., you only want the machines with Titan V), you can add the `--constraint` option to your `srun`/`sbatch` command, e.g., for requesting any `xgpd` node that has a Titan V:
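A sketch, assuming the constraint (feature) name matches the node prefix `xgpd`; you can check the actual feature names advertised by each node with `sinfo -o "%N %f"`:

```
# Assumed feature name "xgpd"; a GPU must still be requested explicitly.
srun --constraint=xgpd --gpus=1 nvidia-smi
```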
Requesting via GPU name
This method is required to run jobs on the A100 or H100 nodes. To request a specific GPU type (e.g., you only want an entire NVIDIA A100 80GB GPU), you can add the option `--gpus=<gpu_type>` or `-G <gpu_type>` to your `srun`/`sbatch` command, e.g., for requesting an entire NVIDIA A100 80GB GPU:
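A sketch using the Slurm GPU name from the table above:

```
# Request one entire NVIDIA A100 80GB GPU by its Slurm GPU name.
srun --gpus=a100-80 nvidia-smi
```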
Here is an example of allocating a GPU node, getting its hostname, and running `nvidia-smi` (a tool to monitor the GPU) on it:
```
$ srun -G h100-96 bash -c "hostname; nvidia-smi"
xgpi2
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02    Driver Version: 555.42.02    CUDA Version: 12.5                 |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name        Persistence-M          | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap          |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 NVL              Off   |   00000000:E3:00.0 Off |                    0 |
| N/A   29C    P0             62W / 400W  |       1MiB / 95830MiB  |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
---truncated---
```
Alternatively, you may use `--gres` to achieve the same allocation:
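A sketch of the equivalent request; `--gres` takes the form `gpu:<name>:<count>`, with the GPU name taken from the table above:

```
# Equivalent allocation of the same GPU type via --gres.
srun --gres=gpu:h100-96:1 bash -c "hostname; nvidia-smi"
```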
In this case, we made SoC allocate `xgpi2` (one of our most expensive GPU nodes) to us :)
You can see in the output that this is an NVIDIA H100 96GB GPU (worth around USD 30,000) running CUDA version 12.5.
Multi-Instance GPU (MIG) Nodes
Note that some of the `xgph` and `xgpi` machines are special!
Each H100 96GB GPU in the `xgpi[10-19]` nodes is virtually split into 2 "smaller GPUs" (via NVIDIA's Multi-Instance GPU system). This allows 2 different jobs to each be given an isolated slice of the `xgpi` node's GPU, so that more users can use H100 GPUs.
You can test this by running `srun --gpus=h100-47 nvidia-smi` and looking at the "MIG Devices" section.
This is the same for the A100 80GB GPUs in the `xgph[10-19]` nodes.
There is no performance benefit to using these MIG nodes (in fact, your code is likely to run slower), but they exist because the nodes with full A100 80GB or H100 96GB GPUs might be fully occupied; this way, more users can test their code on Compute Capability 8.0 or 9.0.