Maximizing Nvidia GPU utilization on Azure Kubernetes Service with Time-Slicing and Multi-Instance GPU (MIG)

Jay Lee
13 min read · May 29, 2024

November 30th, 2022 will be remembered in IT history as a monumental moment: the day OpenAI opened a service called ChatGPT to the public. Since then, it seems like we can’t go a single day without hearing about ‘AI’ or ‘Copilot’. Needless to say, what powers ChatGPT is the Graphics Processing Unit (GPU), providing the computational power for training and running LLMs. It’s no wonder Nvidia’s stock price has been skyrocketing since ChatGPT’s release.

“If you had bought Nvidia on that day, you would be up 428% as of March 12. That means that $10,000 invested in Nvidia on the day ChatGPT came out would have turned into $52,800" — https://www.fool.com/investing/2024/03/18/if-you-invested-10000-in-nvidia-the-day-chatgpt-ca/

Interestingly, the rise of AI and LLMs is also driving innovation in the Kubernetes world, the most popular platform choice across on-premises and cloud. The AKS AI Toolchain Operator (KAITO) is a good example: a managed add-on that simplifies running OSS AI models on AKS clusters. KAITO automatically provisions the necessary GPU nodes and sets up the associated inference server as an endpoint for your AI models. KAITO will be the topic of an upcoming article.

Recently, a developer asked for my help in setting up GPU sharing on AKS, so I decided to write a simple guide. I found a nice article on the Microsoft Tech Community (https://techcommunity.microsoft.com/t5/azure-high-performance-computing/running-gpu-accelerated-workloads-with-nvidia-gpu-operator-on/ba-p/4061318), and most of the steps below come from that post, with a few tweaks for optimization.

Sharing Nvidia GPUs with Time-Slicing and MIG

When it comes to sharing Nvidia GPUs on Kubernetes, there are several options, but the most popular are Time-Slicing and Multi-Instance GPU (MIG). Let’s take a look at each of them.

Time Slicing

  1. Resource Sharing: Time slicing allows multiple workloads to share the same GPU by dividing the GPU time among them. This can be useful for workloads that are not latency-sensitive and can tolerate waiting for their turn.
  2. Maximizing Throughput: For workloads that can benefit from burst computation, time slicing can help maximize throughput by efficiently utilizing idle GPU cycles.
  3. Flexibility: Time slicing provides flexibility in managing GPU resources, as it allows more dynamic allocation of GPU time based on workload needs.
  4. Dynamic Workloads: In environments where workloads are highly dynamic and change frequently, time slicing can ensure that all jobs get some GPU time, even if not all at the same time.

Multi-Instance GPU (MIG)

  1. Resource Isolation: MIG allows a single supported Nvidia GPU (such as the A100) to be partitioned into multiple smaller, independent GPU instances. This provides strict isolation between different workloads.
  2. Improved Utilization: By partitioning a single GPU into multiple instances, you can run multiple smaller workloads concurrently, which leads to better overall utilization of GPU resources.
  3. Cost Efficiency: For applications that do not require the full power of a single GPU, MIG provides a cost-effective way to leverage the GPU by allowing multiple smaller jobs to share the same physical GPU.
  4. Performance Consistency: Each MIG instance has dedicated resources, which ensures more predictable performance for each workload compared to sharing a single GPU without isolation.

Using Time-Slicing or MIG is not a binary decision; you can use both of them simultaneously. I will try out each option on AKS in this article.

Creating a Cluster with a GPU Node Pool

Let’s create a small cluster with one system pool and one user pool containing a single GPU node. One thing to note is the node taint on the user pool, which will be used by the scheduler.

$ az aks create \
--name gpu-cluster \
--resource-group sandbox-rg \
--ssh-key-value ~/.ssh/id_rsa.pub \
--enable-aad \
--node-count 1 \
--node-vm-size Standard_DS3_v2 \
--network-plugin azure \
--network-policy calico \
--node-osdisk-type Ephemeral \
--max-pods 110 \
--service-cidr 10.10.0.0/24 \
--dns-service-ip 10.10.0.10 \
--load-balancer-backend-pool-type=nodeIP

$ az aks nodepool add --resource-group sandbox-rg --cluster-name gpu-cluster \
--name nc24ads --node-taints sku=gpu:NoSchedule \
--node-vm-size Standard_NC24ads_A100_v4 --node-count 1
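
Before moving on, it’s worth confirming that the taint actually landed on the pool; the command below should return the sku=gpu:NoSchedule taint (a quick check against the node pool’s nodeTaints property).

$ az aks nodepool show --resource-group sandbox-rg --cluster-name gpu-cluster --name nc24ads --query nodeTaints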

Installing node-feature-discovery and GPU Operator

I will start with the Node Feature Discovery Operator which is a prerequisite for the NVIDIA GPU Operator. This operator manages the detection of hardware features and configuration in the cluster by labeling nodes with hardware-specific information.

$ helm install --wait --create-namespace -n gpu-operator node-feature-discovery node-feature-discovery --repo https://kubernetes-sigs.github.io/node-feature-discovery/charts --set-json master.config.extraLabelNs='["nvidia.com"]' --set-json worker.tolerations='[{ "effect": "NoSchedule", "key": "sku", "operator": "Equal", "value": "gpu"},{"effect": "NoSchedule", "key": "mig", "value":"notReady", "operator": "Equal"}]'
$ kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
node-feature-discovery-gc-68dcd44667-77dgp 1/1 Running 0 6m39s
node-feature-discovery-master-69d5598c6d-rbs6m 1/1 Running 0 6m39s
node-feature-discovery-worker-n8n46 1/1 Running 0 6m40s
node-feature-discovery-worker-rd7z4 1/1 Running 0 6m40s

The Node Feature Discovery Operator uses vendor PCI IDs to identify hardware on the node. For Nvidia, the vendor PCI ID is 10de. Create a NodeFeatureRule as shown below.

$ kubectl apply -f - <<EOF
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: nfd-gpu-rule
  namespace: gpu-operator
spec:
  rules:
    - name: "nfd-gpu-rule"
      labels:
        "feature.node.kubernetes.io/pci-10de.present": "true"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["10de"]}
EOF

Once NFD is set up, it’s time to install the Nvidia GPU Operator. The GPU Operator automates the management of all the Nvidia software components needed to provision GPUs. These components include the Nvidia drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the Nvidia Container Toolkit, automatic node labelling using GFD, DCGM-based monitoring, and others. For more information, check out the official documentation: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html

To proceed, add the Nvidia Helm repository and update the repo list.

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

Before we install the GPU Operator, let’s review a few things:

  1. NFD is already installed, so we won’t need to enable it again. (--set nfd.enabled=false)
  2. The drivers were installed when the node pool was provisioned, so we’ll skip the driver installation. (--set driver.enabled=false)
  3. We’ll set the runtime class name to “nvidia-container-runtime”, which will be created by the GPU Operator. (--set operator.runtimeClass=nvidia-container-runtime)

$ helm install --wait gpu-operator -n gpu-operator nvidia/gpu-operator --set-json daemonsets.tolerations='[{ "effect": "NoSchedule", "key": "sku", "operator": "Equal", "value": "gpu"}]' --set nfd.enabled=false --set driver.enabled=false --set operator.runtimeClass=nvidia-container-runtime
$ kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-x8dxz 1/1 Running 0 3m24s
gpu-operator-7bbf8bb6b7-ssnzp 1/1 Running 0 3m34s
node-feature-discovery-gc-68dcd44667-77dgp 1/1 Running 0 25m
node-feature-discovery-master-69d5598c6d-rbs6m 1/1 Running 0 25m
node-feature-discovery-worker-n8n46 1/1 Running 0 25m
node-feature-discovery-worker-rd7z4 1/1 Running 0 25m
nvidia-container-toolkit-daemonset-tnvpj 1/1 Running 0 3m25s
nvidia-cuda-validator-jhhcv 0/1 Completed 0 3m14s
nvidia-dcgm-exporter-fsxsw 1/1 Running 0 3m24s
nvidia-device-plugin-daemonset-8mh9n 1/1 Running 0 3m24s
nvidia-mig-manager-9cwqt 1/1 Running 0 116s
nvidia-operator-validator-fbjft 1/1 Running 0 3m25s

As explained earlier, the GPU Operator installs several components required to run GPU workloads on Kubernetes. At this point, the allocatable GPU count is still one.

$ kubectl describe node aks-nc24ads-42333774-vmss000000 | grep -y7 Allocatable
...
Allocatable:
cpu: 23660m
ephemeral-storage: 119703055367
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 214295376Ki
nvidia.com/gpu: 1
pods: 30
...
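
If you only care about the GPU count, a jsonpath query keeps it short (a minimal sketch; substitute your own node name). At this point it should return 1.

$ kubectl get node aks-nc24ads-42333774-vmss000000 -o jsonpath="{.status.allocatable['nvidia\.com/gpu']}"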

Configuring Time-slicing

Configuring Time-Slicing requires a bit of understanding of the Nvidia device plugin. Here is the introduction from the official repository.

The NVIDIA device plugin for Kubernetes is a DaemonSet that allows you to automatically:

  • Expose the number of GPUs on each nodes of your cluster
  • Keep track of the health of your GPUs
  • Run GPU enabled containers in your Kubernetes cluster.

Below are the steps to configure the device plugin for Time-Slicing on Kubernetes. Note that I set replicas to 4 below.

$ az aks nodepool update --cluster-name gpu-cluster --resource-group sandbox-rg --nodepool-name nc24ads --labels "nvidia.com/device-plugin.config=a100"
$ kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-all
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
EOF
$ kubectl patch clusterpolicies.nvidia.com/cluster-policy -n gpu-operator --type merge -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-all", "default": "any"}}}}'

The change to the cluster policy will be detected by the GPU Operator, which will then restart the device plugin DaemonSet and GPU Feature Discovery to configure time slicing on the node.

$ kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-p5lss 2/2 Running 0 8s
gpu-operator-7bbf8bb6b7-ssnzp 1/1 Running 0 18m
node-feature-discovery-gc-68dcd44667-77dgp 1/1 Running 0 39m
node-feature-discovery-master-69d5598c6d-rbs6m 1/1 Running 0 39m
node-feature-discovery-worker-n8n46 1/1 Running 0 39m
node-feature-discovery-worker-rd7z4 1/1 Running 0 39m
nvidia-container-toolkit-daemonset-tnvpj 1/1 Running 0 17m
nvidia-cuda-validator-jhhcv 0/1 Completed 0 17m
nvidia-dcgm-exporter-fsxsw 1/1 Running 0 17m
nvidia-device-plugin-daemonset-njfnf 2/2 Running 0 8s
nvidia-mig-manager-9cwqt 1/1 Running 0 16m
nvidia-operator-validator-fbjft 1/1 Running 0 17m

The device plugin DaemonSet makes some changes on the node. Can you spot what has changed?

$ kubectl describe node aks-nc24ads-42333774-vmss000000 | grep -y7 Allocatable
...
Allocatable:
cpu: 23660m
ephemeral-storage: 119703055367
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 214295372Ki
nvidia.com/gpu: 4
pods: 30
...

If you’re unsure, pay attention to Allocatable, specifically nvidia.com/gpu. Time-Slicing has increased the number of allocatable GPUs, which can now be requested through Kubernetes resource limits. Let’s test this out with a sample app from the article above.

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: samples-tf-mnist-demo-ts
  name: samples-tf-mnist-demo-ts
spec:
  completions: 5
  parallelism: 5
  completionMode: Indexed
  template:
    metadata:
      labels:
        app: samples-tf-mnist-demo-ts
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values:
                      - Standard_NC24ads_A100_v4
      containers:
        - name: samples-tf-mnist-demo
          image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
          args: ["--max_steps", "50000"]
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: OnFailure
      tolerations:
        - key: "sku"
          operator: "Equal"
          value: "gpu"
          effect: "NoSchedule"

Run the sample app.

$ kubectl apply -f test-job-ts.yml
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
samples-tf-mnist-demo-ts-0-jslwp 1/1 Running 0 57s
samples-tf-mnist-demo-ts-1-4n22x 1/1 Running 0 57s
samples-tf-mnist-demo-ts-2-6g5mp 1/1 Running 0 57s
samples-tf-mnist-demo-ts-3-ss7c8 0/1 Pending 0 57s
samples-tf-mnist-demo-ts-4-cjffx 1/1 Running 0 57s

You might wonder why one of the pods is Pending. This is because there are only 4 allocatable GPUs on the node and each pod requests a limit of 1 GPU. Let’s verify this, first from the Kubernetes side and then on the node itself.
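
The pending pod’s events will show why the scheduler can’t place it (the pod name comes from the listing above):

$ kubectl describe pod samples-tf-mnist-demo-ts-3-ss7c8 | grep -A5 Events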

root@aks-nc24ads-42333774-vmss000000:~# nvidia-smi
Tue May 28 15:35:02 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000001:00:00.0 Off | 0 |
| N/A 36C P0 71W / 300W | 2622MiB / 81920MiB | 11% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 116682 C python 646MiB |
| 0 N/A N/A 116738 C python 646MiB |
| 0 N/A N/A 116780 C python 646MiB |
| 0 N/A N/A 116873 C python 646MiB |
+---------------------------------------------------------------------------------------+

I can see all four processes running at the node level.

Configuring MIG

Unlike Time-Slicing, MIG partitions the hardware into pre-defined sizes called profiles. These profiles are predetermined by Nvidia and differ between products. The best way to find out what is available is to run the following command on the node.

root@aks-nc24ads-42333774-vmss000000:~# nvidia-smi mig -lgip
+-----------------------------------------------------------------------------+
| GPU instance profiles: |
| GPU Name ID Instances Memory P2P SM DEC ENC |
| Free/Total GiB CE JPEG OFA |
|=============================================================================|
| 0 MIG 1g.10gb 19 0/7 9.50 No 14 0 0 |
| 1 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 1g.10gb+me 20 0/1 9.50 No 14 1 0 |
| 1 1 1 |
+-----------------------------------------------------------------------------+
| 0 MIG 1g.20gb 15 0/4 19.50 No 14 1 0 |
| 1 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 2g.20gb 14 0/3 19.50 No 28 1 0 |
| 2 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 3g.40gb 9 0/2 39.25 No 42 2 0 |
| 3 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 4g.40gb 5 0/1 39.25 No 56 2 0 |
| 4 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 7g.80gb 0 0/1 78.75 No 98 5 0 |
| 7 1 1 |
+-----------------------------------------------------------------------------+

For example, if I use the 1g.10gb profile, a total of 7 GPU instances will be created; with 2g.20gb, there will be 3.
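
Profiles can also be mixed on the same GPU. If none of the built-in all-* configurations fit, the MIG manager accepts a custom mig-parted configuration from a ConfigMap, which you then reference under migManager.config in the ClusterPolicy and select with the nvidia.com/mig.config label. The sketch below only illustrates the format: the custom-mixed name and the 3x 1g.10gb + 2x 2g.20gb split are my own example, are not used later in this article, and the ConfigMap key is worth double-checking against your GPU Operator version.

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      custom-mixed:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 3
            "2g.20gb": 2
EOF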

Since Time-Slicing is already in place, configuring MIG on top of it would result in using both Time-Slicing and MIG simultaneously, as described earlier. To avoid confusion, I will remove Time-Slicing first. This can be done with the single command below. Make sure Time-Slicing is removed before proceeding.

$ kubectl patch clusterpolicies.nvidia.com/cluster-policy -n gpu-operator --type merge -p '{"spec": {"devicePlugin": {"config": {"name": ""}}}}'
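
Give the device plugin DaemonSet a moment to restart, then confirm the node reports a single GPU again before proceeding; allocatable nvidia.com/gpu should be back to 1.

$ kubectl describe node aks-nc24ads-42333774-vmss000000 | grep nvidia.com/gpu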

If you have read this far, you probably understand that GPU sharing is controlled by the device plugin and driven mainly by labels on the node. Just like with Time-Slicing, the first step in configuring MIG is to label the node. I will label it with nvidia.com/mig.config and choose the all-1g.10gb profile, which will give me a total of 7 GPU instances in return.

$ az aks nodepool update --cluster-name gpu-cluster --resource-group sandbox-rg --nodepool-name nc24ads --labels "nvidia.com/mig.config"="all-1g.10gb"

I also need to update the cluster policy.

$ kubectl patch clusterpolicy/cluster-policy -n gpu-operator --type merge -p '{"spec": {"migManager": {"env": [{"name": "WITH_REBOOT", "value": "true"}]}}}'

An important note here: on Ampere GPUs (like the A100 I’m using for this test), the MIG setup requires a node reboot to take effect, which Time-Slicing does not.

I need to reboot the node manually, either by SSHing in and running the reboot command or by restarting the underlying scale set instance as shown below. Once the reboot completes and the node becomes Ready, you will see 7 GPU instances as allocatable.
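
If you’d rather not set up SSH access, restarting the underlying scale set instance achieves the same thing. This is only a sketch: the VMSS name and instance ID come from the node name above, and the resource group assumes AKS’s default MC_<resource-group>_<cluster>_<region> node resource group naming, so adjust both to your environment.

$ az vmss restart --resource-group MC_sandbox-rg_gpu-cluster_<region> --name aks-nc24ads-42333774-vmss --instance-ids 0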

$ kubectl describe node aks-nc24ads-42333774-vmss000000 | grep -y7 Allocatable
cpu: 24
ephemeral-storage: 129886128Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 226764628Ki
nvidia.com/gpu: 7
pods: 30
Allocatable:
cpu: 23660m
ephemeral-storage: 119703055367
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 214295380Ki
nvidia.com/gpu: 7
pods: 30

Go to the node and check the MIG configuration. There are a total of 7 GPU instances (GI IDs 7 to 13) created, as shown below.

$ nvidia-smi
Wed May 29 06:43:12 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000001:00:00.0 Off | On |
| N/A 33C P0 75W / 300W | 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 7 0 0 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 8 0 1 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 9 0 2 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 10 0 3 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 11 0 4 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 12 0 5 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 13 0 6 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
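
Consuming these MIG instances from a workload looks the same as before. The node advertises them as nvidia.com/gpu, which is the GPU Operator’s default single MIG strategy, so the earlier Job runs unchanged and each pod lands on its own 1g.10gb instance. Only the limits block matters; it is shown below for reference, with the mixed-strategy variant as a comment.

resources:
  limits:
    nvidia.com/gpu: 1            # one 1g.10gb MIG instance under the single strategy
    # nvidia.com/mig-1g.10gb: 1  # what the request would look like with migStrategy: mixed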

Monitoring GPUs with Azure Managed Prometheus and Grafana

The GPU Operator comes with the DCGM Exporter, a Prometheus exporter that reports GPU health and metrics. There is a Grafana dashboard created by Nvidia here, but it’s not built for MIG. I’ve created a new Grafana dashboard that supports MIG, which you can find here.
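
Before wiring anything into Azure Managed Prometheus, a quick sanity check is to port-forward the exporter’s service and look for a familiar metric. The service name and port below are the GPU Operator defaults in my install; verify them with kubectl get svc -n gpu-operator if yours differ, and run the curl from a second terminal.

$ kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400
$ curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL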

GPU Monitoring with managed Prometheus and Grafana

Wrapping Up

If you’re inquisitive enough, you might have several questions at this point: “Are there options other than the two discussed in this article?” “What would be the best-fit scenario for each configuration?” “Which method is the most performant?” Luckily, the Nebuly AI team has some answers. Take a look at the comparison below and visit their GitHub repository for more details.

https://github.com/nebuly-ai/nos/tree/main/demos/gpu-sharing-comparison

I didn’t have enough time to test MPS, so I will save it for another article. For deciding between the different options, Nvidia provides a nice comparison in their blog: https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/

Deciding partition type

Stay tuned for the next series of articles around AI.

If you enjoyed my article, I’d appreciate a few claps or a follow. Get notified about new articles by subscribing, and let’s stay connected on LinkedIn. Thank you for your time and happy reading!

Jay Lee

Cloud Native Enthusiast. Java, Spring, Python, Golang, Kubernetes.