Introduction: Overcoming GPU Management Challenges
In Part 1 of this blog series, we explored the challenges of hosting large language models (LLMs) on CPU-based workloads within an EKS cluster. We discussed the inefficiencies associated with using CPUs for such tasks, mainly due to large model sizes and slower inference speeds. The introduction of GPU resources offered significant performance gains, but also brought the need for efficient management of these high-cost resources.
In this second part, we’ll dive deeper into how to optimize GPU usage for these workloads. We will cover the following key areas:
- NVIDIA device plugin settings: This section will explain the importance of the NVIDIA Device Plugin for Kubernetes and detail its role in resource discovery, allocation, and isolation.
- Time slicing: We will discuss how time slicing allows multiple processes to share GPU resources efficiently and ensures maximum utilization.
- Auto-scaling nodes with Karpenter: This section will describe how Karpenter dynamically controls node scaling based on real-time demand, optimizing resource utilization and reducing costs.
Challenges solved
- Efficient GPU management: Making full use of GPUs to justify their high cost.
- Concurrency management: Sharing GPU resources more efficiently among multiple workloads.
- Dynamic scaling: Adjusting the number of nodes automatically based on workload requirements.
Part 1: Introduction to NVIDIA Device Plugin
The NVIDIA Device Plugin for Kubernetes is a component that simplifies the management and use of NVIDIA GPUs in Kubernetes clusters. It enables Kubernetes to recognize and allocate GPU resources to pods, making GPU-accelerated workloads possible.
Why we need NVIDIA device plugin
- Resource discovery: Automatically detects NVIDIA GPU resources on each node.
- Resource allocation: Manages the distribution of GPU resources to pods based on their requests.
- Isolation: Ensures safe and efficient use of GPU resources between different pods. (A quick verification command is shown below.)
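Once the device plugin DaemonSet is running, each GPU node should advertise an nvidia.com/gpu resource in its capacity and allocatable fields. A minimal way to confirm this (the node name is a placeholder):
```
# Confirm that the device plugin is advertising GPU capacity on a node
kubectl describe node <gpu-node-name> | grep -i "nvidia.com/gpu"
```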
The NVIDIA device plugin simplifies GPU management in Kubernetes clusters. It automates the setup of the NVIDIA driver, the NVIDIA Container Toolkit, and CUDA, ensuring GPU resources are available for workloads without manual configuration.
- NVIDIA driver: Required for nvidia-smi and basic GPU operations; it interfaces with the GPU hardware. The screenshot below shows the output of the nvidia-smi command, including key information such as the driver version, CUDA version, and detailed GPU configuration, confirming that the GPU is properly configured and ready to use (an equivalent command-line query is shown after this list).
- NVIDIA Container Toolkit: Required to use GPUs with containers. Below we see the installed version of the container toolkit and the status of the service running on the instance:
```
# Installed version
rpm -qa | grep -i nvidia-container-toolkit
nvidia-container-toolkit-base-1.15.0-1.x86_64
nvidia-container-toolkit-1.15.0-1.x86_64
```
- CUDA: Required for GPU-accelerated applications and libraries. Below is the nvcc command output showing the version of CUDA installed on the system:
```
/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
```
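Since the nvidia-smi screenshot referenced above isn't reproduced here, the same key details can be queried directly on the instance; a small sketch (output will vary by driver version and GPU model):
```
# Query driver version, GPU model, and total memory from the node
nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv
```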
NVIDIA device plugin settings
To ensure that the DaemonSet runs exclusively on GPU-based instances, we label the node with the key "nvidia.com/gpu" and the value "true". This is achieved using node affinity, node selectors, and taints and tolerations.
Let’s now dive into each of these components in detail.
- Node Affinity: Node affinity allows pods to be scheduled on nodes based on node labels. With requiredDuringSchedulingIgnoredDuringExecution, the scheduler cannot place a pod unless the rule is met: here the key is "nvidia.com/gpu", the operator is "In", and the value is "true".
```
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-10de.present
              operator: In
              values:
                - "true"
        - matchExpressions:
            - key: feature.node.kubernetes.io/cpu-model.vendor_id
              operator: In
              values:
                - NVIDIA
        - matchExpressions:
            - key: nvidia.com/gpu
              operator: In
              values:
                - "true"
```
- Node Selector: The node selector is the simplest way to constrain node selection: nvidia.com/gpu: "true" (see the combined scheduling sketch after this list).
- Taints and tolerations: A toleration is added to the DaemonSet so it can be scheduled on tainted GPU nodes (nvidia.com/gpu=true:NoSchedule).
```
kubectl taint node ip-10-20-23-199.us-west-1.compute.internal nvidia.com/gpu=true:NoSchedule

kubectl describe node ip-10-20-23-199.us-west-1.compute.internal | grep -i taint
Taints: nvidia.com/gpu=true:NoSchedule
```
The corresponding toleration on the DaemonSet:
```
tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
```
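For reference, this is roughly how the node selector and toleration from the list above could sit together in a pod or DaemonSet spec; an illustrative fragment, not the plugin's exact manifest:
```
# Scheduling fields combining the label-based node selector and the GPU taint toleration
spec:
  nodeSelector:
    nvidia.com/gpu: "true"
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```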
After implementing node labeling, affinity, node selectors, and taints/tolerations, we can ensure that the DaemonSet runs exclusively on GPU-based instances. We can verify the deployment of the NVIDIA device plugin using the following command:
```
kubectl get ds -n kube-system
NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                     AGE
nvidia-device-plugin                      1         1         1       1            1           nvidia.com/gpu=true                               75d
nvidia-device-plugin-mps-control-daemon   0         0         0       0            0           nvidia.com/gpu=true,nvidia.com/mps.capable=true   75d
```
The problem is that GPUs are expensive, so it is necessary to ensure maximum GPU utilization. Let's explore GPU concurrency in more detail.
GPU Concurrency:
GPU concurrency refers to the ability to run multiple tasks or threads simultaneously on the GPU:
- Single Process: In a single process setup, only one application or container uses the GPU at a time. This approach is straightforward, but can lead to underutilization of GPU resources if the application does not fully load the GPU.
- Multi-Process Service (MPS): NVIDIA's Multi-Process Service (MPS) allows multiple CUDA applications to share a single GPU simultaneously, improving GPU utilization and reducing context-switching overhead (a configuration sketch follows this list).
- Time slicing: Time slicing divides GPU time between different processes; in other words, multiple processes take turns on the GPU (round-robin context switching).
- Multi-Instance GPU (MIG): MIG is a feature available on NVIDIA GPUs such as the A100 that allows a single GPU to be partitioned into multiple smaller, isolated instances, each behaving like a separate GPU with its own memory and compute resources.
- Virtualization: GPU virtualization allows a single physical GPU to be shared between multiple virtual machines (VMs) or containers, providing each with a virtual GPU.
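Regarding MPS: the DaemonSet output earlier already showed an nvidia-device-plugin-mps-control-daemon, and newer releases of the device plugin accept an MPS sharing section in the same config format used for time slicing later in this post. A hedged sketch; the exact schema depends on the plugin version:
```
# Possible MPS sharing config for the NVIDIA device plugin (schema assumed from newer plugin releases)
version: v1
sharing:
  mps:
    resources:
      - name: nvidia.com/gpu
        replicas: 3
```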
Part 2: Implementing Time Slicing for GPUs
Time slicing, in the context of NVIDIA GPUs and Kubernetes, refers to sharing a physical GPU between multiple containers or pods in a Kubernetes cluster. The technique involves dividing GPU processing time into smaller intervals and allocating those intervals to different containers or pods.
- Time-slice allocation: The GPU scheduler allocates time slices to each vGPU configured on the physical GPU.
- Preemption and context switching: When a vGPU's time slice ends, the GPU scheduler preempts its execution, saves its context, and switches to the context of the next vGPU.
- Context switching: The GPU scheduler ensures seamless context switching between vGPUs, minimizing overhead and ensuring efficient use of GPU resources.
- Task completion: Processes in containers complete their GPU-accelerated tasks within their allotted time slices.
- Resource management and monitoring: When tasks are completed, GPU resources are released back to Kubernetes for redistribution to other pods or containers.
Why we need time slicing
- Cost effectiveness: Ensures that expensive GPUs are not underutilized.
- Concurrency: Allows multiple applications to use the GPU at the same time.
Configuration example for Time Slicing
Let's apply the time-slicing configuration using the ConfigMap shown below. Here replicas: 3 specifies the number of replicas for the GPU resource, which means that one GPU resource can be shared as 3 instances.
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 3
```
We can verify the GPU resources available on the node using the following command:
```
kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | {name: .metadata.name, capacity: .status.capacity}'
{
  "name": "ip-10-20-23-199.us-west-1.compute.internal",
  "capacity": {
    "cpu": "4",
    "ephemeral-storage": "104845292Ki",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "16069060Ki",
    "nvidia.com/gpu": "3",
    "pods": "110"
  }
}
```
The output above shows that the node ip-10-20-23-199.us-west-1.compute.internal has 3 virtual GPUs available. We can request GPU resources in pod specifications by setting resource limits:
```
resources:
  limits:
    cpu: "1"
    memory: 2G
    nvidia.com/gpu: "1"
  requests:
    cpu: "1"
    memory: 2G
    nvidia.com/gpu: "1"
```
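Creating the ConfigMap alone is typically not enough; the device plugin has to be pointed at it. One common way, assuming the plugin was installed from NVIDIA's Helm chart and that the chart's config.name value is available (release and repo names here are illustrative):
```
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --set config.name=nvidia-device-plugin
```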
In our case, we can host 3 pods on the node ip-10-20-23-199.us-west-1.compute.internal, and because of time slicing, these 3 pods can use the 3 virtual GPUs as shown below.
The GPU is virtually shared between the pods, and below we can see the PIDs assigned to each of the processes.
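To reproduce this sharing behavior, a deployment along these lines could be used; a minimal sketch that reuses the TensorFlow GPU image referenced later in the NodeClass userData, with each replica requesting one of the three time-sliced GPUs:
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference              # hypothetical workload name
spec:
  replicas: 3                      # one pod per time-sliced GPU replica on the node
  selector:
    matchLabels:
      app: gpu-inference
  template:
    metadata:
      labels:
        app: gpu-inference
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: inference
          image: tensorflow/tensorflow:2.12.0-gpu
          # Print the visible GPU, then idle so the process stays listed in nvidia-smi
          command: ["python", "-c", "import tensorflow as tf, time; print(tf.config.list_physical_devices('GPU')); time.sleep(3600)"]
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
```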
Now that we have optimized GPU usage at the pod level, let's focus on optimizing GPU resources at the node level. We can achieve this using a cluster auto-scaling solution called Karpenter. This is especially important because learning labs may not always have constant user load or activity, and GPUs are extremely expensive. By leveraging Karpenter, we can dynamically scale GPU nodes up or down based on demand, ensuring cost efficiency and optimal resource utilization.
Part 3: Auto Scaling Nodes with Karpenter
Karpenter is an open-source node lifecycle management solution for Kubernetes. It automates the provisioning and decommissioning of nodes based on pod scheduling needs, enabling efficient scaling and cost optimization.
- Dynamic Node Provisioning: Automatically scales nodes based on demand.
- Optimizes resource utilization: Adapts node capacity to workload needs.
- Reduces operational costs: Minimizes unnecessary expenditure on resources.
- Improves cluster efficiency: Improves overall performance and responsiveness.
Why use Karpenter for dynamic scaling
- Dynamic scaling: Automatically adjusts the number of nodes based on workload requirements.
- Cost optimization: Ensures that resources are provided only when needed, reducing costs.
- Effective resource management: Karpenter tracks pods that cannot be scheduled due to lack of resources, evaluates their requirements, provisions nodes that satisfy them, schedules the pods, and decommissions the nodes when they are no longer needed.
Installing Karpenter:
```
# Install Karpenter using Helm:
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace "${KARPENTER_NAMESPACE}" \
  --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi

# Verify Karpenter installation:
kubectl get pod -n kube-system | grep -i karpenter
karpenter-7df6c54cc-rsv8s   1/1   Running   2 (10d ago)   53d
karpenter-7df6c54cc-zrl9n   1/1   Running   0             53d
```
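The Helm command above assumes a few environment variables are already exported; placeholder values shown below should be adjusted to your cluster and the Karpenter release you intend to run:
```
export KARPENTER_NAMESPACE="kube-system"
export KARPENTER_VERSION="<karpenter-version>"   # placeholder: the release you are deploying
export CLUSTER_NAME="nextgen-learninglab-eks"    # cluster name used elsewhere in this post
```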
Configuring Karpenter with NodePools and NodeClasses:
Karpenter can be configured with NodePools and NodeClasses to automate the provisioning and scaling of nodes based on the specific needs of your workloads
- Karpenter NodePool: A NodePool is a custom resource that defines a set of nodes with shared specifications and constraints in a Kubernetes cluster. Karpenter uses NodePools to dynamically manage and scale node resources based on the demands of running workloads.
```
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: g4-nodepool
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu: "true"
    spec:
      taints:
        - effect: NoSchedule
          key: nvidia.com/gpu
          value: "true"
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g4dn.xlarge"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: g4-nodeclass
  limits:
    cpu: 1000
  disruption:
    expireAfter: 120m
    consolidationPolicy: WhenUnderutilized
```
- NodeClasses are configurations that define the characteristics and parameters for the nodes that Karpenter can provision in a Kubernetes cluster. The NodeClass specifies basic infrastructure details for nodes, such as instance types, launch template configurations, and cloud-provider-specific settings.
Note: The userData section contains scripts to bootstrap EC2 instances, including downloading the TensorFlow GPU Docker image and configuring the instance to connect to the Kubernetes cluster.
```
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: g4-nodeclass
spec:
  amiFamily: AL2
  launchTemplate:
    name: "ack_nodegroup_template_new"
    version: "7"
  role: "KarpenterNodeRole"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "nextgen-learninglab"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "nextgen-learninglab"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 10000
        encrypted: true
        deleteOnTermination: true
        throughput: 125
  tags:
    Name: Learninglab-Staging-Auto-GPU-Node
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="//"

    --//
    Content-Type: text/x-shellscript; charset="us-ascii"
    set -ex
    sudo ctr -n=k8s.io image pull docker.io/tensorflow/tensorflow:2.12.0-gpu

    --//
    Content-Type: text/x-shellscript; charset="us-ascii"
    B64_CLUSTER_CA=" "
    API_SERVER_URL=""
    /etc/eks/bootstrap.sh nextgen-learninglab-eks --kubelet-extra-args '--node-labels=eks.amazonaws.com/capacityType=ON_DEMAND --pod-max-pids=32768 --max-pods=110' --b64-cluster-ca $B64_CLUSTER_CA --apiserver-endpoint $API_SERVER_URL --use-max-pods false

    --//
    Content-Type: text/x-shellscript; charset="us-ascii"
    KUBELET_CONFIG=/etc/kubernetes/kubelet/kubelet-config.json
    echo "$(jq ".podPidsLimit=32768" $KUBELET_CONFIG)" > $KUBELET_CONFIG

    --//
    Content-Type: text/x-shellscript; charset="us-ascii"
    systemctl stop kubelet
    systemctl daemon-reload
    systemctl start kubelet
    --//--
```
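With the NodePool and EC2NodeClass defined, they can be applied and inspected like any other custom resource; the file names here are illustrative:
```
kubectl apply -f g4-nodepool.yaml -f g4-nodeclass.yaml
kubectl get nodepools
kubectl get ec2nodeclasses
```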
In this scenario, each node (e.g., ip-10-20-23-199.us-west-1.compute.internal) holds up to three pods. If the deployment is scaled to add another pod, resources will be insufficient, causing the new pod to remain pending.
Karpenter monitors these unschedulable pods and evaluates their resource requirements to act accordingly. A NodeClaim is created against the NodePool, and Karpenter then provisions a node to satisfy it.
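While the scale-up is in progress, this behavior can be observed from the pending pods and the NodeClaims Karpenter creates; a minimal sketch:
```
# Pods waiting for capacity
kubectl get pods --field-selector=status.phase=Pending

# NodeClaims created by Karpenter and the GPU nodes they resolve to
kubectl get nodeclaims
kubectl get nodes -l nvidia.com/gpu=true
```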
Conclusion: Effective management of GPU resources in Kubernetes
With the increasing demand for GPU-accelerated workloads in Kubernetes, efficient management of GPU resources is critical. The combination of the NVIDIA device plugin, time slicing, and Karpenter provides a powerful approach to managing, optimizing, and scaling GPU resources in a Kubernetes cluster, delivering high performance with efficient resource utilization. This solution was implemented to host GPU-enabled pilot learning labs on developer.cisco.com/learning, providing a hands-on learning experience with GPUs.