Optimizing AI workloads with NVIDIA GPUs, Time Slicing and Karpenter (Part 2)

Introduction: Overcoming GPU Management Challenges

In Part 1 of this blog series, we explored the challenges of hosting large language models (LLMs) on CPU-based workloads within an EKS cluster. We discussed the inefficiencies of using CPUs for such tasks, mainly due to large model sizes and slower inference speeds. The introduction of GPU resources offered significant performance gains, but also brought the need for efficient management of these high-cost resources.

In this second part, we’ll dive deeper into how to optimize GPU usage for these workloads. We will cover the following key areas:

  • NVIDIA device plugin settings: This section will explain the importance of the NVIDIA Device Plugin for Kubernetes and detail its role in resource discovery, allocation, and isolation.
  • Time slicing: We will discuss how time slicing allows multiple processes to share GPU resources efficiently and ensures maximum utilization.
  • Auto-scaling nodes with Karpenter: This section will describe how Karpenter dynamically controls node scaling based on real-time demand, optimizing resource utilization and reducing costs.

Challenges solved

  1. Efficient GPU management: Making full use of GPUs to justify their high cost.
  2. Concurrency management: Enabling more efficient sharing of GPU resources among multiple tasks.
  3. Dynamic scaling: Automatically adjusting the number of nodes based on workload requirements.

Part 1: Introduction to NVIDIA Device Plugin

The NVIDIA Device Plugin for Kubernetes is a component that simplifies the management and use of NVIDIA GPUs in Kubernetes clusters. It enables Kubernetes to discover GPU resources and allocate them to pods, making GPU-accelerated workloads possible.

Why we need the NVIDIA device plugin

  • Resource discovery: Automatically detects NVIDIA GPU resources on each node.
  • Resource allocation: Manages the distribution of GPU resources to pods based on their requests.
  • Isolation: Ensures safe and efficient use of GPU resources between different pods.

The NVIDIA device plugin simplifies GPU management in Kubernetes clusters. It works together with the NVIDIA driver, the NVIDIA Container Toolkit, and CUDA to make GPU resources available to workloads without additional manual setup.

  • NVIDIA driver: Required for nvidia-smi and basic GPU operations; it interfaces with the GPU hardware. The output of the nvidia-smi command shows key information such as the driver version, CUDA version, and detailed GPU configuration, confirming that the GPU is properly configured and ready to use.

  • NVIDIA Container Toolkit: Required for using GPUs with containers. Below is the installed version of the container toolkit on the instance:
#Installed Version 
rpm -qa | grep -i nvidia-container-toolkit 
nvidia-container-toolkit-base-1.15.0-1.x86_64 
nvidia-container-toolkit-1.15.0-1.x86_64 
  • CUDA: Required for GPU-accelerated applications and libraries. Below is the nvcc command output showing the version of CUDA installed on the system:
/usr/local/cuda/bin/nvcc --version 
nvcc: NVIDIA (R) Cuda compiler driver 
Copyright (c) 2005-2023 NVIDIA Corporation 
Built on Tue_Aug_15_22:02:13_PDT_2023 
Cuda compilation tools, release 12.2, V12.2.140 
Build cuda_12.2.r12.2/compiler.33191640_0 
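
With the driver, container toolkit, and CUDA in place, the device plugin itself can be deployed. A minimal sketch using the plugin's public Helm chart; the release name nvdp and the chart version are assumptions, so pick the version that matches your cluster:

#Add the NVIDIA device plugin Helm repository and install the chart (version shown is an example)
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --version 0.15.0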

NVIDIA device plugin settings

To ensure that the device plugin DaemonSet runs exclusively on GPU-based instances, we label the node with the key "nvidia.com/gpu" and the value "true". This is achieved using node affinity, a node selector, and taints and tolerations.
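
For example, the GPU node used throughout this post can be labeled as follows:

kubectl label node ip-10-20-23-199.us-west-1.compute.internal nvidia.com/gpu=true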

Let’s now dive into each of these components in detail.

  • Node Affinity: Node affinity allows pods to be scheduled on nodes based on node labels. With requiredDuringSchedulingIgnoredDuringExecution, the scheduler cannot place a pod unless the rule is met: here the key is "nvidia.com/gpu", the operator is "In", and the value is "true".
affinity: 
    nodeAffinity: 
        requiredDuringSchedulingIgnoredDuringExecution: 
            nodeSelectorTerms: 
                - matchExpressions: 
                    - key: feature.node.kubernetes.io/pci-10de.present 
                      operator: In 
                      values: 
                        - "true" 
                - matchExpressions: 
                    - key: feature.node.kubernetes.io/cpu-model.vendor_id 
                      operator: In 
                      values: 
                        - NVIDIA 
                - matchExpressions: 
                    - key: nvidia.com/gpu 
                      operator: In 
                      values: 
                        - "true" 
  • Node Selector: The node selector is the simplest form of node selection constraint; here we use nvidia.com/gpu: "true". See the sketch below.
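A minimal sketch of the corresponding nodeSelector entry in the DaemonSet pod spec:
nodeSelector:
  nvidia.com/gpu: "true"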
  • Taints and tolerations: A toleration is added to the DaemonSet so that it can be scheduled on the tainted GPU nodes (nvidia.com/gpu=true:NoSchedule).
kubectl taint node ip-10-20-23-199.us-west-1.compute.internal nvidia.com/gpu=true:NoSchedule 
kubectl describe node ip-10-20-23-199.us-west-1.compute.internal | grep -i taint 
Taints: nvidia.com/gpu=true:NoSchedule 

tolerations: 
  - effect: NoSchedule 
    key: nvidia.com/gpu 
    operator: Exists 

After implementing node labeling, affinity, node selectors, and taints/tolerations, we can ensure that the DaemonSet runs exclusively on GPU-based instances. We can verify the deployment of the NVIDIA device plugin using the following command:

kubectl get ds -n kube-system 
NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE  NODE SELECTOR                                     AGE 

nvidia-device-plugin                      1         1         1       1            1          nvidia.com/gpu=true                               75d 
nvidia-device-plugin-mps-control-daemon   0         0         0       0            0          nvidia.com/gpu=true,nvidia.com/mps.capable=true   75d 

However, GPUs are expensive, so it is necessary to ensure maximum utilization. With that in mind, let's explore GPU concurrency.

GPU Concurrency:

GPU concurrency refers to the ability to run multiple tasks or threads simultaneously on a GPU. The main approaches are:

  • Single Process: In a single process setup, only one application or container uses the GPU at a time. This approach is straightforward, but can lead to underutilization of GPU resources if the application does not fully load the GPU.
  • Multi-Process Service (MPS): NVIDIA’s Multi-Process Service (MPS) allows multiple CUDA applications to share a single GPU simultaneously, improving GPU utilization and reducing context switching overhead.
  • Time slicing: Time slicing divides GPU time between different processes; in other words, multiple processes take turns on the GPU (round-robin context switching).
  • Multi-Instance GPU (MIG): MIG is a feature available on NVIDIA A100 GPUs that allows a single GPU to be partitioned into multiple smaller, isolated instances, each of which behaves like a separate GPU.
  • Virtualization: GPU virtualization allows a single physical GPU to be shared between multiple virtual machines (VMs) or containers, providing each with a virtual GPU.

Part 2: Implementing Time Slicing for GPUs

Time slicing, in the context of NVIDIA GPUs and Kubernetes, refers to sharing a physical GPU between multiple containers or pods in a Kubernetes cluster. The technique involves dividing GPU processing time into smaller intervals and allocating those intervals to different containers or pods.

  • Slot allocation: The GPU scheduler allocates slots to each vGPU configured on a physical GPU.
  • Preemption and context switching: At the end of a vGPU time slot, the GPU scheduler preempts its execution, saves its context, and switches to the context of the next vGPU.
  • Context switching: The GPU scheduler ensures seamless context switching between vGPUs, minimizing overhead and ensuring efficient use of GPU resources.
  • Task Completion: Processes in containers complete their GPU-accelerated tasks within their allotted time slots.
  • Resource management and monitoring
  • Release resources: When tasks are completed, GPU resources are released back to Kubernetes for redistribution to other pods or containers

Why we need time slicing

  • Cost effectiveness: Ensures that expensive GPUs are not underutilized.
  • Concurrency: Allows multiple applications to use the GPU concurrently.

Configuration example for Time Slicing

Let's set up the time-slicing configuration using a ConfigMap as shown below. Here replicas: 3 specifies the number of replicas for the GPU resource, which means that one GPU resource can be shared by up to 3 instances.

apiVersion: v1 
kind: ConfigMap 
metadata: 
  name: nvidia-device-plugin 
  namespace: kube-system 
data: 
  any: |- 
    version: v1 
    flags: 
      migStrategy: none 
    sharing: 
      timeSlicing: 
        resources: 
        - name: nvidia.com/gpu 
          replicas: 3 
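#A sketch of applying this configuration, assuming the device plugin was installed with the Helm chart above;
#the config.name value and the local filename are assumptions, so adjust them to your installation method.
kubectl create -f nvidia-device-plugin-configmap.yaml   #the ConfigMap above, saved locally (hypothetical filename)
helm upgrade nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --reuse-values \
  --set config.name=nvidia-device-plugin   #point the plugin at the time-slicing ConfigMap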
#We can verify the GPU resources available on the nodes using the following command:
kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null)
| {name: .metadata.name, capacity: .status.capacity}'

  "name": "ip-10-20-23-199.us-west-1.compute.internal", 
  "capacity": { 
    "cpu": "4", 
    "ephemeral-storage": "104845292Ki", 
    "hugepages-1Gi": "0", 
    "hugepages-2Mi": "0", 
    "memory": "16069060Ki", 
    "nvidia.com/gpu": "3", 
    "pods": "110" 
  } 

#The above output shows that the node ip-10-20-23-199.us-west-1.compute.internal has 3 virtual GPUs available.
#We can request GPU resources in the pod specification by setting resource limits:
resources: 
      limits: 
        cpu: "1" 
        memory: 2G 
        nvidia.com/gpu: "1" 
      requests: 
        cpu: "1" 
        memory: 2G 
        nvidia.com/gpu: "1" 

In our case, we can host 3 pods on the single node ip-10-20-23-199.us-west-1.compute.internal, and thanks to time slicing these 3 pods share its GPU as 3 virtual GPUs, as shown in the sketch below.
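
To illustrate, here is a minimal Deployment sketch that schedules three such pods onto the time-sliced GPU node. The name gpu-inference and the container command are hypothetical placeholders; the image is the TensorFlow GPU image pre-pulled later in the node bootstrap:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference
spec:
  replicas: 3                      # three pods sharing one physical GPU via time slicing
  selector:
    matchLabels:
      app: gpu-inference
  template:
    metadata:
      labels:
        app: gpu-inference
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"     # land only on labeled GPU nodes
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists         # tolerate the GPU node taint
      containers:
        - name: inference
          image: tensorflow/tensorflow:2.12.0-gpu
          command: ["python", "-c", "import tensorflow as tf, time; print(tf.config.list_physical_devices('GPU')); time.sleep(10**9)"]
          resources:
            limits:
              cpu: "1"
              memory: 2G
              nvidia.com/gpu: "1"  # one time-sliced (virtual) GPU per pod
            requests:
              cpu: "1"
              memory: 2G
              nvidia.com/gpu: "1"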

The GPU is virtually shared between the pods, and the PIDs assigned to each of the processes can be seen in the GPU's process list (for example, in the nvidia-smi output on the node).

Now that we have optimized GPU usage at the pod level, let's focus on optimizing GPU resources at the node level. We can achieve this using a cluster autoscaling solution called Karpenter. This is especially important because learning labs may not always have constant user load or activity, and GPUs are extremely expensive. By leveraging Karpenter, we can dynamically scale GPU nodes up or down based on demand, ensuring cost efficiency and optimal resource utilization.

Part 3: Auto Scaling Nodes with Karpenter

Karpenter is an open-source node lifecycle management solution for Kubernetes. It automates the provisioning and decommissioning of nodes based on pod scheduling needs, enabling efficient scaling and cost optimization.

  • Dynamic Node Provisioning: Automatically scales nodes based on demand.
  • Optimizes resource utilization: Adapts node capacity to workload needs.
  • Reduces operational costs: Minimizes unnecessary expenditure on resources.
  • Improves cluster efficiency: Improves overall performance and responsiveness.

Why use Karpenter for dynamic scaling

  • Dynamic scaling: Automatically adjusts the number of nodes based on workload requirements.
  • Cost optimization: Ensures that resources are provided only when needed, reducing costs.
  • Effective resource management: Karpenter tracks pods that cannot be scheduled due to a lack of resources, evaluates their requirements, provisions nodes that satisfy them, schedules the pods, and decommissions the nodes when they are no longer needed.

Installing Karpenter:
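
The install command below assumes a few environment variables. A minimal sketch with example values (the version is a placeholder, and the cluster name matches the one used in the EC2NodeClass userData later in this post):

export KARPENTER_NAMESPACE="kube-system"       #namespace used in the verification step below
export KARPENTER_VERSION="0.37.0"              #example version; use the release you intend to run
export CLUSTER_NAME="nextgen-learninglab-eks"  #EKS cluster name referenced in the bootstrap script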

 #Install Karpenter using Helm:
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version "${KARPENTER_VERSION}" \
  --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi

#Verify Karpenter Installation: 
kubectl get pod -n kube-system | grep -i karpenter 
karpenter-7df6c54cc-rsv8s             1/1     Running   2 (10d ago)   53d 
karpenter-7df6c54cc-zrl9n             1/1     Running   0             53d 

Configuring Karpenter with NodePools and NodeClasses:

Karpenter can be configured with NodePools and NodeClasses to automate the provisioning and scaling of nodes based on the specific needs of your workloads.

  • Karpenter NodePool: A NodePool is a custom resource that defines a set of nodes with shared specifications and constraints in a Kubernetes cluster. Karpenter uses NodePools to dynamically manage and scale node resources based on the demands of running workloads.
apiVersion: karpenter.sh/v1beta1 
kind: NodePool 
metadata: 
  name: g4-nodepool 
spec: 
  template: 
    metadata: 
      labels: 
        nvidia.com/gpu: "true" 
    spec: 
      taints: 
        - effect: NoSchedule 
          key: nvidia.com/gpu 
          value: "true" 
      requirements: 
        - key: kubernetes.io/arch 
          operator: In 
          values: ["amd64"] 
        - key: kubernetes.io/os 
          operator: In 
          values: ["linux"] 
        - key: karpenter.sh/capacity-type 
          operator: In 
          values: ["on-demand"] 
        - key: node.kubernetes.io/instance-type 
          operator: In 
          values: ["g4dn.xlarge"] 
      nodeClassRef: 
        apiVersion: karpenter.k8s.aws/v1beta1 
        kind: EC2NodeClass 
        name: g4-nodeclass 
  limits: 
    cpu: 1000 
  disruption: 
    expireAfter: 120m 
    consolidationPolicy: WhenUnderutilized 
  • Karpenter NodeClass: NodeClasses are configurations that define the characteristics and parameters of the nodes that Karpenter can provision in a Kubernetes cluster. The NodeClass specifies basic infrastructure details for nodes, such as instance types, launch template configurations, and cloud-provider-specific settings.

Note: The userData section contains scripts to bootstrap EC2 instances, including downloading the TensorFlow GPU Docker image and configuring the instance to connect to the Kubernetes cluster.

apiVersion: karpenter.k8s.aws/v1beta1 
kind: EC2NodeClass 
metadata: 
  name: g4-nodeclass 
spec: 
  amiFamily: AL2 
  launchTemplate: 
    name: "ack_nodegroup_template_new" 
    version: "7"  
  role: "KarpenterNodeRole" 
  subnetSelectorTerms: 
    - tags: 
        karpenter.sh/discovery: "nextgen-learninglab" 
  securityGroupSelectorTerms: 
    - tags: 
        karpenter.sh/discovery: "nextgen-learninglab"     
  blockDeviceMappings: 
    - deviceName: /dev/xvda 
      ebs: 
        volumeSize: 100Gi 
        volumeType: gp3 
        iops: 10000 
        encrypted: true 
        deleteOnTermination: true 
        throughput: 125 
  tags: 
    Name: Learninglab-Staging-Auto-GPU-Node 
  userData: | 
        MIME-Version: 1.0 
        Content-Type: multipart/mixed; boundary="//" 
        --// 
        Content-Type: text/x-shellscript; charset="us-ascii" 
        set -ex 
        sudo ctr -n=k8s.io image pull docker.io/tensorflow/tensorflow:2.12.0-gpu 
        --// 
        Content-Type: text/x-shellscript; charset="us-ascii" 
        B64_CLUSTER_CA=" " 
        API_SERVER_URL="" 
        /etc/eks/bootstrap.sh nextgen-learninglab-eks --kubelet-extra-args '--node-labels=eks.amazonaws.com/capacityType=ON_DEMAND --pod-max-pids=32768 --max-pods=110' \
          --b64-cluster-ca $B64_CLUSTER_CA --apiserver-endpoint $API_SERVER_URL --use-max-pods false
        --//
        Content-Type: text/x-shellscript; charset="us-ascii" 
        KUBELET_CONFIG=/etc/kubernetes/kubelet/kubelet-config.json 
        echo "$(jq ".podPidsLimit=32768" $KUBELET_CONFIG)" > $KUBELET_CONFIG 
        --// 
        Content-Type: text/x-shellscript; charset="us-ascii" 
        systemctl stop kubelet 
        systemctl daemon-reload 
        systemctl start kubelet
        --//--
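
Assuming the NodePool and EC2NodeClass manifests above are saved locally (the filenames below are hypothetical), they can be applied and verified as follows:

kubectl apply -f g4-nodepool.yaml
kubectl apply -f g4-nodeclass.yaml
kubectl get nodepools,ec2nodeclasses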

In this scenario, each node (e.g. ip-10-20-23-199.us-west-1.compute.internal) holds up to three pods. If the deployment is scaled up by another pod, its resource requests cannot be satisfied and the new pod remains Pending.

Karpenter monitors these unschedulable pods and evaluates their resource requirements to act accordingly. A NodeClaim is created against the NodePool, and Karpenter provisions a new node to satisfy the request, as sketched below.
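
A sketch of how this can be observed, assuming the hypothetical gpu-inference Deployment from the time-slicing example:

kubectl scale deployment gpu-inference --replicas=4   #the fourth pod cannot fit on the existing node
kubectl get pods | grep -i pending                    #the new pod stays Pending until capacity arrives
kubectl get nodeclaims                                #Karpenter creates a NodeClaim for the pending pod
kubectl get nodes -l nvidia.com/gpu=true              #a new GPU node joins the cluster shortly afterwards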

Conclusion: Effective management of GPU resources in Kubernetes

With the increasing demand for GPU-accelerated workloads in Kubernetes, efficient management of GPU resources is critical. Combining the NVIDIA device plugin, time slicing, and Karpenter provides a powerful approach to managing, optimizing, and scaling GPU resources in a Kubernetes cluster, delivering high performance with efficient resource utilization. This solution was implemented to host GPU-enabled pilot learning labs on developer.cisco.com/learning, providing a GPU-backed learning experience.
