Maximize accelerator utilization for model development with the new Amazon SageMaker HyperPod task governance | Amazon Web Services


Today, we’re announcing the general availability of Amazon SageMaker HyperPod task governance, a new capability for centrally managing and maximizing the utilization of accelerated compute resources, such as GPUs and AWS Trainium, across generative AI model development tasks such as training, fine-tuning, and inference.

Customers tell us that they are rapidly increasing investment in generative AI projects, but they face challenges in efficiently allocating limited compute resources. The lack of dynamic, centralized governance for resource allocation leads to inefficiencies, with some projects underutilizing resources while others stall. This situation burdens administrators with constant replanning, causes delays for data scientists and developers, and results in delayed delivery of AI innovations and cost overruns due to inefficient use of resources.

With SageMaker HyperPod task governance, you can accelerate time to market for AI innovations while avoiding cost overruns due to underutilized compute resources. In a few steps, administrators can set up quotas that govern compute resource allocation based on project budgets and task priorities. Data scientists or developers can create tasks, such as model training, fine-tuning, or evaluation, which SageMaker HyperPod automatically schedules and executes within the allocated quotas.

SageMaker HyperPod task governance manages resources, automatically freeing up compute from lower-priority tasks when high-priority tasks need immediate attention. It does this by pausing low-priority training tasks, saving checkpoints, and resuming them later when resources become available. Additionally, idle compute within a team’s quota can be automatically used to accelerate another team’s waiting tasks.
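The pause-checkpoint-resume behavior described above can be sketched in a few lines. This is a hypothetical simulation of the scheduling idea, not the HyperPod implementation; the `Task` class and `run` function are made up for illustration:

```python
# Hypothetical sketch of priority preemption with checkpointing: a running
# task yields when a higher-priority task is queued, saving its progress
# so it can resume later from the same step.

class Task:
    def __init__(self, name, priority, total_steps):
        self.name = name
        self.priority = priority      # higher number = higher priority
        self.step = 0                 # resumes from the last checkpoint
        self.total_steps = total_steps

def run(task, budget, queue):
    """Run `task` for up to `budget` steps, suspending it if a
    higher-priority task is waiting in `queue`."""
    while task.step < task.total_steps and budget > 0:
        if any(t.priority > task.priority for t in queue):
            checkpoint = task.step    # save progress before suspending
            return ("preempted", checkpoint)
        task.step += 1
        budget -= 1
    state = "done" if task.step == task.total_steps else "paused"
    return (state, task.step)
```

Because `Task.step` persists across calls, a preempted task later resumes from its checkpoint rather than starting over, which is the key to reclaiming compute without losing training progress.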

Data scientists and developers can continuously monitor their task queues, view backlogs, and adjust priorities as needed. Administrators can also monitor and audit scheduled tasks and compute resource usage across teams and projects, and as a result adjust allocations to optimize costs and improve resource availability across the organization. This approach promotes timely completion of critical projects while maximizing resource efficiency.

Getting started with SageMaker HyperPod task governance
Task governance is available for Amazon EKS clusters in HyperPod. You can find Cluster Management under HyperPod Clusters in the Amazon SageMaker AI console to provision and manage clusters. As an administrator, you can streamline the operation and scaling of HyperPod clusters through this console.

When you choose a HyperPod cluster, you can see new Dashboard, Tasks, and Policies tabs on the cluster details page.

1. New dashboard
In the new dashboard, you can see an overview of cluster utilization, team metrics, and task-based metrics.

First, you can view point-in-time and trend-based metrics for critical compute resources, including GPU, vCPU, and memory utilization, across all instance groups.

You can also get a comprehensive overview of team-specific resource management, focusing on GPU utilization versus compute allocation across teams. You can use customizable filters for teams and cluster instance groups to analyze metrics such as allocated GPUs/CPUs for tasks, borrowed GPUs/CPUs, and GPU/CPU utilization.

You can also assess task performance and resource allocation efficiency using metrics such as counts of running, pending, and preempted tasks, as well as average task runtime and wait time. To get comprehensive observability into your SageMaker HyperPod cluster resources and software components, you can integrate with Amazon CloudWatch Container Insights or Amazon Managed Grafana.

2. Create and manage a cluster policy
To enable task prioritization and fair-share resource allocation, you can configure a cluster policy that prioritizes critical tasks and distributes idle compute among teams defined in compute allocation.

To configure task prioritization and fair-share allocation of borrowed compute in your cluster settings, choose Edit in the Cluster policy section.

You can define how queued tasks are admitted for task prioritization: First-come-first-serve by default, or Task ranking. When you choose task ranking, queued tasks are admitted in the priority order defined in this cluster policy. Tasks of the same priority class are executed on a first-come, first-served basis.
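The admission order described above, priority classes first, then first-come-first-served within a class, can be sketched as a small priority queue. This is an illustration of the assumed semantics, not HyperPod’s actual scheduler:

```python
import heapq

def admission_order(tasks):
    """tasks: list of (name, priority) tuples in arrival order,
    where a larger priority number means more important.
    Returns task names in the order they would be admitted."""
    heap = []
    for arrival, (name, priority) in enumerate(tasks):
        # Negate priority so the highest priority pops first;
        # the arrival index breaks ties first-come, first-served.
        heapq.heappush(heap, (-priority, arrival, name))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

For example, `admission_order([("a", 1), ("b", 3), ("c", 3), ("d", 2)])` admits `b` and `c` first (same class, arrival order preserved), then `d`, then `a`.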

You can also configure how idle compute is allocated across teams: First-come-first-serve by default, or Fair-share. The fair-share setting allows teams to borrow idle compute based on their assigned weights, which are configured in relative compute allocations. This enables each team to get a fair share of idle compute to accelerate their waiting tasks.
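A weight-proportional split of idle compute can be sketched as follows. The team names and weights are made up for illustration, and the rounding rule (leftover units going to the highest-weight teams) is an assumption, not documented HyperPod behavior:

```python
def fair_share(idle_gpus, weights):
    """Divide idle_gpus among teams in proportion to their weights.
    weights: dict mapping team name -> relative fair-share weight.
    Returns whole-GPU shares, giving remainders to higher-weight teams."""
    total = sum(weights.values())
    shares = {team: (idle_gpus * w) // total for team, w in weights.items()}
    leftover = idle_gpus - sum(shares.values())
    # Hand out any remainder one GPU at a time, highest weight first.
    for team in sorted(weights, key=weights.get, reverse=True):
        if leftover == 0:
            break
        shares[team] += 1
        leftover -= 1
    return shares
```

With 10 idle GPUs and weights `{"research": 3, "prod": 1, "eval": 1}`, research receives 6 GPUs and the other two teams receive 2 each.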

In the Compute allocation section of the Policies page, you can create and edit compute allocations to distribute compute resources among teams, enable settings that let teams lend and borrow idle compute, configure preemption of their own low-priority tasks, and assign fair-share weights to teams.

In the Team section, you set a team name, and a corresponding Kubernetes namespace will be created for your data science and machine learning (ML) teams. You can set a fair-share weight to distribute unused capacity more equitably among your teams and enable priority-based preemption, allowing higher-priority tasks to preempt lower-priority ones.

In the Compute section, you can add and assign instance type quotas to teams. Additionally, you can allocate quotas for instance types not yet available in the cluster, allowing for future expansion.

You can enable teams to share idle compute resources by allowing them to lend their unused capacity to other teams. This lend-and-borrow model is reciprocal: teams can borrow idle compute only if they are also willing to lend their own idle resources to others. You can also specify a borrowing limit that caps how far a team can go over its allocated quota.
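The borrowing limit can be sketched as a simple cap on over-quota usage. Treating the limit as a percentage over the team’s quota is an assumption made for this illustration:

```python
def max_borrowable(quota, borrow_limit_pct, idle_available):
    """How many extra instances a team can borrow right now.
    quota: the team's allocated instance count.
    borrow_limit_pct: assumed here to be a percentage over quota.
    idle_available: idle instances currently lendable by other teams."""
    cap = (quota * borrow_limit_pct) // 100  # extra capacity allowed over quota
    return min(cap, idle_available)
```

For example, a team with a quota of 10 instances and a 50 percent borrowing limit can borrow at most 5 instances, and fewer if less idle capacity is available.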

3. Run your training task on the SageMaker HyperPod cluster
As a data scientist, you can submit a training task that uses your team’s quota with the HyperPod command line interface (CLI). With the HyperPod CLI, you can start a task and specify the corresponding namespace that has the allocation.

$ hyperpod start-job --name smpv2-llama2 --namespace hyperpod-ns-ml-engineers
Successfully created job smpv2-llama2
$ hyperpod list-jobs --all-namespaces
{
 "jobs": [
  {
   "Name": "smpv2-llama2",
   "Namespace": "hyperpod-ns-ml-engineers",
   "CreationTime": "2024-09-26T07:13:06Z",
   "State": "Running",
   "Priority": "fine-tuning-priority"
  },
  ...
 ]
}
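If you want to post-process the CLI output, the JSON shown above is straightforward to filter. In this hedged example the payload is pasted as a literal; in practice you would capture the stdout of `hyperpod list-jobs`:

```python
import json

# Sample payload matching the `hyperpod list-jobs` output shown above.
output = """
{
 "jobs": [
  {"Name": "smpv2-llama2",
   "Namespace": "hyperpod-ns-ml-engineers",
   "CreationTime": "2024-09-26T07:13:06Z",
   "State": "Running",
   "Priority": "fine-tuning-priority"}
 ]
}
"""

# Collect the names of all currently running jobs.
running = [job["Name"] for job in json.loads(output)["jobs"]
           if job["State"] == "Running"]
print(running)  # ['smpv2-llama2']
```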

In the Tasks tab, you can see all tasks in your cluster. Each task has a different priority and capacity requirement according to its policy. If you run another task with a higher priority, the currently running task will be suspended so that the higher-priority task can start first.

Now, let’s look at a sample video showing what happens when a high-priority training task is added while a low-priority task is running.

To learn more, visit SageMaker HyperPod task governance in the Amazon SageMaker AI Developer Guide.

Now available
Amazon SageMaker HyperPod task governance is now available in the US East (N. Virginia), US East (Ohio), and US West (Oregon) AWS Regions. You can use HyperPod task governance at no additional cost. To learn more, visit the SageMaker HyperPod product page.

Give HyperPod task governance a try in the Amazon SageMaker AI console and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.

Channy

P.S. Special thanks to Nisha Nadkarni, Lead Architect of Generative AI Solutions at AWS, for her contribution to HyperPod testing.
