Manage Resource Utilization and Allocation with ClearML

June 27, 2024

Written by Noam Wasersprung, Head of Product at ClearML

Last month we released the Resource Allocation & Policy Management Center to help teams visualize their compute infrastructure and understand which users have access to what resources. This new feature makes it easy for administrators to visualize their resource policies and enable workload prioritization across available resources.

The new Resource Allocation & Policy Management Center offers many benefits, such as improved governance through greater visibility and transparency, and the ability to load balance and ensure higher utilization of resources. AI teams now have an easy interface for ensuring that the most important projects get the highest priority and the fastest resources.

This blog post provides an overview of our Resource Allocation & Policy Management Center for organizations interested in improving visibility and control over their AI infrastructure.

ClearML’s Resource Allocation & Policy Management Center

Let’s start with definitions and how we see the world:

Resource Pools

A resource pool represents a group of resources with similar characteristics that should be made available for job execution; e.g., an H100 superpod with 128 GPUs or a cloud autoscaler that launches 8 GPU cloud instances.

Administrators define the number of available resources in each resource pool, so the ClearML policy manager can ensure jobs will not consume resources beyond the defined limit.
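For illustration, here is a minimal sketch in Python of what a resource pool boils down to conceptually: a named group of similar resources with an administrator-defined cap that the policy manager must never exceed. This is not the ClearML API; the class and field names are made up.

```python
# Illustrative sketch only, not the ClearML API.
from dataclasses import dataclass

@dataclass
class ResourcePool:
    name: str            # e.g. "h100-superpod" or "cloud-autoscaler"
    total_gpus: int      # administrator-defined capacity of the pool
    used_gpus: int = 0   # GPUs currently consumed by running jobs

    def can_fit(self, gpus: int) -> bool:
        """True if a job needing `gpus` fits without exceeding the pool's limit."""
        return self.used_gpus + gpus <= self.total_gpus

# Example: an on-prem superpod and a smaller cloud autoscaler pool
pools = [ResourcePool("h100-superpod", 128), ResourcePool("cloud-autoscaler", 8)]
```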

Resource Profiles

Resource profiles represent the resource consumption requirements of jobs; e.g., jobs requiring 4 GPUs.

Administrators create resource profiles as the interface through which resource policies grant users access to the available resource pools, based on their jobs’ resource requirements. Administrators control the execution priority within a pool across the resource profiles that use it; e.g., if both a 4 GPU job and an 8 GPU job are pending, have the autoscaler serve the larger requirement first (or vice versa).

Administrators control the resource pool allocation precedence within a profile; e.g., only run jobs on a cloud autoscaler if the local H100 cannot currently satisfy the profile’s resource requirements.

Administrators control the queuing priority within a profile across the resource policies making use of it; e.g., if the R&D and DevOps teams both have pending jobs, run the R&D team’s jobs first (or vice versa).
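Conceptually, then, a resource profile ties a per-job requirement to an ordered list of pools and a priority among the profiles that share them. The sketch below is illustrative Python only, not the ClearML API; the profile and pool names are invented.

```python
# Illustrative sketch only, not the ClearML API.
from dataclasses import dataclass, field

@dataclass
class ResourceProfile:
    name: str                  # e.g. "4-gpu-jobs"
    gpus_per_job: int          # resource consumption required by each job
    pool_precedence: list = field(default_factory=list)  # pools in allocation order

# Prefer the local H100 pool; fall back to the cloud autoscaler only when needed.
profile_4gpu = ResourceProfile("4-gpu-jobs", gpus_per_job=4,
                               pool_precedence=["h100-superpod", "cloud-autoscaler"])
profile_8gpu = ResourceProfile("8-gpu-jobs", gpus_per_job=8,
                               pool_precedence=["h100-superpod", "cloud-autoscaler"])

# Execution priority within a pool across profiles: e.g. serve the larger
# requirement first when both a 4 GPU and an 8 GPU job are pending.
pool_profile_priority = {"cloud-autoscaler": [profile_8gpu, profile_4gpu]}
```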

Resource Policies

ClearML resource policies let administrators define resource quotas, reservations, and limits for user groups, enabling workload prioritization across the available resources.

The resource policy manager guarantees that resource reservations are honored (i.e., even when resource usage is at capacity, the user group will have at least its reserved amount of resources available for workload execution) and enforces the policy’s limit.

Administrators assign resource profiles to a policy and make them available to its user group via ClearML queues; i.e., jobs enqueued to a queue will be allocated the number of resources defined by the queue’s profile.
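Putting it together, a policy can be thought of as a user group plus a reservation, a limit, and the queues that expose its profiles. The following is an illustrative sketch, not ClearML code; the group names and numbers are borrowed from the example later in this post.

```python
# Illustrative sketch only, not the ClearML API.
from dataclasses import dataclass

@dataclass
class ResourcePolicy:
    user_group: str   # e.g. "r-and-d" or "devops"
    reserved: int     # resources guaranteed to the group, even at capacity
    limit: int        # resources the group may never exceed
    queues: dict      # queue name -> resources per job (as defined by the queue's profile)

rnd_policy = ResourcePolicy("r-and-d", reserved=5, limit=9,
                            queues={"rnd-3gpu": 3})
devops_policy = ResourcePolicy("devops", reserved=5, limit=9,
                               queues={"devops-3gpu": 3, "devops-2gpu": 2})
```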

Creating and Managing Resource Policies in ClearML

Making the Magic Happen Through Quotas, Reservations, and Preemption

To maximize resource utilization, users can run as many jobs as their designated resource policy can provide resources for, up to the policy’s defined limit.

Similarly, a policy’s resource reservation does not hold resources idle when they are not in use; they remain available for other workloads.

Policies’ resource limits and reservations determine which jobs are assigned resources when job execution is at capacity. When a job is sent for execution and all resources are in use, the ClearML resource policy manager will decide, based on the job owner’s resource policy definitions, whether it should execute immediately or wait for resources to free up. If the jobs currently running for the owner’s user group have already consumed more than the policy’s reserved resource amount, the job will wait until resources free up.

If a reservation needs to be honored, the ClearML resource policy manager will free up resources by pre-empting currently running jobs. Jobs running beyond their policy’s reservation are considered for pre-emption first.

When pre-empting a running job, the ClearML policy manager can invoke an ‘abort’ callback, giving users a chance to accommodate the pre-emption scenario.
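To make the decision flow concrete, here is a simplified scheduling sketch, assuming a single shared capacity and per-group usage tracking. It is not ClearML’s implementation; all names (`schedule_at_capacity`, `on_abort`, and so on) are hypothetical.

```python
# Illustrative sketch only, not ClearML's implementation. When all resources
# are in use, a job runs immediately only if its group is still within its
# reservation; to honor that reservation, jobs from groups running beyond
# their own reservation are pre-empted first, each receiving an abort callback.
def schedule_at_capacity(job, free, usage, policies, running_jobs, on_abort):
    """Decide whether `job` runs now or waits (all names are hypothetical).

    job          -- object with .group and .resources
    free         -- number of currently idle resources
    usage        -- dict: group -> resources currently in use by that group
    policies     -- dict: group -> object with a .reserved attribute
    running_jobs -- list of (group, resources) tuples, oldest first
    on_abort     -- callback invoked before a job is pre-empted
    """
    # A group that has already consumed its reservation must wait its turn.
    if usage[job.group] + job.resources > policies[job.group].reserved:
        return "queued"

    # Otherwise honor the reservation: pre-empt jobs belonging to groups that
    # are running beyond their own reservation, most recently started first.
    needed = job.resources - free
    victims = []
    remaining = dict(usage)
    for group, resources in reversed(running_jobs):
        if needed <= 0:
            break
        if remaining[group] > policies[group].reserved:
            victims.append((group, resources))
            remaining[group] -= resources
            needed -= resources
    if needed > 0:
        return "queued"  # not enough pre-emptible work to honor the reservation

    for group, resources in victims:
        on_abort(group, resources)   # abort callback before each pre-emption
        usage[group] -= resources
    return "running"
```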

ClearML Resource Policy Manager System Design

An Example of Resource Policing

Assume 10 resources are available, and we configure resource policies for the R&D and DevOps teams with 5 reserved resources and a 9-resource limit each.

The R&D team submits five 3-resource jobs. Three jobs will run (consuming 9 resources and leaving 1 idle), adhering to the policy’s 9-resource limit, and the remaining two will be queued until resources free up. The DevOps team now needs to run a 3-resource job. Since the DevOps team has 5 reserved resources, the job most recently started by the R&D team will be pre-empted to honor this reservation (freeing 3 resources), and the DevOps team’s job will run.

The DevOps team now needs to run a 2-resource job. At this point, only 1 resource is idle (the 3 resources freed by pre-emption were allocated to honor the DevOps team’s reservation). Since the DevOps team still has 2 of its reserved resources unused and the R&D team is still over its reservation (6 resources consumed by two 3-resource jobs), an additional R&D job will be pre-empted to allow the DevOps team’s job to run.
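For readers who prefer to follow the numbers in code, here is the same scenario as plain arithmetic. This is an illustrative walkthrough only, not ClearML code; the team names and numbers mirror the example above.

```python
# Illustrative walkthrough of the scenario above (plain arithmetic, not ClearML code).
TOTAL = 10
RESERVED = {"rnd": 5, "devops": 5}
LIMIT = {"rnd": 9, "devops": 9}

# Step 1: R&D submits five 3-resource jobs; only three fit under its 9-resource limit.
rnd_used = 3 * 3                          # 9 resources in use, 1 idle
assert rnd_used <= LIMIT["rnd"]

# Step 2: DevOps submits a 3-resource job. DevOps is under its reservation,
# so the most recently started R&D job is pre-empted to free 3 resources.
rnd_used -= 3                             # R&D drops to 6
devops_used = 3
assert devops_used <= RESERVED["devops"]

# Step 3: DevOps submits a 2-resource job. Only 1 resource is idle, DevOps still
# has 2 reserved resources unused, and R&D (6 in use) is over its 5 reservation,
# so one more R&D job is pre-empted.
rnd_used -= 3                             # R&D drops to 3
devops_used += 2                          # DevOps at 5, exactly its reservation
assert rnd_used + devops_used <= TOTAL    # 8 of 10 resources now in use
```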

Conclusion

Hopefully we’ve shown how easy the Resource Allocation & Policy Management Center makes it for IT teams to manage resource policies for their AI infrastructure and enable workload prioritization. Get greater visibility and transparency into who has access to what resources, and make sure you are getting the most out of your expensive GPUs.

To learn more about how ClearML can help your team get more out of your compute infrastructure through higher utilization and better resource allocation, please request a demo and speak to someone on our sales team.
