ClearML Supports Seamless Orchestration and Infrastructure Management for Kubernetes, Slurm, PBS, and Bare Metal

May 8, 2024

Our early roadmap in 2024 has been largely focused on improving orchestration and compute infrastructure management capabilities. Last month we released a Resource Allocation Policy Management Control Center with a new, streamlined UI to help teams visualize their compute infrastructure and understand which users have access to what resources. We also enabled fractional GPU capabilities for all NVIDIA GPUs (old and new) in our open source version of ClearML available on GitHub, so now all self-hosted ClearML users can take advantage of GPU slicing and maximize the utilization of their hardware. You can read more about it in our previous blog post.

Now we are expanding that focus to reducing the overhead of managing and controlling AI infrastructure through visibility and simplification. ClearML supports workloads running on Kubernetes, Slurm, PBS, or bare metal, making our platform the most comprehensive tool available for managing infrastructure built for AI and HPC workflows. ClearML users can now seamlessly build, train, and deploy their AI and machine learning workflows with ease and efficiency, at any scale, on any AI infrastructure.


ClearML and Kubernetes

ClearML works with Kubernetes or Kubernetes-as-a-Service, running as a client on your cluster.

Kubernetes (K8s) is a powerful platform for managing containerized applications across an organization's computing infrastructure, and it is often used by AI teams for their training and inference workloads. Because Kubernetes is designed to ensure the uptime of services, managing it (especially in the ways AI teams require) is complex and delicate. Access is typically limited to DevOps, and giving access to others is strongly discouraged because of the risk of crashing the entire system.

ClearML gives stakeholders access to compute clusters without exposing direct access to the Kubernetes cluster itself. The platform provides AI builders with a seamless end-to-end experience for the entire AI lifecycle: teams can pre-process data, train or fine-tune models, and deploy them into their Kubernetes production environment on the same platform, using the same infrastructure, with minimal friction. Our Kubernetes-native solution also extends K8s provisioning with scheduling, priority, quota management, and dynamic fractional GPU capabilities, in addition to user and multi-tenancy management. Once admins set up role-based access control and stored credentials, AI builders can self-serve compute and manage their own AI/HPC workloads as permitted by resource allocation policies. Data and results from all jobs are logged, sent to the desired storage location, and centrally accessible through the ClearML web interface.
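To make the self-serve flow concrete, here is a minimal sketch using the ClearML Python SDK. The project and queue names are illustrative assumptions; the queue stands in for one an admin has mapped to the Kubernetes-based agent.

```python
from clearml import Task

# Register this run as an experiment; everything it logs becomes centrally
# visible in the ClearML web interface. Names here are illustrative.
task = Task.init(project_name="examples", task_name="train-model")

# Hand the rest of the script to the cluster: this call stops local
# execution, enqueues the task, and the ClearML Agent serving this queue
# on Kubernetes picks it up and recreates the environment there.
task.execute_remotely(queue_name="default", exit_process=True)

# From this point on, the code executes on the agent-provisioned worker.
print("Training on cluster-provisioned hardware...")
```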

With ClearML, DevOps can essentially stop spending time on menial AI team requests: they can entrust the provisioning of machines to ClearML and focus on higher-impact work such as optimizing compute utilization through fractional GPUs or improved quota and scheduling policies. Admins also gain the ability to launch AI jobs from outside the K8s cluster and to add cloud spillover, expanding capacity beyond on-prem or single cloud-zone compute.

ClearML and Slurm/PBS

ClearML is the first AI platform on the market to work with Slurm and Altair PBS.

Slurm and PBS are high-performance scheduling tools that enable pre-emptive and gang scheduling, fair-share scheduling, advance reservations, and advanced policy management. Both scale to very large clusters and run on many of the world's top supercomputers, because they were developed for a single purpose: optimizing compute clusters that handle resource-intensive AI/HPC jobs. Working with Slurm/PBS is challenging, however: the tools offer little visibility into what is actually happening across the cluster, and it is difficult to set up job environments and to integrate with external automation such as CI/CD workflows.

With ClearML, data scientists developing AI models can seamlessly run experiments, train models, manipulate datasets, and build pipelines and automations. Jobs are placed into ClearML queues that are mapped to Slurm/PBS template jobs. Upon execution, the ClearML Agent orchestrates the environment setup and launches the job. To prevent jobs from failing on Slurm, ClearML handles many of the small details: it sets up each job with the correct environment (parameters, configurations) and working data connections (and even containers, via Singularity). ClearML also monitors each job's machine utilization (CPU, GPU, memory, etc.) to provide insights into how the cluster's resources are being used.
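As a rough illustration of the queue-to-template mapping, the sketch below shows what such a Slurm template might look like. The placeholder name and template syntax are assumptions for illustration only (the clearml-agent execute command itself is real): the agent fills in the queued task's ID and submits the job, so data scientists never write sbatch scripts themselves.

```python
# Illustrative sketch only: the placeholder and template format are
# assumptions, not ClearML's documented syntax. The idea: each ClearML
# queue maps to a Slurm batch template, and the agent fills in the queued
# task's ID and submits the job on the team's behalf.
SLURM_TEMPLATE = """\
#!/bin/bash
#SBATCH --job-name=clearml_${CLEARML_TASK_ID}
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

# Inside the allocation, the agent recreates the task's environment
# (parameters, configuration, data connections) and runs it:
clearml-agent execute --id ${CLEARML_TASK_ID}
"""
```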

AI teams have full visibility into their ClearML queues, including those being sent to Slurm/PBS. Data scientists and admins can see the status of each job, access jobs within queues, and stop or remove them when priorities change. With ClearML, teams can not only see the jobs across single or multiple Slurm/PBS clusters, but also make changes to them.
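The same visibility is scriptable through the ClearML Python SDK. A minimal sketch, assuming an illustrative project name:

```python
from clearml import Task

# List the queued jobs in a project. The same information is shown in the
# ClearML web UI; the SDK simply makes it scriptable.
queued = Task.get_tasks(
    project_name="examples",
    task_filter={"status": ["queued"]},
)
for t in queued:
    print(t.id, t.name, t.status)

# If priorities change, a job can be aborted before it ever reaches the
# Slurm/PBS cluster (force=True stops it regardless of its current state).
if queued:
    queued[0].mark_stopped(force=True)
```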

ClearML and Bare Metal

For AI teams that want the least possible overhead and administration, ClearML makes bare metal work for you.

Teams building AI on ClearML can easily take advantage of bare metal machines (including their own workstations) for running AI workloads. The ClearML Agent can be installed on bare metal, instantly making those machines available to serve queues and run jobs. The solution is truly plug-and-play: ClearML runs jobs in containers or virtual environments without the need to customize network setups or configure firewalls, making it a simple solution for teams that need fully accessible remote GPU machines without the complexity of managing a large-scale cluster.
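Here is a minimal sketch of that setup, assuming the machine already has ClearML server credentials configured (for example via clearml-agent init) and using an assumed queue name of "default". The same two commands can of course be run directly in a shell.

```python
# A minimal sketch of turning a bare-metal machine (or a workstation) into
# a ClearML worker. Credentials are assumed to be configured already, e.g.
# via `clearml-agent init`; the queue name "default" is an assumption.
import subprocess

# Install the agent on the machine.
subprocess.run(["pip", "install", "clearml-agent"], check=True)

# Start a worker that pulls jobs from the "default" queue. Each job runs
# inside a Docker container; drop "--docker" to use virtual environments
# instead. "--detached" keeps the worker running in the background.
subprocess.run(
    ["clearml-agent", "daemon", "--queue", "default", "--docker", "--detached"],
    check=True,
)
```

Once the daemon is running, the machine shows up in the ClearML workers list and can be targeted like any other resource.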

Conclusion

With ClearML, AI teams gain control and visibility over what resources (or fractions of resources) each person can access and easily self-serve compute without changing any part of their day-to-day workflows. 

As compute infrastructures face growing demands in 2024, extracting the most out of each GPU will be a priority. With our latest product updates, organizations can dramatically improve the utilization of their compute infrastructure. To learn more, request a demo to speak with someone on our sales team.
