How ClearML Helps Teams Get More out of Slurm

May 8, 2024

It is a fairly recent trend for companies to amass GPU firepower to build their own AI computing infrastructure and support the growing number of compute requests. Many recent AI tools now enable data scientists to work on data, run experiments, and train models seamlessly with the ability to submit their jobs and monitor their progress. However, for many organizations with mature supercomputing capabilities, Slurm has been the scheduling tool of choice for managing computing clusters.

In this blog post, we cover how ClearML works with Slurm and the benefits that ClearML delivers for organizations using this powerful scheduling tool.

What Is Slurm?

The name Slurm derives from Simple Linux Utility for Resource Management (SLURM). It is a free and open source Linux-native job scheduler for managing AI and HPC workloads on compute clusters.

During the process of model development, Slurm can be used for:

  • Allocating compute resources to jobs through complex policy management.
  • Managing the job queue, bin-packing jobs, and optimizing compute usage on the cluster itself.
  • Providing the framework for managing the launching of jobs in parallel across multiple nodes, including set-up, execution, and monitoring.

Imagine simultaneously logging into the most optimized configuration of machines needed and running your job in parallel across all of them seamlessly – that’s the magic of Slurm. 
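
To make the idea concrete, here is a minimal sketch of what such a request looks like in practice. It renders a Slurm batch script using standard sbatch directives (--nodes, --ntasks-per-node, --gres, --time, --partition) and launches the command with srun. The helper name render_sbatch_script, the job name, and train.py are illustrative assumptions, not part of Slurm or ClearML.

```python
def render_sbatch_script(job_name, nodes, gpus_per_node, time_limit, command,
                         partition=None, tasks_per_node=1):
    """Return the text of a Slurm batch script for a multi-node job.

    Hypothetical helper for illustration; only the #SBATCH directives and
    srun are real Slurm features.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",                      # machines to allocate
        f"#SBATCH --ntasks-per-node={tasks_per_node}",   # processes per machine
        f"#SBATCH --gres=gpu:{gpus_per_node}",           # GPUs per machine
        f"#SBATCH --time={time_limit}",                  # hard wall-clock limit
    ]
    if partition:
        lines.append(f"#SBATCH --partition={partition}")
    # srun starts the command once per task across the allocated nodes.
    lines += ["", f"srun {command}"]
    return "\n".join(lines) + "\n"


script = render_sbatch_script(
    job_name="train-demo",       # assumed job name
    nodes=4,
    gpus_per_node=8,
    time_limit="02:00:00",
    command="python train.py",   # assumed training entry point
)
print(script)
```

Saving the printed text to a file and passing it to sbatch would queue the job. Slurm then picks the nodes, and srun starts the command on each of them.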

[Figure: Slurm architectural diagram]

Why Use Slurm: Reasons and Use Cases

As an open source Linux-native tool, Slurm requires a lot of coding and script writing. However, unlike Kubernetes, Slurm was developed expressly to optimize the compute within clusters handling resource-intensive AI and HPC jobs. It is a highly scalable, high-performance scheduling tool that supports fair-share scheduling, preemptive and gang scheduling, advanced reservations, and advanced policy management. Newer versions of Slurm also support containers.

Slurm enables AI teams to manage multiple users across their computing clusters, with real-time job profiling, budgets, and power consumption data available through a reporting API. It also enforces hard time limits on jobs so that no single workload can monopolize resources.

Slurm is used by a significant number of powerful supercomputers, including six of the top 10 in the world. It is a strong alternative for managing compute clusters (CPU or GPU) running compute-intensive workloads such as simulations, 3D modeling, and digital twin creation. Companies using Slurm are building engines, working with protein molecules, simulating wind tunnel aerodynamics, testing machinery, and designing smart cities digitally, among many other use cases. Think of scenarios that require 100+ CPUs working in parallel on a single job: that is when Slurm shines.

The Challenges of Using Slurm

Alas, if it sounds too good to be true, it usually is. Using Slurm is not without its challenges. First, as mentioned, Slurm requires a lot of scripting in Bash (not even Python). This is quite cumbersome for building pipelines or creating automations because you need a lot of code to help the system make decisions, and sometimes things break.
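
As a rough illustration of that bookkeeping, consider a three-stage pipeline in which each stage may start only after the previous one succeeds. The sketch below is written in Python for readability, but the same logic would normally live in Bash. It uses two real sbatch options, --parsable and --dependency=afterok:<job_id>. The function names, the stage script names, and the fake runner are assumptions made for the example.

```python
import itertools


def build_sbatch_command(script_path, depends_on=None):
    """Build the argv for submitting one batch script.

    --parsable makes sbatch print only the job ID (followed by ';cluster'
    on multi-cluster setups), which is easy to capture.
    """
    argv = ["sbatch", "--parsable"]
    if depends_on is not None:
        # afterok: start only if the earlier job finished with exit code 0.
        argv.append(f"--dependency=afterok:{depends_on}")
    argv.append(script_path)
    return argv


def submit_pipeline(script_paths, run):
    """Submit scripts as a chain; `run` executes an argv and returns stdout."""
    job_ids = []
    previous = None
    for path in script_paths:
        output = run(build_sbatch_command(path, depends_on=previous))
        previous = output.strip().split(";")[0]  # drop optional cluster name
        job_ids.append(previous)
    return job_ids


# Stand-in runner so the sketch works without a cluster (hypothetical IDs).
_fake_ids = itertools.count(101)
submitted = []


def fake_run(argv):
    submitted.append(argv)
    return f"{next(_fake_ids)}\n"


job_ids = submit_pipeline(["prep.sh", "train.sh", "eval.sh"], fake_run)
print(job_ids)
for argv in submitted:
    print(" ".join(argv))
```

On a real cluster, `run` could be a thin wrapper around `subprocess.run(argv, check=True, capture_output=True, text=True).stdout`. Even this small chain has to capture job IDs and thread them through by hand, and it still says nothing about retries or what to do when a stage fails.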

Slurm cannot boot up different environments as each job requires, and it is very complicated to make sure that all of the data connections, containers, and drivers are available to users and that each element has cluster access. There is also no warning given when jobs fail. Slurm’s policy management is designed to run workloads from multiple users as efficiently as possible, so jobs that exceed memory quotas or take too long are simply killed.

Lastly, and most importantly for AI teams, there is a lack of visibility into what is happening inside each cluster. Once jobs are sent to Slurm, it is not possible to reprioritize the jobs or even see where they are in the queue. This lack of transparency forces users to simply wait until their job completes (or is killed) before they see a result.

ClearML and Slurm: How It Works

ClearML’s end-to-end AI foundational platform can be used in its entirety or modularly, and we are the first in the market to work with Slurm.

Data scientists developing AI models can work seamlessly on ClearML’s platform. Managing datasets (including large, unstructured files) and training models can be done efficiently and collaboratively with built-in functionality that automatically logs any changes made, enables logic-driven pipelines, and facilitates access to compute resources without additional intervention from IT.

ClearML’s enterprise-grade security features include SSO authentication, RBAC, and LDAP integration for user management, used in tandem with configuration vaults and policies for complete control over the data, models, and resources that users can access. 

With the Slurm integration, jobs are placed into queues that have been mapped to templates on Slurm. Within Slurm, ClearML then launches each job with the correct environment (settings, configurations, and parameters) and handles the burden of managing the required data connections and drivers. ClearML also ensures jobs do not consume more memory than allowed. For organizations with GPU infrastructure, ClearML further allows AI teams to run multiple jobs in parallel on the same compute resource using our fractional GPU capabilities.
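
The queue-to-template idea can be pictured with a small sketch. This is not ClearML's actual template format or integration code. The queue names, the placeholder names, and the render_job_script helper are all hypothetical, and the sketch assumes each job is started with the clearml-agent execute --id command inside an ordinary Slurm batch script.

```python
from string import Template

# Hypothetical queue names mapped to hypothetical batch-script templates.
# Each queue fixes the resources; the task ID is filled in per job.
QUEUE_TEMPLATES = {
    "slurm-1gpu": Template(
        "#!/bin/bash\n"
        "#SBATCH --job-name=clearml_task_${task_id}\n"
        "#SBATCH --gres=gpu:1\n"
        "#SBATCH --time=${time_limit}\n"
        "clearml-agent execute --id ${task_id}\n"
    ),
    "slurm-8gpu": Template(
        "#!/bin/bash\n"
        "#SBATCH --job-name=clearml_task_${task_id}\n"
        "#SBATCH --gres=gpu:8\n"
        "#SBATCH --time=${time_limit}\n"
        "clearml-agent execute --id ${task_id}\n"
    ),
}


def render_job_script(queue_name, task_id, time_limit="01:00:00"):
    """Fill in the template mapped to a queue for one task."""
    try:
        template = QUEUE_TEMPLATES[queue_name]
    except KeyError:
        raise ValueError(f"no Slurm template mapped to queue {queue_name!r}") from None
    return template.substitute(task_id=task_id, time_limit=time_limit)


# "abc123" is a made-up task ID used only for the example.
print(render_job_script("slurm-8gpu", "abc123", time_limit="04:00:00"))
```

The point of the mapping is that a data scientist only chooses a queue. The resource request and launch details stay in a template that the platform team maintains.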

ClearML provides AI teams with full visibility into their queues, even ones destined for Slurm. Jobs within queues are fully accessible and can still be stopped or removed prior to running, and it’s easy to see the status of each job. This level of transparency makes it simple to monitor what is happening within your Slurm cluster. 

Why Slurm + ClearML Is a Winning Combination

In addition to the benefits of having your AI team working together on a unified AI foundational platform, ClearML provides organizations with supplemental functionality that doesn’t come with Slurm out of the box, such as:

  • Visibility for monitoring all jobs within your Slurm cluster, understanding the resource utilization per job, and accessing aggregated results that can be compared using plots and scalars.
  • Automation with pipelines, task scheduling capabilities, and the ability to write custom automations without using Bash scripts.
  • Maximum utilization by ensuring full queues that keep the nodes from going idle.
  • Hybrid cloud possibilities with a system that supports cloud spillover with autoscalers on AWS, GCP, or Azure.

To learn more about ClearML’s Slurm capabilities, request a demo.