Run Slurm Workloads Inside Kubernetes With ClearML

December 9, 2025

By Erez Schnaider, Technical Product Marketing Manager, ClearML

Slurm has powered HPC environments for years. It is battle-tested, widely adopted, and deeply embedded in research and engineering workflows. Over 60% of the TOP500 supercomputers use it to manage their infrastructure, orchestrate workloads, and schedule jobs, and it has more than 20 years of engineering behind it.

Entire codebases assume sbatch, srun, and Slurm’s job semantics, and many teams rely on Slurm to get their jobs to the right compute. But running a traditional Slurm cluster presents real operational challenges, especially for AI teams.

Where Traditional Slurm Clusters Struggle With Modern AI

A classic Slurm deployment works well for long-lived, statically defined clusters. Modern AI and mixed HPC workloads behave very differently.

  1. No native autoscaling
    Slurm expects a fixed set of nodes. AI usage patterns swing between idle periods and massive spikes in demand. Without autoscaling, you either overprovision and waste capacity or constantly reconfigure your cluster to keep up.
  2. No native container layer
    Most AI workloads run inside containers (mainly Docker) that package frameworks, libraries, and system dependencies. Slurm does not natively support containers, so teams end up adding custom glue, wrappers, and out-of-band tooling to make containers work.
  3. Rigid networking requirements
    All worker nodes must communicate with the login node. Many HPC jobs (such as ones using MPI) depend on full bidirectional connectivity between nodes. This makes node placement inflexible and complicates multi-tenant environments and hybrid topologies.
  4. Hard to integrate into modern AI toolchains
    Modern AI systems today are not a single batch job. They rely on frameworks, databases, APIs, gateways, UI services and more. Slurm sits outside this ecosystem, so teams write and maintain a growing amount of custom integration code just to keep workflows running.

These issues come on top of Slurm’s inherent limitations, such as limited visibility into job status and rigid queuing. (We’ve expanded on these in this blog post.)

As AI research proliferates within Slurm-centric organizations, the need to support AI workflows alongside HPC pushes teams to adopt Kubernetes as an orchestrator next to their traditional Slurm deployments.

Why Adding Kubernetes Alone Is Not Enough

To support containerized AI workloads, many teams deploy Kubernetes alongside their existing Slurm cluster. That solves some problems but introduces a new set of challenges.

  1. Two separate systems to manage
    You now operate two infrastructure stacks with different schedulers, policies, and security models. Any attempt to share capacity, enforce quotas, or standardize access has to be implemented twice.
  2. Limited Kubernetes expertise
    Running Kubernetes for GPU-heavy or HPC-style workloads requires deep operational knowledge. Many Slurm-centric organizations do not have large platform engineering or DevOps teams dedicated to managing multiple clusters.
  3. No cluster-wide visibility
    If Slurm and Kubernetes are separate, there is no single view into utilization, health, and workload mix. It becomes difficult to answer basic questions such as which workloads should move where or how much infrastructure is idle.
  4. Under-utilized resources
    Static allocation of machines to Slurm or Kubernetes almost always leaves capacity stranded. If the Slurm side is busy and Kubernetes is quiet, you cannot easily shift nodes across, and the same is true in reverse.

The result is a fragmented environment where HPC, AI research, and production AI applications all compete for resources without a unified control plane.
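
To make the stranded-capacity point concrete, here is a small arithmetic sketch in Python. The node counts and demand figures are hypothetical, chosen only for illustration, and the function names are ours rather than part of any product. It compares a static split of 32 nodes with the same 32 nodes treated as one shared pool.

```python
def static_split(capacity, demand):
    """Idle and unmet node counts when each side can only use its own nodes."""
    idle = sum(max(capacity[name] - demand[name], 0) for name in capacity)
    unmet = sum(max(demand[name] - capacity[name], 0) for name in capacity)
    return idle, unmet


def shared_pool(capacity, demand):
    """Idle and unmet node counts when all nodes can serve any demand."""
    total_capacity = sum(capacity.values())
    total_demand = sum(demand.values())
    idle = max(total_capacity - total_demand, 0)
    unmet = max(total_demand - total_capacity, 0)
    return idle, unmet


# Hypothetical numbers: 32 nodes split evenly, with Slurm busy and Kubernetes quiet.
capacity = {"slurm": 16, "kubernetes": 16}
demand = {"slurm": 24, "kubernetes": 6}

static_idle, static_unmet = static_split(capacity, demand)
pooled_idle, pooled_unmet = shared_pool(capacity, demand)

print(f"static split: {static_idle} idle nodes, {static_unmet} nodes of unmet demand")
print(f"shared pool:  {pooled_idle} idle nodes, {pooled_unmet} nodes of unmet demand")
```

With these assumed numbers, the static split leaves 10 nodes idle while 8 nodes' worth of Slurm jobs wait, whereas a shared pool would serve all demand and leave 2 nodes idle.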

Slurm Over Kubernetes With ClearML

ClearML introduces a different approach. Instead of running a separate, dedicated Slurm cluster, you run Slurm inside Kubernetes as needed.

It’s an elegant solution with substantial impact: spin up a Slurm login node and as many worker nodes as you need, each running in its own Kubernetes pod. As workload demands change, scale the number of nodes up or down to match. If the cluster is no longer needed, you can tear it down with a single click.

This changes Slurm from a static, dedicated environment into an elastic service powered by Kubernetes and managed through ClearML’s AI infrastructure platform.

You keep:

  • The familiar Slurm interface and job semantics
  • Existing sbatch- and srun-based scripts
  • HPC-oriented workflows and MPI support

You gain:

  • Kubernetes autoscaling and bin-packing
  • Native container support
  • Centralized visibility and policy control through ClearML
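
Because existing sbatch-based scripts carry over, job submission can stay plain Slurm. Below is a minimal Python sketch of a submission helper, assuming it runs somewhere sbatch is on the PATH (for example, the login node). The script name train.sh, the job name, and the helper functions are hypothetical; the sbatch flags used (--parsable, --job-name, --nodes, --gres) are standard Slurm options.

```python
import shlex
import subprocess


def build_sbatch_command(script_path, job_name, nodes=1, gpus_per_node=0):
    """Build an sbatch command line as a list of arguments."""
    cmd = ["sbatch", "--parsable", f"--job-name={job_name}", f"--nodes={nodes}"]
    if gpus_per_node > 0:
        # --gres is a per-node request, e.g. gpu:4 means 4 GPUs on each node.
        cmd.append(f"--gres=gpu:{gpus_per_node}")
    cmd.append(script_path)
    return cmd


def submit(cmd):
    """Run sbatch and return the job ID as a string (requires a Slurm environment)."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    # With --parsable, sbatch prints "jobid" or "jobid;cluster".
    return result.stdout.strip().split(";")[0]


# Hypothetical job: a two-node training run with 4 GPUs per node.
command = build_sbatch_command("train.sh", "resnet-train", nodes=2, gpus_per_node=4)
print(shlex.join(command))
```

Calling submit(command) would hand the job to Slurm; the sketch only prints the command so it can be inspected without a cluster.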

How It Works

A typical Slurm over Kubernetes flow with ClearML looks like this:

  1. Cluster definition
    An admin logs into ClearML’s user interface and configures the new Slurm cluster’s properties, such as the number of worker nodes and which resource the cluster will run on.
  2. Login node pod creation
    A Slurm login node pod is created on the designated Kubernetes cluster. This pod exposes the familiar Slurm interfaces and acts as the entry point for jobs.
  3. Worker pods scaling
    Worker node pods are spun up according to configuration. Administrators can scale the nodes up or down as workload demand changes, and this can be automated using ClearML’s scheduling and autoscaling policies.
  4. Networking and MPI connectivity
    ClearML handles network connectivity between Slurm pods, so MPI and HPC workloads running on multiple pods can communicate with each other and with the login node without manual networking configuration.
  5. Job submission and execution
    Users submit jobs to the queue backed by Slurm workers, so the workloads execute on Slurm. These jobs run on worker pods inside Kubernetes, using the container images and resources that have been defined.
  6. Cluster shutdown
    When the job queue is empty and the cluster is not needed, you can shut the entire cluster down in a single action and free idle resources.

From the user perspective, it is still Slurm. From the platform perspective, it is an elastic service running on Kubernetes and controlled by ClearML.
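
The worker scaling step in the flow above can be pictured with a simple sizing rule. The following Python sketch is purely illustrative and is not ClearML's actual autoscaling policy: the function name, its parameters, and the min/max bounds are all assumptions. It sizes the worker pool to the nodes in use plus the nodes requested by pending jobs, clamped to configured limits.

```python
def desired_workers(running_nodes, pending_node_requests, min_workers=0, max_workers=8):
    """Hypothetical sizing rule: cover running plus pending demand, within bounds."""
    demand = running_nodes + sum(pending_node_requests)
    return max(min_workers, min(demand, max_workers))


# Two nodes busy, pending jobs asking for 1 and 4 nodes: scale to 7 workers.
scale_up = desired_workers(2, [1, 4])

# Demand of 14 nodes exceeds the assumed ceiling of 8: stay at 8 workers.
capped = desired_workers(6, [4, 4])

# Empty queue and nothing running: scale down to the assumed floor of 0.
scale_down = desired_workers(0, [])

print(scale_up, capped, scale_down)
```

A real policy would likely add cooldown periods and GPU-type awareness; the point here is only that the worker count becomes a function of queue state rather than a fixed number.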

ClearML: Turning Slurm And Kubernetes Into One Managed Platform

Slurm inside Kubernetes unlocks powerful new flexibility. The real value comes when that flexibility is paired with a platform that handles orchestration, scheduling, multi-tenancy, and observability.

Admins and users can benefit from ClearML features such as:

  1. Cross-cluster scheduling
    Schedule jobs for execution on Kubernetes or Slurm from a single interface. Apply routing rules so the platform decides where a job should run based on priority, GPU type, cost profile, or team, without users changing their workflows.
  2. Hybrid infrastructure scheduling and spillover
    Start with an on-prem cluster and spill over to the cloud when demand spikes or SLAs require more capacity. Set quotas per team and enforce RBAC rules that expose each user only to specific clusters, nodes, or GPU types, while still drawing from a shared global pool.
  3. Built-in secure dynamic multi-tenancy with billing included
    Isolate teams, projects, and business units while they share the same physical infrastructure. Apply per-tenant resource limits, track consumption at the user or project level, and export usage data to your billing or chargeback systems so you know exactly who used which resources and when.
  4. Full job tracking
    Track each job with ClearML’s AI Development Center for both job and hardware related metrics. Capture logs, metrics, and artifacts in one place, compare runs across Slurm and Kubernetes backends, and keep a complete history for auditing, debugging, and optimization.
  5. GPU-as-a-Service and Model-as-a-Service built into the platform
    Launch a development environment or a model endpoint with a single click. Spin up notebooks or IDE sessions on top of Slurm or Kubernetes capacity, and promote trained models into managed services that inherit the same security, quotas, and monitoring.

This also means that your Kubernetes-managed and Slurm-managed resources are part of a single, centrally managed platform.

The Result: Slurm That Fits The Kubernetes And AI Era

If you do not want to operate a permanent, standalone Slurm cluster, or you need to bring HPC-style workflows into a Kubernetes-first environment, ClearML’s Slurm over Kubernetes capability offers a practical path forward.

You keep the workflow compatibility and HPC ergonomics of Slurm while gaining Kubernetes agility, cost efficiency, and all of ClearML’s platform features. The result is a better experience for both engineering and operations teams.
