ClearML + NVIDIA Dynamo: A Production Control Plane for Distributed AI Inference at Scale

April 9, 2026

By Adam Wolf

NVIDIA Dynamo 1.0 is a datacenter-scale inference orchestration framework that turns clusters of GPUs into a coordinated serving system. ClearML provides the operational and security layer that makes it deployable in enterprise production. Here is how it all works.

Why Inference Has Become the Hard Problem

Training large language models is an expensive, batch-oriented process. Running them in production is a different problem entirely, one that compounds in difficulty as models grow larger, request volumes increase, and latency expectations tighten.

A single 70-billion-parameter model already pushes the limits of what a single GPU node can serve efficiently. Models in the 400B+ range (DeepSeek-R1, Qwen3-Coder 480B) require multi-node deployments by necessity. And as AI applications move from demos into production, organizations are discovering that simply running a vLLM or SGLang instance and pointing traffic at it doesn’t scale: throughput degrades under load, prefill computation gets repeated across requests that share context, decode latency spikes when GPU memory fills up, and there’s no mechanism to elastically right-size the serving fleet against demand.

This is the gap NVIDIA Dynamo 1.0 was built to fill, and it is where the ClearML integration becomes the bridge between a powerful inference framework and a production enterprise deployment.

What NVIDIA Dynamo Actually Is

NVIDIA Dynamo (open source, Apache 2.0) is a datacenter-scale distributed inference serving framework. The key distinction from inference engines like SGLang, TensorRT-LLM, or vLLM is that Dynamo sits above them: it orchestrates them rather than replacing them. Think of it as the coordination layer that turns a collection of individual inference engines running across many nodes into a single coherent serving system.

When to use Dynamo

Dynamo is the right choice when: you’re serving LLMs across multiple GPUs or nodes and need to coordinate them, you want to avoid redundant prefill computation through KV-aware routing, you need to independently scale prefill and decode phases, or you need autoscaling that meets latency SLAs at minimum infrastructure cost. If you’re running a single model on a single GPU, your inference engine alone is probably sufficient.

Dynamo 1.0 reached general availability in March 2026. It is built in Rust for performance and Python for extensibility and supports SGLang, TensorRT-LLM, and vLLM as backend inference engines.

The Core Technical Architecture

To understand why ClearML’s integration matters, it helps to first understand what Dynamo is doing at the infrastructure level. Dynamo’s architecture is organized around several interconnected capabilities that solve distinct inference scaling problems.

1. Disaggregated Prefill and Decode

LLM inference has two fundamentally different compute phases. Prefill processes the entire input prompt in one parallel computation; it’s compute-bound and produces the KV cache entries for the whole context. Decode generates one token at a time sequentially; it’s memory-bandwidth-bound and highly sensitive to latency.

When both phases run on the same GPU pool, they compete for the same resources and interfere with each other’s efficiency. A long prefill processing a 32K-token document blocks the decode phase for other requests; a batch of short decode steps underutilizes a GPU optimized for compute-heavy prefill.

Dynamo’s disaggregated serving splits these into independently scalable GPU pools. Prefill workers handle prompt processing; decode workers handle token generation. Each pool is sized and scaled independently based on the actual workload profile: you can add more compute-optimized GPUs at prefill spikes without over-provisioning your decode fleet, and vice versa.
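As a rough illustration of why independent sizing matters, the sketch below estimates how many workers each pool needs for two different workload profiles. The `pool_sizes` helper and the per-worker throughput figures are hypothetical placeholders chosen for the example; they are not Dynamo APIs or measured numbers.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    requests_per_s: float  # incoming request rate
    prompt_tokens: int     # average input length (prefill work)
    output_tokens: int     # average generated length (decode work)


def pool_sizes(w: Workload, prefill_tok_s: float, decode_tok_s: float) -> tuple[int, int]:
    """Return (prefill_workers, decode_workers) needed to keep up with demand.

    prefill_tok_s and decode_tok_s are assumed per-worker throughputs.
    """
    prefill_demand = w.requests_per_s * w.prompt_tokens
    decode_demand = w.requests_per_s * w.output_tokens
    return (
        max(1, math.ceil(prefill_demand / prefill_tok_s)),
        max(1, math.ceil(decode_demand / decode_tok_s)),
    )


# Long-document summarization: heavy prefill, light decode.
summarize = Workload(requests_per_s=10, prompt_tokens=32_000, output_tokens=200)
# Chat: short prompts, longer answers.
chat = Workload(requests_per_s=10, prompt_tokens=500, output_tokens=800)

summarize_pools = pool_sizes(summarize, prefill_tok_s=40_000, decode_tok_s=2_000)
chat_pools = pool_sizes(chat, prefill_tok_s=40_000, decode_tok_s=2_000)
print(summarize_pools, chat_pools)
```

Under these assumed numbers the same request rate needs a prefill-heavy fleet for summarization and a decode-heavy fleet for chat, which is the imbalance a single shared pool cannot express.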

Source: github.com/ai-dynamo/dynamo (the README includes an architecture overview diagram).

2. KV-Aware Routing

The KV cache is the stored result of the prefill computation; it’s what enables a model to “remember” context across a conversation without reprocessing it from scratch on every request. In a naive multi-worker setup, this cache lives on individual workers, and request routing is typically round-robin or load-based. That means a request that shares a long prefix with a previous request (a common pattern in agentic workflows and chat applications) gets routed to a worker that doesn’t have that KV cache, and the prefill computation happens again from scratch.

Dynamo’s KV-aware router tracks which workers hold which KV cache blocks and routes incoming requests to workers where the maximum prefix overlap exists. According to benchmarks published in the Dynamo repository, this delivers 2x faster time-to-first-token for workloads with significant prompt reuse, demonstrated on Qwen3-Coder 480B.
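The routing idea can be sketched in a few lines. This is a minimal illustration, assuming fixed-size token blocks identified by chained hashes and a router that remembers which blocks it has sent to each worker; `PrefixRouter`, `block_hashes`, and the block size are hypothetical names and values, not Dynamo's actual router implementation.

```python
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per KV block (tiny, for illustration only)


def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each full block together with everything before it (chained hashes)."""
    hashes, prev = [], ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        block = ",".join(map(str, tokens[i:i + BLOCK_SIZE]))
        prev = sha256((prev + "|" + block).encode()).hexdigest()
        hashes.append(prev)
    return hashes


class PrefixRouter:
    def __init__(self, workers: list[str]):
        self.cached: dict[str, set[str]] = {w: set() for w in workers}
        self.load: dict[str, int] = {w: 0 for w in workers}

    def overlap(self, worker: str, hashes: list[str]) -> int:
        """Count leading blocks already cached on the worker."""
        n = 0
        for h in hashes:
            if h not in self.cached[worker]:
                break
            n += 1
        return n

    def route(self, tokens: list[int]) -> str:
        hashes = block_hashes(tokens)
        # Prefer the largest cached prefix; break ties by lowest load.
        best = max(self.cached, key=lambda w: (self.overlap(w, hashes), -self.load[w]))
        self.cached[best].update(hashes)
        self.load[best] += 1
        return best


router = PrefixRouter(["worker-a", "worker-b"])
system_prompt = list(range(16))  # shared 16-token prefix
first = router.route(system_prompt + [100, 101, 102, 103])
router.route([900, 901, 902, 903])  # unrelated request goes to the idle worker
second = router.route(system_prompt + [200, 201, 202, 203])
print(first, second)
```

The second request with the shared system prompt returns to the worker that already holds those blocks, even though a purely load-based policy would see both workers as equally busy. A production router also has to account for cache eviction and worker load in more detail than this sketch does.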

3. KV Block Manager (KVBM): Multi-Tier Cache Offload

GPU HBM (high-bandwidth memory) is the scarcest resource in LLM serving. The KV cache for long contexts can consume enormous amounts of it. The KV Block Manager addresses this by creating a tiered storage hierarchy for KV cache blocks:

  • GPU HBM: hot, actively used cache blocks
  • CPU DRAM: warm blocks, recently evicted from GPU memory
  • NVMe SSD: cool blocks, less recently used
  • Remote storage: cold blocks, accessible cluster-wide

When a request needs a KV block that has been evicted from GPU memory, KVBM retrieves it from the next tier rather than recomputing it. This effectively extends the usable context window beyond what GPU memory alone could support. KVBM is currently available for TensorRT-LLM and vLLM backends; SGLang support is in progress.
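The tiering behavior can be modeled as a chain of LRU caches in which eviction demotes a block instead of discarding it. The sketch below is a toy model under that assumption; `TieredKVCache`, the tier names, and the capacities are hypothetical and do not reflect KVBM's real interfaces or eviction policy.

```python
from collections import OrderedDict

TIERS = ["gpu_hbm", "cpu_dram", "nvme_ssd", "remote"]


class TieredKVCache:
    """Toy multi-tier LRU: evicted blocks are demoted, not discarded."""

    def __init__(self, capacities: dict[str, int]):
        self.capacities = capacities
        self.tiers = {name: OrderedDict() for name in TIERS}

    def _insert(self, level: int, block_id: str, data: bytes) -> None:
        tier = self.tiers[TIERS[level]]
        tier[block_id] = data
        tier.move_to_end(block_id)
        if len(tier) > self.capacities[TIERS[level]]:
            old_id, old_data = tier.popitem(last=False)  # least recently used
            if level + 1 < len(TIERS):
                self._insert(level + 1, old_id, old_data)  # demote one tier down
            # Blocks pushed out of the last tier are dropped and must be recomputed.

    def put(self, block_id: str, data: bytes) -> None:
        self._insert(0, block_id, data)

    def get(self, block_id: str):
        """Return (data, tier_found) and promote the block to GPU HBM, or None on a miss."""
        for name in TIERS:
            if block_id in self.tiers[name]:
                data = self.tiers[name].pop(block_id)
                self._insert(0, block_id, data)
                return data, name
        return None


cache = TieredKVCache({"gpu_hbm": 2, "cpu_dram": 2, "nvme_ssd": 4, "remote": 8})
for i in range(5):
    cache.put(f"block-{i}", b"kv")
hit = cache.get("block-0")  # oldest block, already demoted past CPU DRAM
print(hit)
```

A lookup that finds the block in a lower tier costs a transfer rather than a recomputation, which is the trade the paragraph above describes.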

4. ModelExpress: Weight Streaming for Fast Cold Starts

When a new inference worker replica spins up, it needs to load model weights before it can serve requests. For a model like DeepSeek-V3 (671B parameters), this traditionally takes minutes, making autoscaling slow to respond to demand spikes. Dynamo’s ModelExpress addresses this by streaming weights directly GPU-to-GPU via NIXL (NVIDIA Inference Xfer Library) and NVLink. The result, per Dynamo’s benchmarks, is 7x faster model startup for DeepSeek-V3 on H200 hardware.

5. The Planner: SLA-Driven Autoscaling

The Planner is Dynamo’s SLA-aware autoscaling component. Rather than scaling purely on utilization metrics (CPU%, queue depth), the Planner profiles workload characteristics and scales to meet user-defined latency targets, namely time-to-first-token (TTFT) and inter-token latency (ITL), at the minimum infrastructure footprint. According to benchmarks from Alibaba’s APSARA 2025 deployment published in the Dynamo documentation, the Planner achieved 80% fewer SLA breaches at 5% lower TCO compared to utilization-based autoscaling.
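To make the contrast with utilization-based scaling concrete, here is a minimal sketch of a single scaling step driven by latency targets. It assumes TTFT is steered by the prefill pool and ITL by the decode pool; the `plan` function, the `headroom` threshold, and the one-replica step size are illustrative assumptions, not the Planner's actual algorithm, which also relies on workload profiling.

```python
from dataclasses import dataclass


@dataclass
class Sla:
    ttft_ms: float
    itl_ms: float


@dataclass
class Pools:
    prefill: int
    decode: int


def plan(pools: Pools, sla: Sla, ttft_ms: float, itl_ms: float, headroom: float = 0.7) -> Pools:
    """One scaling step driven by latency targets rather than utilization.

    Scale a pool up on an SLA breach; scale it down only when the observed
    latency is well under target (below headroom * target).
    """
    def step(replicas: int, observed: float, target: float) -> int:
        if observed > target:
            return replicas + 1
        if observed < headroom * target and replicas > 1:
            return replicas - 1
        return replicas

    return Pools(
        prefill=step(pools.prefill, ttft_ms, sla.ttft_ms),
        decode=step(pools.decode, itl_ms, sla.itl_ms),
    )


sla = Sla(ttft_ms=200.0, itl_ms=20.0)
after_spike = plan(Pools(prefill=2, decode=4), sla, ttft_ms=350.0, itl_ms=18.0)
after_lull = plan(Pools(prefill=3, decode=4), sla, ttft_ms=90.0, itl_ms=9.0)
print(after_spike, after_lull)
```

In the spike case only the prefill pool grows, because only TTFT breached its target; a utilization-based rule would have no way to tell which pool was responsible.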

6. Grove: Kubernetes-Native Gang Scheduling

Grove is Dynamo’s Kubernetes operator for topology-aware gang scheduling, designed specifically for NVLink-connected GPU architectures like the GB200 NVL72. When a distributed inference job requires multiple GPUs working together, placement matters: GPUs connected by NVLink exchange data far faster than those communicating across PCIe or the network fabric. Grove places workloads optimally across racks, hosts, and NUMA nodes to maximize NVLink utilization.
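Gang scheduling with a topology preference can be illustrated with a small placement function. This is a sketch under simplifying assumptions: the cluster is described only by free GPU counts per NVLink domain, and `gang_place` and the domain names are hypothetical. Grove's real scheduling considers much richer topology information.

```python
from typing import Optional


def gang_place(free_gpus: dict[str, int], gpus_needed: int) -> Optional[dict[str, int]]:
    """All-or-nothing placement that prefers a single NVLink domain.

    free_gpus maps a (hypothetical) NVLink domain name to its free GPU count.
    Returns {domain: gpus_taken}, or None if the whole gang cannot be placed.
    """
    # Best case: the smallest single domain that fits the whole gang.
    fitting = [d for d, free in free_gpus.items() if free >= gpus_needed]
    if fitting:
        best = min(fitting, key=lambda d: free_gpus[d])
        return {best: gpus_needed}

    # Gang scheduling: never start a partial group.
    if sum(free_gpus.values()) < gpus_needed:
        return None

    # Otherwise span as few domains as possible, largest first.
    placement, remaining = {}, gpus_needed
    for domain in sorted(free_gpus, key=free_gpus.get, reverse=True):
        take = min(free_gpus[domain], remaining)
        if take:
            placement[domain] = take
            remaining -= take
        if remaining == 0:
            break
    return placement


cluster = {"rack-1": 8, "rack-2": 16, "rack-3": 4}
single = gang_place(cluster, 8)     # fits entirely inside one domain
spanning = gang_place(cluster, 20)  # has to cross a domain boundary
too_big = gang_place(cluster, 64)   # cannot be placed as a whole
print(single, spanning, too_big)
```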

7. AIConfigurator: Zero-Config Deployment

Introduced in Dynamo 1.0, AIConfigurator is a deployment optimizer that simulates over 10,000 deployment configurations in seconds to find the optimal serving configuration for a given model, hardware target, and SLA. Combined with the Planner, it enables what Dynamo calls “zero-config deploy” (currently in beta as DGDR, the DynamoGraphDeploymentRequest resource): you specify the model, hardware, and SLA in a single YAML manifest, and Dynamo auto-profiles the workload, optimizes the topology, and deploys.

    # Zero-config deploy: specify model + SLA, Dynamo handles the rest
    apiVersion: nvidia.com/v1beta1
    kind: DynamoGraphDeploymentRequest
    metadata:
      name: my-model
    spec:
      model: Qwen/Qwen3-0.6B
      backend: vllm
      sla:
        ttft: 200.0  # ms, time to first token target
        itl: 20.0    # ms, inter-token latency target
      autoApply: true
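The search that sits behind a manifest like this can be pictured as enumerating candidate layouts, estimating latency for each, and keeping the cheapest one that meets the SLA. The sketch below shows only that shape. The `estimate` function is a made-up toy latency model, and `cheapest_config`, its parameter ranges, and all constants are assumptions for illustration; AIConfigurator's real performance models are not reproduced here.

```python
from itertools import product


def estimate(tp: int, prefill: int, decode: int, rps: float) -> tuple[float, float]:
    """Toy latency model (purely illustrative, not a real performance model).

    Returns (ttft_ms, itl_ms): more tensor parallelism and more replicas lower latency.
    """
    ttft = 1000.0 / tp * (1.0 + rps / (10.0 * prefill))
    itl = 60.0 / tp * (1.0 + rps / (20.0 * decode))
    return ttft, itl


def cheapest_config(rps: float, ttft_target: float, itl_target: float):
    """Enumerate candidate layouts; keep the one meeting the SLA with the fewest GPUs."""
    best = None
    for tp, prefill, decode in product([1, 2, 4, 8], range(1, 9), range(1, 9)):
        ttft, itl = estimate(tp, prefill, decode, rps)
        if ttft > ttft_target or itl > itl_target:
            continue  # violates the SLA
        gpus = tp * (prefill + decode)
        if best is None or gpus < best[0]:
            best = (gpus, {"tp": tp, "prefill": prefill, "decode": decode})
    return best


# Same targets as the manifest above: TTFT 200 ms, ITL 20 ms.
result = cheapest_config(rps=20.0, ttft_target=200.0, itl_target=20.0)
print(result)
```

This loop covers 256 candidates; the point is only that once latency can be estimated cheaply, picking a configuration becomes a constrained search rather than manual tuning.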

Verified Performance Numbers

The following results are sourced directly from the Dynamo GitHub repository README and represent benchmarks published by NVIDIA and third-party organizations.

| Result | Context | Source |
|---|---|---|
| 7x higher throughput per GPU | DeepSeek-R1 on GB200 NVL72, using disaggregated serving combined with wide expert parallelism | InferenceX benchmark |
| 7x faster model startup | ModelExpress weight streaming and checkpoint restore, DeepSeek-V3 on H200 | NVIDIA / Dynamo team |
| 2x faster time-to-first-token | KV-aware routing, Qwen3-Coder 480B | Baseten benchmark |
| 80% fewer SLA breaches | Planner autoscaling at 5% lower TCO | Alibaba APSARA 2025 |
| 25x faster inference | DeepSeek-R1 on GB300 NVL72 vs. H200 | InferenceXv2 benchmark |

Important context

These numbers represent best-case benchmarks on specific hardware configurations, particularly NVIDIA’s latest Blackwell and GB200/GB300 NVLink systems. Real-world results will depend on your model, hardware generation, request mix, and workload characteristics. The Dynamo repository includes a benchmarking guide for evaluating different deployment topologies in your own environment.

Where ClearML Fits: The Operational and Security Layer

NVIDIA Dynamo solves the distributed inference coordination problem. What it doesn’t provide is the enterprise operational layer: the access control, multi-tenant governance, observability, lifecycle management, and security integration that production enterprise deployments require.

This is what ClearML’s integration with Dynamo delivers. ClearML acts as the control plane on top of Dynamo, providing the following.

  • Access Control and SSO
    Centralized access management with SSO integration. Teams and business units get scoped access to inference endpoints without manual credential management per Dynamo deployment.
  • Multi-Tenant Governance
    Deploy and manage Dynamo-powered inference in multi-tenant environments through the Platform Management Center. Each team or business unit operates within defined resource boundaries with full isolation at the platform level.
  • Observability Dashboard
    Unified visibility into inference workload utilization, model performance, and resource consumption across all Dynamo deployments, without requiring teams to instrument each deployment separately.
  • Infrastructure Abstraction
    Teams deploy models through a single interface without manually coordinating multi-node GPU communication, network configuration, or routing logic. ClearML handles the infrastructure coordination layer. Dynamo-powered inference workloads are deployed and managed as ClearML Apps, the same interface used for all managed AI workloads on the platform.
  • Infrastructure Autoscaling
    ClearML’s autoscaler provisions and terminates cloud GPU workers in response to queue demand, complementing Dynamo’s Planner. The two operate at different levels: the Planner manages request routing and replica sizing within a deployed inference cluster to meet latency SLAs, while ClearML’s autoscaler operates one level below, provisioning or terminating the GPU worker nodes that host those clusters based on queue depth and usage. The fleet scales up when demand arrives and spins down when idle, so you pay only for infrastructure that is actively needed.
  • Model Registry and Versioning
    ClearML’s model registry tracks which model version is deployed to which inference endpoint, with full lineage back to the training run that produced it. Teams can promote models to production and roll back to previous versions through the same interface used for all ClearML workloads.
  • Policy Alignment
    Inference workloads are deployed within organizational policies: resource quotas, access rules, and compliance requirements enforced at the platform level through ClearML Resource Policies, not per-deployment.

The practical effect is that platform teams can expose Dynamo’s distributed inference capabilities to developers without requiring those developers to understand or manage the underlying infrastructure. A data science team can deploy a multi-node Dynamo inference service the same way they’d launch any other ClearML workload, through the UI or API, with the ClearML Agent handling the underlying execution and the complex distributed coordination managed transparently.
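The division of labor between the two autoscalers described in the list above can be made concrete with a node-level sizing rule. This is a minimal sketch assuming a queue-driven policy with fleet limits; `desired_nodes` and its parameters are hypothetical and do not represent ClearML's autoscaler logic or configuration.

```python
import math


def desired_nodes(queued_jobs: int, running_jobs: int, jobs_per_node: int,
                  min_nodes: int = 0, max_nodes: int = 8) -> int:
    """Node count needed to host all queued and running work, within fleet limits."""
    needed = math.ceil((queued_jobs + running_jobs) / jobs_per_node)
    return max(min_nodes, min(max_nodes, needed))


idle = desired_nodes(queued_jobs=0, running_jobs=0, jobs_per_node=1)     # scale to zero
burst = desired_nodes(queued_jobs=5, running_jobs=2, jobs_per_node=1)    # grow with the queue
capped = desired_nodes(queued_jobs=30, running_jobs=2, jobs_per_node=1)  # limited by max_nodes
print(idle, burst, capped)
```

This rule only decides whether nodes exist. Deciding how many prefill and decode replicas run on those nodes to meet TTFT and ITL targets remains the Planner's job.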

Enterprise Use Cases

Agentic AI Applications Requiring High Throughput

Agentic workflows (AI systems that plan, reason, and call tools across multiple steps) generate a fundamentally different inference pattern than simple question-answering. A single user interaction may trigger dozens of model calls, many of which share substantial prompt context: the system prompt, tool definitions, and conversation history. Without KV-aware routing, each of those calls recomputes the shared prefix from scratch.

Dynamo’s KV-aware routing and disaggregated serving make it well-suited for agentic workloads at scale: shared context is cached and routed intelligently, prefill and decode are independently scaled to the actual demand profile, and the Planner ensures latency SLAs are maintained under variable load. ClearML provides the multi-tenant access control and observability to operate these deployments across teams in a production environment.

Multi-Tenant Enterprise Inference Platforms

Platform teams serving inference to multiple business units face a recurring challenge: how to provide each team with reliable, performant access to large models while maintaining security isolation and cost accountability. Running separate model deployments per team is expensive and operationally complex; running a shared deployment without governance creates security and quota enforcement problems.

ClearML’s integration with Dynamo addresses this by sitting at the governance layer above a shared Dynamo inference deployment. Platform teams configure resource boundaries, access controls, and routing policies through ClearML; business unit teams interact with inference endpoints within their allocated scope, without visibility into or interference with neighboring tenants’ workloads.

Hybrid and On-Premises Inference

Many enterprises operate AI workloads across on-premises GPU clusters and cloud environments simultaneously, often because training data governance requirements mandate on-premises processing, while burst workloads benefit from cloud elasticity. Dynamo’s architecture is infrastructure-agnostic at the inference engine level; ClearML provides the unified operational framework that manages inference workloads consistently across both environments from a single interface.

Very Large Model Serving (400B+ Parameters)

Models like DeepSeek-R1 (671B parameters) or Qwen3-Coder 480B cannot fit on a single GPU node. Serving them requires coordinated multi-node deployments where tensor parallelism and pipeline parallelism are carefully managed across GPU interconnects. Dynamo’s Grove operator handles the Kubernetes-level placement optimization for NVLink-connected systems; ClearML provides the deployment interface that abstracts this complexity from the teams launching inference workloads.

How the Integration Works in Practice

From a deployment workflow perspective, ClearML wraps the Dynamo deployment lifecycle. A platform team using the ClearML + Dynamo integration would typically follow these steps.

  1. Configure the Dynamo deployment parameters through ClearML: model selection, backend (SGLang, TensorRT-LLM, or vLLM), hardware targets, and the queue the workload will run on.
  2. Set access and resource policies through ClearML’s user and group management: define which teams can invoke the inference endpoint and apply resource quotas to govern GPU consumption across tenants.
  3. Deploy through the ClearML interface: ClearML handles the Kubernetes orchestration, Dynamo configuration, and multi-node coordination without requiring manual YAML management or kubectl operations per deployment.
  4. Monitor from the ClearML dashboard: inference utilization, request latency, GPU consumption, and model performance are visible centrally across all deployed Dynamo endpoints.
  5. Scale or update as needed through the same ClearML interface, without manual intervention in the underlying Dynamo infrastructure.

Backend support

Dynamo 1.0 supports three inference engine backends: SGLang, TensorRT-LLM, and vLLM. Feature availability varies by backend: disaggregated serving, KV-aware routing, and the SLA-based Planner are supported across all three; KVBM is currently available for TensorRT-LLM and vLLM. The full feature matrix is available in the Dynamo repository.
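The support matrix stated above is small enough to encode as a lookup, which can be handy as a pre-deployment check. The feature keys and the `unsupported` helper below are hypothetical names for this sketch; the support data itself mirrors only what the paragraph above states, so the Dynamo repository's feature matrix remains the authoritative reference.

```python
BACKENDS = {"sglang", "tensorrt-llm", "vllm"}

# Feature -> supporting backends, as stated in the paragraph above.
FEATURE_SUPPORT = {
    "disaggregated_serving": BACKENDS,
    "kv_aware_routing": BACKENDS,
    "sla_planner": BACKENDS,
    "kvbm": {"tensorrt-llm", "vllm"},
}


def unsupported(backend: str, features: list[str]) -> list[str]:
    """Return the requested features that the given backend does not support."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend}")
    return [f for f in features if backend not in FEATURE_SUPPORT[f]]


wanted = ["disaggregated_serving", "kv_aware_routing", "kvbm"]
sglang_gaps = unsupported("sglang", wanted)
vllm_gaps = unsupported("vllm", wanted)
print(sglang_gaps, vllm_gaps)
```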

What This Means for AI Infrastructure Teams

The combination of NVIDIA Dynamo and ClearML addresses a gap that has become increasingly visible as enterprises try to move large model inference from prototype to production: the infrastructure is complex, the operational requirements are demanding, and the two rarely come pre-integrated.

Dynamo’s contribution is a rigorous solution to the distributed inference coordination problem: disaggregation, KV-aware routing, multi-tier caching, and SLA-driven scaling, all built on a production-grade, open-source foundation with demonstrated results at scale.

ClearML’s contribution is the operational and governance layer that makes Dynamo deployable in organizations that have security requirements, multi-team environments, and the expectation that infrastructure should be managed through consistent, auditable workflows rather than per-deployment scripts.

“Managing distributed inference at scale is incredibly complex, but it doesn’t have to be. NVIDIA Dynamo delivers breakthrough performance improvements, and our platform ensures enterprises can securely deploy those capabilities in multi-tenant environments without the typical infrastructure and security headaches.”

–Moses Guttmann, Co-founder and CEO, ClearML

For platform teams, this means the capability to offer large-scale inference as an internal service, with the access controls, observability, and governance that service-level operations require, without building and maintaining that operational layer in-house.

Getting Started

Support for NVIDIA Dynamo is available now as part of the ClearML platform. Relevant resources:
