ClearML + NVIDIA Cosmos: ClearML Launches One Platform for NVIDIA Cosmos™ Deployment and the NVIDIA Video Search & Summarization Blueprint

March 16, 2026

ClearML’s out-of-the-box NVIDIA NIM integration brings NVIDIA Cosmos Reason 2 into production in minutes, providing the complete infrastructure, orchestration, vector database, and security stack to run the NVIDIA Video Search & Summarization Blueprint at enterprise scale.

NVIDIA Cosmos: Deployed Out of the Box with ClearML as NVIDIA NIM Microservices

NVIDIA Cosmos Reason 2 is an open, customizable reasoning Vision-Language Model (VLM) for physical AI — robots, autonomous vehicles, and industrial vision systems. Built on Qwen3-VL-8B-Instruct, it combines a Vision Transformer encoder with a dense LLM to deliver long chain-of-thought physical reasoning: understanding causality, trajectories, 3D space, and temporal sequences across a 256K-token context window (16× its predecessor).

Cosmos Reason 2-8B at a glance:

  • Parameters: 8B (2B variant also available)
  • Architecture: ViT encoder + dense Transformer (Qwen3-VL-8B-Instruct)
  • Context window: 256,000 input tokens
  • Key capabilities: chain-of-thought reasoning, temporal localization, 2D/3D localization
  • Minimum GPU: 1× NVIDIA L40S (also RTX 4500 Pro, RTX 6000 Pro, H100, H200)
  • Packaging: NVIDIA NIM microservice — vLLM-optimized, OpenAI-compatible API

Why ClearML — Not Raw Kubernetes

No Helm charts. No manual kubectl. No debugging container networking.

From model card to callable production endpoint in minutes. ClearML’s NVIDIA NIM Application ships with Cosmos Reason 2-8B pre-registered in its model catalog. Select the model, choose a GPU resource pool, and click Launch. ClearML provisions the NIM container, configures networking, and exposes a secured endpoint through its App Gateway.
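Once the endpoint is live, any OpenAI-compatible client can call it. Here is a minimal sketch in Python; the gateway URL, API key, model identifier, and image URL are placeholders for the values shown in the ClearML UI, not fixed names:

```python
# Minimal sketch of querying a deployed Cosmos Reason 2 NIM endpoint through
# the ClearML App Gateway. All endpoint values below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://app-gateway.example.com/v1",  # hypothetical gateway URL
    api_key="CLEARML_ISSUED_TOKEN",                 # auth enforced by the App Gateway
)

response = client.chat.completions.create(
    model="nvidia/cosmos-reason2-8b",  # placeholder model id from the catalog
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe the sequence of events in this frame and flag any safety risks."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/frames/cam07_0042.jpg"}},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)
```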

ClearML NIM Application — What You Get with Every Deployment

  • Catalog-based launch: Cosmos Reason 2-8B, Embedding NIM microservices, LLMs, ASR NIM microservices — all pre-validated
  • Infrastructure-agnostic: bare metal, VM, or Kubernetes — ClearML’s agent layer abstracts it all
  • Dynamic autoscaling: GPU pools expand and contract with queue depth — no idle cost, no latency spikes
  • Multi-engine support: vLLM, SGLang, llama.cpp — choose the right runtime per model
  • App Gateway: every NIM is exposed as an authenticated, RBAC-controlled endpoint — nothing is public by default
  • Observability: real-time GPU utilization, request throughput, and latency — built into the ClearML UI

The NVIDIA Video Search & Summarization Blueprint — on ClearML

NVIDIA’s Video Search & Summarization (VSS) AI Blueprint defines a reference architecture for ingesting live or archived video, captioning it with a VLM, indexing captions in a vector store, and answering natural-language queries against the resulting knowledge base. ClearML provides the production operating layer that turns this blueprint into a governed, monitored, and scalable business system.

[Figure: NVIDIA Video Search & Summarization architecture diagram]

The Cosmos Reason 2-8B VLM sits at the heart of the pipeline. Its 256K-token context window allows entire multi-event scenes to be understood holistically rather than truncated, producing richer captions that dramatically improve downstream retrieval precision. NVIDIA NeMo Guardrails filters invalid or out-of-scope prompts at the query layer, protecting system integrity and data privacy before any request reaches the LLM NIM.
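As a sketch of how that query-layer filtering might be wired, the snippet below wraps a user query with NeMo Guardrails before it is forwarded; the ./guardrails_config directory, its rail definitions, and the sample prompt are assumptions, not part of the blueprint itself:

```python
# Minimal sketch: screen incoming queries with NeMo Guardrails before they
# reach the LLM NIM. Assumes a ./guardrails_config directory containing a
# config.yml with an input self-check rail; all names here are illustrative.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./guardrails_config")
rails = LLMRails(config)

# An out-of-scope or malicious prompt is rejected by the input rail and never
# forwarded to the model behind the App Gateway.
reply = rails.generate(messages=[
    {"role": "user", "content": "Summarize all forklift near-misses on camera 7 today."}
])
print(reply["content"])
```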

Vector Databases: Milvus and Qdrant — Both Managed by ClearML

ClearML deploys and manages both Milvus (the VSS reference default) and Qdrant as first-class managed services — keeping the entire VSS dependency graph inside a single governed platform. Milvus delivers cloud-native horizontal scaling for tens of millions of video captions; Qdrant offers lower-latency filtered semantic search for queries that combine similarity with structured metadata (camera ID, time range, event type). Switching between them is a configuration change — no application code rewrites required. Both inherit full RBAC and network isolation automatically.
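To make that swap concrete, here is a minimal retrieval shim driven by a single config key; the hosts, collection name, and parameters are hypothetical stand-ins for the ClearML-managed services:

```python
# Illustrative sketch of a vector DB swap as a pure configuration change.
# Endpoints and the collection name are placeholders for the ClearML-managed
# Milvus and Qdrant services.
VECTOR_DB = "milvus"  # flip to "qdrant" with no application code changes

def search_captions(query_vector: list[float], top_k: int = 5):
    """Return the top-k caption matches for an embedded query."""
    if VECTOR_DB == "milvus":
        from pymilvus import MilvusClient
        client = MilvusClient(uri="http://milvus.internal:19530")
        return client.search(
            collection_name="video_captions", data=[query_vector], limit=top_k
        )
    from qdrant_client import QdrantClient
    client = QdrantClient(url="http://qdrant.internal:6333")
    return client.search(
        collection_name="video_captions", query_vector=query_vector, limit=top_k
    )
```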

Orchestration: ClearML Pipelines End to End

ClearML Pipelines encode the full ingestion and indexing flow — segmentation, VLM captioning, ASR transcription, embedding, and vector indexing — as a versioned, reusable DAG. GPU workers autoscale with pipeline queue depth. Execution history, step-level metrics, artifact lineage, and intermediate outputs are tracked automatically, giving teams full reproducibility and root-cause analysis without any extra tooling.
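A condensed sketch of such a DAG using ClearML’s pipeline SDK might look like this; the step bodies are stubs, and the project, queue, and URL names are illustrative:

```python
# Sketch of the VSS ingestion flow as a ClearML pipeline. Step bodies are
# stubbed; "gpu-pool", the project name, and the video URL are placeholders.
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["segments"], cache=True)
def segment_video(video_url: str):
    # Stub: split the source video into fixed-length chunks.
    return [f"{video_url}#chunk-{i}" for i in range(3)]

@PipelineDecorator.component(return_values=["captions"], execution_queue="gpu-pool")
def caption_segments(segments):
    # Stub: call the Cosmos Reason 2 NIM endpoint for each segment.
    return [f"caption for {s}" for s in segments]

@PipelineDecorator.component(return_values=["count"], execution_queue="gpu-pool")
def embed_and_index(captions):
    # Stub: embed each caption and upsert it into the managed vector DB.
    return len(captions)

@PipelineDecorator.pipeline(name="vss-ingestion", project="VSS", version="0.1")
def ingestion_pipeline(video_url: str):
    segments = segment_video(video_url)
    captions = caption_segments(segments)
    return embed_and_index(captions)

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # drop this line to enqueue on remote agents
    ingestion_pipeline(video_url="s3://bucket/cam-07/2026-03-16.mp4")
```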

End-to-End Security, Built Into Every Layer

  • Complete tenant isolation: separate SSO/IdP per tenant — projects, models, endpoints, and GPU queues are fully scoped
  • RBAC on assets AND compute: GPU queues are access-controlled objects — only authorized roles can submit workloads
  • Authenticated NIM endpoints: App Gateway enforces auth on every API call — unauthenticated requests never reach the model
  • SSO (SAML 2.0 / OIDC) + LDAP: user lifecycle managed centrally — revoke access in the IdP, access to ClearML disappears instantly
  • Audit trails: full log of who deployed what, when, with which parameters, and who called which endpoint
  • On-premise & air-gapped: full deployment within the customer network perimeter — zero external dependencies

Business Impact

Minutes to Production, Not Weeks.  Deploying VSS on raw infrastructure requires weeks of DevOps work. ClearML’s NIM Application, managed vector DB services, and pipelines compress that to hours — without a single line of infrastructure code.

Cost Efficiency Through Autoscaling.  ClearML scales GPU pools to zero during idle periods and back out precisely when demand returns. Variable video ingestion rates — common in industrial and surveillance workloads — are handled automatically, eliminating over-provisioning costs.

Security Without Compromise.  Sensitive video footage in healthcare, finance, or critical infrastructure demands zero-trust governance. ClearML’s multi-tenant isolation, RBAC, and audit trails deliver that compliance layer without slowing down the development team.

Open, Pluggable Architecture.  No inference engine lock-in (vLLM / SGLang / llama.cpp), no vector DB lock-in (Milvus / Qdrant / VAST), no cloud lock-in. As the AI landscape evolves, teams swap components without redesigning their operational infrastructure.
