Kubernetes has become the de facto substrate for enterprise AI infrastructure. Its ability to handle complex, long-running workloads, its self-healing capabilities, and its rich ecosystem of GPU operators, storage drivers, and networking tools make it the natural platform for organizations scaling AI beyond the lab. ClearML sits on top of Kubernetes, adding workload-level scheduling, fractional GPU support, and a full AI development layer to the base infrastructure, eliminating the risk that comes with hand-written, end-user YAML files.
As enterprises deploy AI across more teams and business units, a fundamental tension emerges. To build the stacks they need, AI teams often require deep Kubernetes access, e.g., custom operators, specific CRD versions, the ability to create and delete namespaces freely. Granting that level of access on a shared cluster is a serious security risk. The traditional answer has been to provision a separate cluster per team, but that model doesn’t scale: it fragments visibility, wastes compute through idle isolation, and multiplies infrastructure management overhead.
ClearML’s new integration with SUSE Rancher Prime’s k3k resolves this tension directly. By spinning up lightweight virtual Kubernetes clusters inside the host cluster, organizations can give each team genuine cluster-admin access to their own isolated environment, without ever touching the physical infrastructure beneath it.
The Admin Access Paradox
In a traditional shared Kubernetes cluster, granting cluster-admin permissions to a team is a significant security decision. A single misconfiguration, or a compromised workload, could expose the physical nodes, reach sensitive volumes, or disrupt neighboring tenants. So IT typically withholds that level of access and manages environment changes centrally.
The problem is that many AI workloads have legitimate reasons to need it:
- Installing particular versions of operators, e.g., a distributed training operator or the NVIDIA GPU Operator, with non-default configurations that differ from what is deployed cluster-wide.
- Deploying Custom Resource Definitions (CRDs) that aren’t available cluster-wide.
- Creating, modifying, or destroying namespaces and network policies on their own schedule, without waiting on IT tickets.
The outcome is predictable: either IT becomes a bottleneck (every environment change takes days or weeks), or teams find workarounds that quietly erode the security model. Neither is acceptable at enterprise scale. RBAC, while essential, only controls who can do what within a shared cluster. It cannot replicate the full isolation of dedicated infrastructure.
Virtual Clusters: The Architecture That Removes the Trade-off
K3k (Kubernetes-in-Kubernetes) is an open source project from SUSE that creates lightweight nested Kubernetes clusters: “child” clusters that run as pods on a “parent” host cluster. Each child cluster has its own API server, its own control plane, and its own isolated namespace in the parent cluster. To the team using it, the experience is indistinguishable from having a dedicated bare-metal cluster. To the host cluster, it is just a set of pods running in a namespace.
ClearML’s integration automates the full lifecycle of these virtual clusters (provisioning, resource allocation, scaling, and teardown) through a clean interface that does not require teams to manage k3k directly.
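As a rough illustration of what happens under the hood, the sketch below creates a k3k virtual cluster as a namespaced custom object in the parent cluster using the official Kubernetes Python client. The k3k API group, version, and spec fields shown here are assumptions for the example; in practice ClearML issues the equivalent requests on your behalf.

```python
# Minimal sketch: create a k3k virtual cluster as a namespaced custom object
# in the parent (host) cluster. Requires: pip install kubernetes
# NOTE: the k3k group/version/plural and spec fields below are assumptions
# made for illustration; ClearML's integration performs the equivalent calls.
from kubernetes import client, config

config.load_kube_config()  # kubeconfig for the PARENT cluster (IT-owned)

virtual_cluster = {
    "apiVersion": "k3k.io/v1alpha1",   # assumed CRD group/version
    "kind": "Cluster",
    "metadata": {"name": "team-alpha", "namespace": "tenant-team-alpha"},
    "spec": {                          # assumed spec fields
        "mode": "virtual",             # fully encapsulated virtual mode
        "servers": 1,                  # control-plane pods for the child cluster
        "agents": 2,                   # "node" pods that run tenant workloads
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="k3k.io",
    version="v1alpha1",
    namespace="tenant-team-alpha",
    plural="clusters",
    body=virtual_cluster,
)
# To the host cluster, the result is just pods in tenant-team-alpha;
# to the team, it is a full Kubernetes cluster with its own API server.
```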
A clean split between management and execution
The security model is built around two distinct layers that never overlap:
- The management layer (parent cluster) is owned by IT. It controls the physical nodes, actual storage drivers (CSI), physical networking (CNI), GPU access, and resource quotas. ClearML’s infrastructure control plane operates here, providing governance, scheduling, and cross-tenant visibility.
- The virtual layer (child cluster) is handed to the team. They are cluster-admin here, free to install operators, define CRDs, and manage namespaces as they see fit. But their “nodes” are actually pods in the parent cluster. They cannot see the physical infrastructure beneath them, cannot reach other tenants’ virtual clusters, and cannot access the host VPC or management plane.
If a pod inside a virtual cluster is compromised, the attacker is contained within the cluster’s containerized API server boundary: no lateral movement to the physical host, no visibility into neighboring tenants.
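To make the IT-owned side of this split concrete, a ResourceQuota on the parent-cluster namespace that hosts a tenant’s virtual cluster caps everything that tenant can consume, no matter what the team installs inside. A minimal sketch, with illustrative namespace names and limits:

```python
# Sketch: an IT-applied ResourceQuota on the parent-cluster namespace that
# hosts one tenant's virtual cluster. All of the tenant's "nodes" are pods
# here, so these limits bound the whole virtual cluster. Names and values
# are illustrative.
from kubernetes import client, config

config.load_kube_config()  # parent-cluster credentials, held by IT only

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="team-alpha-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "64",
            "requests.memory": "256Gi",
            "requests.nvidia.com/gpu": "8",  # hard cap on GPUs for this tenant
            "pods": "200",
        }
    ),
)

client.CoreV1Api().create_namespaced_resource_quota(
    namespace="tenant-team-alpha", body=quota
)
```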
One mode, fully controlled
ClearML’s integration with k3k uses virtual mode exclusively, the more secure of k3k’s two deployment options. Each tenant’s workloads run in their own fully encapsulated virtual cluster with its own API server, completely isolated from neighboring tenants at every layer.
Resource allocation is controlled by the parent cluster administrator, who can assign more or less compute to a given virtual cluster as needs change. This is a deliberate, managed operation rather than automatic elastic scaling, giving IT explicit control over how physical resources are distributed across tenants, without any resource sharing happening beneath the surface.
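Building on the quota sketch above, a reallocation is then a single, auditable change made with parent-cluster credentials; nothing inside the virtual cluster can perform it. The values are again illustrative:

```python
# Sketch: the parent-cluster admin deliberately grants a tenant more GPUs by
# patching the namespace quota; tenants cannot trigger this from inside
# their virtual cluster.
from kubernetes import client, config

config.load_kube_config()  # parent-cluster credentials

client.CoreV1Api().patch_namespaced_resource_quota(
    name="team-alpha-quota",
    namespace="tenant-team-alpha",
    body={"spec": {"hard": {"requests.nvidia.com/gpu": "12"}}},
)
```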

How It Works In Practice
From the team’s perspective, the experience is straightforward. From IT’s perspective, governance never breaks:
- A team requests a virtual lab through the ClearML interface, choosing a queue for the virtual cluster’s k3s to run on and a queue to route workload pods into.
- ClearML provisions an isolated virtual Kubernetes cluster within the SUSE Rancher Prime environment.
- The team receives a kubeconfig for their virtual cluster and full cluster-admin access within it. They install whatever operators and CRDs they need; to them, it feels like a dedicated cluster (see the sketch after this list).
- As workloads grow, ClearML autoscales the underlying k3k pods across additional physical nodes in the parent cluster. IT’s resource quotas set hard limits so no single tenant can exhaust the shared GPU pool.
- Cleanup is automated: full lifecycle logging, teardown workflows, and cost attribution are tracked through ClearML’s management console.
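Referring back to the kubeconfig hand-off in the third bullet, the tenant-side experience is plain Kubernetes. A minimal sketch, with an assumed kubeconfig path and names:

```python
# Sketch: everything here targets the tenant's VIRTUAL cluster only.
# The kubeconfig path and names are illustrative assumptions.
from kubernetes import client, config

# Load the kubeconfig issued for the virtual cluster (not the host cluster).
config.load_kube_config(config_file="team-alpha-virtual.kubeconfig")

v1 = client.CoreV1Api()

# Cluster-admin inside the virtual cluster: create namespaces freely...
v1.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name="training-experiments"))
)

# ...and inspect "nodes", which are actually pods on the parent cluster.
for node in v1.list_node().items:
    print(node.metadata.name, node.status.node_info.kubelet_version)
```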
Summary: what each layer controls
| Capability | Team / Tenant Admin | IT / Host Cluster Admin |
|---|---|---|
| Permissions | Full cluster-admin in virtual env | Controls access to host nodes and kubelet |
| Networking | Custom Ingress and Services within tenant limits | Traffic governed by host CNI and predefined access rules |
| Storage | Pre-mounted virtual volumes as needed | Virtual volumes backed by host-managed PVCs |
| Deployment | One-click via ClearML UI or API | ClearML manages provisioning, teardown, and audit logging |
| Scaling | Cluster grows on demand with workload | IT resource quotas enforce hard limits on GPU and CPU |
Who This Is For
This architecture addresses a specific gap that conventional multi-tenancy approaches don’t fill: teams that need genuine Kubernetes autonomy, operating within an organization that can’t afford to give up infrastructure control. Here are three scenarios where this matters most:
- Enterprises with strong BU isolation requirements: organizations where different business units handle data subject to different regulatory regimes, or where the security model requires that one team simply cannot see another’s workloads at any layer. Namespace-level RBAC cannot guarantee this; virtual clusters can.
- Research and advanced development teams: teams experimenting with custom training stacks, new operators, or infrastructure configurations that aren’t standardized across the organization. They need a sandbox with real cluster control, but the organization needs to ensure experiments can’t escape into production infrastructure.
- Cloud service providers and managed AI platforms: CSPs offering AI infrastructure to multiple enterprise clients from shared physical hardware. Each client gets what feels like a dedicated Kubernetes cluster; the provider maintains centralized governance, billing visibility, and compute efficiency across the entire fleet.
Availability
ClearML’s integration with k3k is available now for enterprise customers on SUSE Rancher Prime. Organizations can review the joint technical reference documentation or request a demo at clear.ml/demo.