By Adam Wolf
Role-based access control is essential, but it’s not isolation. When multiple AI teams share a Kubernetes cluster, RBAC controls what they can do; it doesn’t control what they can reach, what they can see, or what happens when something goes wrong in a neighboring workload.
This is the first post in our four-part series on Kubernetes Security for Enterprise AI Environments. The series covers tenant isolation, secrets and credential management, GPU resource governance, and production model serving security.
The Multi-Tenancy Challenge in Enterprise AI
Kubernetes was designed for running applications efficiently on shared infrastructure. That efficiency comes from a fundamental architectural assumption: the cluster is a shared resource, and all workloads on it, regardless of namespace or access policy, operate within the same shared control plane.
For most application workloads, this is a reasonable trade-off. For enterprise AI workloads, it creates a category of security risk that access controls alone cannot resolve.
Consider what a typical enterprise AI platform team is managing on a shared Kubernetes cluster inside a large bank: a retail banking team building credit risk models, a markets team running pricing and forecasting workloads, a fraud team training on transaction data, and an internal operations team fine-tuning LLMs over support and compliance documents. Each business unit operates under different internal controls, and the applicable regulatory and compliance frameworks (PCI DSS, SOC 2, GDPR) typically require that data processed by one team is not accessible to another, even accidentally.
The standard response to this requirement is Kubernetes RBAC: define roles, bind them to service accounts, and scope permissions to namespaces. It’s the right starting point.
What RBAC Actually Controls (and What It Doesn’t)
Kubernetes RBAC (Role-Based Access Control) is an authorization mechanism that controls which API operations a given identity can perform on which resources. A Role (namespace-scoped) or ClusterRole (cluster-wide) defines a set of permissions. A RoleBinding or ClusterRoleBinding assigns those permissions to a user, group, or service account.
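As a minimal sketch of the objects described above, the following Python builds a namespace-scoped Role and a matching RoleBinding as plain dictionaries and prints them as a JSON List, a format kubectl apply -f accepts alongside YAML. The namespace, role, and service account names are hypothetical placeholders, not values from any real cluster.

```python
import json


def namespaced_read_only_role(namespace: str, role_name: str) -> dict:
    """Build a Role that grants read-only access to pods and ConfigMaps in one namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # "" denotes the core API group
                "resources": ["pods", "configmaps"],
                "verbs": ["get", "list", "watch"],  # no wildcards, no write verbs
            }
        ],
    }


def bind_role_to_service_account(namespace: str, role_name: str, service_account: str) -> dict:
    """Build a RoleBinding that assigns the Role to a service account in the same namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-binding", "namespace": namespace},
        "subjects": [
            {"kind": "ServiceAccount", "name": service_account, "namespace": namespace}
        ],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": role_name,
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    manifests = {
        "apiVersion": "v1",
        "kind": "List",
        "items": [
            namespaced_read_only_role("team-a", "pod-reader"),
            bind_role_to_service_account("team-a", "pod-reader", "trainer"),
        ],
    }
    print(json.dumps(manifests, indent=2))
```

Both objects live in the tenant's namespace, which is what keeps the grant from reaching other teams at the API layer.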
A well-configured RBAC policy can prevent a member of Team A from listing pods in Team B’s namespace, deleting Team B’s ConfigMaps, or reading Team B’s Secrets. These are meaningful controls. But RBAC operates at the Kubernetes API layer. Below the API layer, several significant blind spots remain regardless of RBAC configuration.
Shared API Server
Every tenant in a shared Kubernetes cluster communicates with the same API server. That API server has a single etcd backend that stores all cluster state, including Secrets. If the API server has a vulnerability, or if an attacker escapes a container and obtains credentials belonging to the node or to other workloads, the RBAC bindings scoped to the original tenant no longer constrain them. There is no architectural separation between tenants at the control plane level.
Node-Level Exposure
Kubernetes schedules pods onto nodes. Without explicit node affinity rules and taints/tolerations, pods from different tenants land on the same physical or virtual machines. A container escape vulnerability (CVE-2019-5736 in runc, CVE-2022-0185 in the Linux kernel, and others) can give an attacker access to the host node’s filesystem, process table, and network interfaces, from which all other workloads on that node are visible. RBAC has no bearing on what happens after a container escape.
Network Reachability
By default, Kubernetes networking allows any pod to communicate with any other pod in the cluster across namespaces. Without explicit NetworkPolicy resources (which most clusters leave unconfigured), a workload in one team’s namespace can make direct network calls to endpoints in another team’s namespace. For AI workloads that expose internal APIs, model serving endpoints, or vector database interfaces, this is a real lateral movement path: an attacker who compromises one team’s pod can probe and potentially exfiltrate data from another team’s services without ever touching the Kubernetes API or triggering RBAC controls.
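Two of the gaps above have standard Kubernetes mitigations, sketched here in Python as plain manifest dictionaries. The first function builds a default-deny NetworkPolicy for a namespace; the second adds a node selector and toleration to a pod spec so it schedules only onto nodes reserved for one tenant. The tenant label and taint key are hypothetical, and the sketch assumes the platform team has already labeled and tainted the nodes to match.

```python
def default_deny_policy(namespace: str) -> dict:
    """NetworkPolicy that selects every pod in the namespace and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector matches all pods in the namespace
            "policyTypes": ["Ingress", "Egress"],  # no rules listed, so both directions are denied
        },
    }


def pin_pod_to_tenant_nodes(pod_spec: dict, tenant: str) -> dict:
    """Return a copy of pod_spec restricted to nodes labeled and tainted for one tenant.

    Assumes nodes carry the label tenant=<tenant> and the taint
    tenant=<tenant>:NoSchedule (a hypothetical key chosen for this example).
    """
    pinned = dict(pod_spec)
    # The node selector keeps this pod off other tenants' nodes.
    pinned["nodeSelector"] = {**pod_spec.get("nodeSelector", {}), "tenant": tenant}
    # The toleration lets this pod onto nodes whose taint repels everyone else.
    pinned["tolerations"] = list(pod_spec.get("tolerations", [])) + [
        {"key": "tenant", "operator": "Equal", "value": tenant, "effect": "NoSchedule"}
    ]
    return pinned


if __name__ == "__main__":
    import json

    spec = {"containers": [{"name": "train", "image": "example/train:1"}]}
    print(json.dumps(default_deny_policy("team-fraud"), indent=2))
    print(json.dumps(pin_pod_to_tenant_nodes(spec, "fraud"), indent=2))
```

A NetworkPolicy only takes effect when the cluster's CNI plugin enforces it, and a taint by itself does not keep the tenant's pods off other nodes, which is why the node selector is paired with the toleration. Neither control changes the fact that all tenants still share one API server.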
The Limits of Namespace Isolation
Namespaces in Kubernetes are a logical scoping mechanism, not a security boundary. The Kubernetes documentation describes them in terms of naming and grouping, and is explicit that the scoping does not extend to cluster-wide objects:
From the Kubernetes documentation
Namespaces provide a mechanism for isolating groups of resources within a single cluster. Names of resources need to be unique within a namespace, but not across namespaces. Namespace-based scoping is applicable only for namespaced objects (e.g. Deployments, Services, etc.) and not for cluster-wide objects (e.g. StorageClass, Nodes, PersistentVolumes, etc.).
Cluster-scoped resources, including Nodes, PersistentVolumes, StorageClasses, ClusterRoles, and CustomResourceDefinitions, exist outside of namespace boundaries entirely. A tenant with the ability to enumerate or modify cluster-scoped resources has visibility that no namespace-level RBAC policy can restrict.
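To see which resource types fall outside namespace boundaries, the Kubernetes discovery API marks each resource with a boolean namespaced field (kubectl api-resources --namespaced=false lists the same information). The sketch below partitions a list of such entries; the sample list is hand-written for illustration and is not output from a real cluster.

```python
def split_by_scope(api_resources: list) -> tuple:
    """Partition discovery-style entries into (namespaced, cluster_scoped) name lists."""
    namespaced = sorted(r["name"] for r in api_resources if r["namespaced"])
    cluster_scoped = sorted(r["name"] for r in api_resources if not r["namespaced"])
    return namespaced, cluster_scoped


# Hand-written sample shaped like discovery API entries (illustrative only).
SAMPLE_RESOURCES = [
    {"name": "deployments", "kind": "Deployment", "namespaced": True},
    {"name": "services", "kind": "Service", "namespaced": True},
    {"name": "secrets", "kind": "Secret", "namespaced": True},
    {"name": "nodes", "kind": "Node", "namespaced": False},
    {"name": "persistentvolumes", "kind": "PersistentVolume", "namespaced": False},
    {"name": "storageclasses", "kind": "StorageClass", "namespaced": False},
    {"name": "clusterroles", "kind": "ClusterRole", "namespaced": False},
    {"name": "customresourcedefinitions", "kind": "CustomResourceDefinition", "namespaced": False},
]

if __name__ == "__main__":
    namespaced, cluster_scoped = split_by_scope(SAMPLE_RESOURCES)
    print("namespaced:    ", ", ".join(namespaced))
    print("cluster-scoped:", ", ".join(cluster_scoped))
```

Any permission a tenant holds on a resource in the second group applies cluster-wide, whatever namespace the tenant was assigned.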
Where AI Workloads Add Pressure
The risks above apply to any multi-tenant Kubernetes environment. But AI workloads have characteristics that amplify the exposure in specific ways.
Models and Data Carry Real Value
AI teams routinely work with proprietary model weights, fine-tuned checkpoints, and training datasets that represent significant investment and competitive advantage. In many industries, the underlying data carries legal protection (PHI, PII, financial records). In a shared cluster, a permissive pod security configuration or an overly broad service account can expose PersistentVolume mounts or Secret contents to workloads they were never intended to reach.
AI Workloads Frequently Need Elevated Access
This is where AI infrastructure gets particularly difficult to secure: the workloads that most need to run on shared infrastructure are often the ones that require cluster-scoped permissions to function. GPU and distributed training operators (the NVIDIA GPU Operator, custom controllers and CRDs for orchestrating multi-node jobs) typically install with ClusterRoles rather than namespace-scoped Roles, because they need to watch resources, manage device plugins, or create pods across the entire cluster. Granting a team the operator their workload legitimately needs often means installing a controller that, by design, reaches well beyond that team’s namespace.
The “Permission Creep” Response
When platform teams restrict access appropriately but can’t provision environments fast enough, AI teams route around the controls. They run workloads on ad hoc infrastructure, deploy models without going through the platform, or request service accounts with overly broad permissions because the narrow ones don’t cover their use case. The result is a security posture that looks strong in the RBAC configuration but is quietly porous in practice.
The core tension
Giving AI teams the Kubernetes access they need to build production stacks on a shared cluster creates security risk. Restricting access to what’s safe on a shared cluster makes platform teams a bottleneck and pushes teams toward workarounds. This is the problem that namespace-level RBAC cannot solve, because it’s an architectural problem, not a policy problem.
The Approaches That Don’t Fully Solve It
Before covering the architecture that does work, it is worth being precise about why the common alternatives fall short.
Tighter RBAC Policies
Better RBAC is always worth doing. The Kubernetes RBAC good practices guide is a useful reference: avoid wildcard permissions, prefer Roles over ClusterRoles, audit and remove unused bindings, and don’t use the default service account. These reduce the attack surface but do not address the shared API server, node-level, or network exposure vectors described above.
One Cluster Per Team
True cluster-level isolation, meaning a separate Kubernetes cluster provisioned for each team or business unit, does solve the isolation problem. It is also operationally expensive: each cluster requires its own control plane, its own upgrade cycle, its own monitoring stack, and its own GPU pool. GPU capacity cannot be shared across clusters, meaning idle capacity in one team’s cluster cannot serve another team’s workload. At scale, the management overhead and stranded compute cost are significant enough that most organizations cannot or do not want to sustain this model.
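The stranded-capacity effect can be made concrete with a small worked example. The demand figures below are entirely hypothetical: three teams whose GPU demand peaks at different times. With one cluster per team, each cluster must be sized for its own peak; with a shared pool, capacity only has to cover the peak of the combined demand.

```python
def gpus_needed_dedicated(demand: dict) -> int:
    """One cluster per team: each cluster is sized for that team's own peak."""
    return sum(max(series) for series in demand.values())


def gpus_needed_shared(demand: dict) -> int:
    """One shared pool: sized for the peak of total demand in any time slot."""
    return max(sum(slot) for slot in zip(*demand.values()))


# Hypothetical GPU demand per team across four time slots (illustrative only).
HYPOTHETICAL_DEMAND = {
    "credit-risk": [8, 2, 2, 4],
    "markets": [2, 8, 2, 2],
    "fraud": [2, 2, 8, 2],
}

if __name__ == "__main__":
    dedicated = gpus_needed_dedicated(HYPOTHETICAL_DEMAND)
    shared = gpus_needed_shared(HYPOTHETICAL_DEMAND)
    print(f"dedicated clusters: {dedicated} GPUs")
    print(f"shared pool:        {shared} GPUs")
    print(f"stranded capacity:  {dedicated - shared} GPUs")
```

With these made-up numbers the dedicated model needs 24 GPUs and the shared pool needs 12. Real savings depend on how strongly the teams' peaks overlap.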
Pod Security Standards and Admission Controllers
Kubernetes Pod Security Standards, enforced by the built-in Pod Security Admission controller that replaced PodSecurityPolicy (removed in Kubernetes 1.25), restrict what pods can do: running privileged or as root, using host namespaces, mounting sensitive host paths. These are important hardening controls. They reduce what a compromised workload can do but do not prevent a workload in one namespace from reaching network endpoints in another, or from exploiting vulnerabilities in the shared API server or underlying node.
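For reference, Pod Security Standards are applied per namespace through pod-security.kubernetes.io labels read by the Pod Security Admission controller. The sketch below builds a Namespace manifest that enforces one profile and warns and audits against a stricter one. The function name, its defaults, and the namespace name are choices made for this example.

```python
PSS_LEVELS = ("privileged", "baseline", "restricted")


def namespace_with_pod_security(
    name: str, enforce: str = "baseline", warn: str = "restricted", version: str = "latest"
) -> dict:
    """Build a Namespace manifest carrying Pod Security Admission labels."""
    for level in (enforce, warn):
        if level not in PSS_LEVELS:
            raise ValueError(f"unknown Pod Security Standards level: {level!r}")
    prefix = "pod-security.kubernetes.io"
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                f"{prefix}/enforce": enforce,          # violating pods are rejected
                f"{prefix}/enforce-version": version,  # which release's profile definition applies
                f"{prefix}/warn": warn,                # violations return a warning to the client
                f"{prefix}/audit": warn,               # violations are annotated in the audit log
            },
        },
    }


if __name__ == "__main__":
    import json

    # Hypothetical namespace name.
    print(json.dumps(namespace_with_pod_security("team-markets"), indent=2))
```

Enforcing baseline while warning on restricted is one common way to surface workloads that would fail the stricter profile before turning it on.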
The Architecture That Does Work: Kubernetes-in-Kubernetes
The approach that resolves the architectural tension, rather than working around it, is virtual cluster provisioning. Each tenant gets their own Kubernetes cluster, with its own API server and control plane. But instead of running on separate physical infrastructure, these “child” clusters run as workloads inside a shared “parent” cluster.
This is what SUSE’s k3k (Kubernetes-in-Kubernetes) implements. A k3k virtual cluster runs as a set of pods on the host cluster. It has its own API server, its own scheduler, its own controller manager. A team that receives a kubeconfig for their virtual cluster has full cluster-admin access within it. They can install operators, define CRDs, manage namespaces, and configure network policies, all without any of those actions having any effect on the host cluster or any other tenant’s virtual cluster.

The Security Boundary
The containment model works because the virtual cluster’s API server runs inside a container on the host cluster. If a pod inside the virtual cluster is compromised and attempts to reach the Kubernetes API, it reaches the virtual cluster’s API server, not the host cluster’s. At the API layer, there is no path from the virtual cluster’s pods to the host cluster’s control plane, other tenants’ namespaces, or the host node’s management interfaces.
The host cluster’s infrastructure layer remains under IT control: physical nodes, storage drivers (CSI), networking (CNI), and GPU resources are all provisioned and governed at the parent cluster level. The tenant sees virtual nodes (which are actually pods in the parent), virtual storage, and a virtual network, all backed by real resources that IT controls.
What Tenants Can Do
Within their virtual cluster, a team has genuine cluster-admin capability:
- Install any operator or CRD version their workload requires, including specific NVIDIA GPU Operator configurations that differ from the cluster-wide standard
- Create, modify, and delete namespaces on their own schedule without opening IT tickets
- Define custom network policies, Ingress resources, and service meshes scoped to their environment
- Run workloads that require elevated RBAC permissions without those permissions reaching the host cluster
What IT Controls
At the host cluster level, the platform team retains complete control:
- GPU and CPU resource allocation per virtual cluster, enforced at the host level, not dependent on tenant cooperation
- Network traffic between virtual clusters, governed by host CNI policies
- Storage access, via pre-authorized PersistentVolumeClaims that the tenant can use but cannot escalate beyond
- Full audit logging of all host-level events, regardless of what happens inside tenant virtual clusters
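Host-level resource limits of the kind listed above can be expressed with a standard Kubernetes ResourceQuota. The sketch below is generic Kubernetes, not a description of how k3k or ClearML apply quotas internally. It assumes each virtual cluster's pods run in a single host namespace (the name used here is hypothetical) and that GPUs are exposed under the nvidia.com/gpu resource name.

```python
def tenant_quota(host_namespace: str, gpus: int, cpu_cores: int, memory_gib: int) -> dict:
    """Build a ResourceQuota capping total GPU, CPU, and memory requests in one host namespace."""
    if min(gpus, cpu_cores, memory_gib) < 0:
        raise ValueError("quota values must be non-negative")
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "tenant-compute", "namespace": host_namespace},
        "spec": {
            "hard": {
                # Extended resources such as GPUs are quota'd via the "requests." prefix.
                "requests.nvidia.com/gpu": str(gpus),
                "requests.cpu": str(cpu_cores),
                "requests.memory": f"{memory_gib}Gi",
            }
        },
    }


if __name__ == "__main__":
    import json

    # Hypothetical host namespace and limits for one tenant's virtual cluster.
    print(json.dumps(tenant_quota("vc-team-fraud", gpus=8, cpu_cores=64, memory_gib=512), indent=2))
```

Because the quota object lives in the host cluster, a tenant with cluster-admin rights inside the virtual cluster has no API through which to edit it.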
How ClearML Operationalizes This Architecture
Virtual cluster provisioning solves the isolation problem architecturally. What it doesn’t solve on its own is the operational problem: provisioning clusters manually, managing GPU passthrough, configuring network access, and tracking resource consumption across dozens of virtual clusters is a significant infrastructure management burden.
ClearML’s integration with SUSE k3k on SUSE Rancher Prime / RKE2 automates this operational layer. From the ClearML interface, a platform administrator can:
- Provision a new virtual cluster for a team in minutes, with GPU passthrough and network configuration handled automatically
- Assign the virtual cluster to a ClearML queue, so AI workloads are routed into the correct isolated environment
- Set resource quotas through ClearML Resource Policies that are enforced at the host cluster level
- Monitor utilization and costs across all virtual clusters through the Platform Management Center
The team requesting the environment gets a kubeconfig and full admin access to their virtual cluster. They experience it as a dedicated cluster. The platform team experiences it as one entry in a centrally managed, quota-governed, auditable list of tenant environments.

| Capability | RBAC only | Separate cluster per team | ClearML + k3k virtual clusters |
|---|---|---|---|
| API server isolation | No: shared | Yes | Yes: own API server per tenant |
| Node-level isolation | No | Yes | Partial: host nodes shared, containment at pod boundary |
| Network isolation | Only with NetworkPolicy | Yes | Yes: host CNI governs inter-tenant traffic |
| Team cluster-admin access | No: security risk on shared cluster | Yes | Yes: scoped to virtual cluster |
| Shared GPU pool | Yes | No: fragmented per cluster | Yes: IT-governed quotas |
| Centralized governance | Partial | Fragmented, N clusters to manage | Yes: single ClearML control plane |
| Provisioning time | Fast | Days to weeks | Minutes via ClearML UI |
| Idle GPU cost | None | High: stranded per cluster | None: shared pool with quotas |
Practical Recommendations
Regardless of whether you adopt virtual clusters immediately, the following controls represent the baseline for any Kubernetes cluster running AI workloads with multiple teams or data governance requirements.
Do These First
- Audit your existing RBAC bindings. Remove wildcard permissions (verbs: ["*"]), unused ClusterRoleBindings, and overly broad service account permissions. Use kubectl auth can-i --list from a low-privilege context (for example, with --as=system:serviceaccount:<namespace>:<name>) to understand what each service account can actually reach.
- Apply NetworkPolicies. Start with a default-deny policy in each namespace and explicitly allow only the traffic your workloads require. Without NetworkPolicies, any pod can reach any other pod regardless of RBAC configuration.
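- Enable Pod Security Standards. At minimum, enforce the baseline profile across all namespaces. This blocks the most common container escape vectors: privileged containers, host namespaces, and sensitive host path mounts. Baseline does not stop containers from running as root; the restricted profile adds that requirement where workloads can tolerate it.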
- Never put credentials in environment variables or ConfigMaps. Use Kubernetes Secrets at minimum; use a secrets management solution (covered in the next post in this series) for anything production-critical.
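The first recommendation above can be partly automated. The sketch below scans Role and ClusterRole objects for rules containing a wildcard in apiGroups, resources, or verbs. It assumes the input is a list of objects shaped like the items array from kubectl get clusterroles -o json; the sample roles are invented for illustration.

```python
def find_wildcard_rules(roles: list) -> list:
    """Return (kind, name, rule_index, wildcard_fields) for every rule that uses "*"."""
    findings = []
    for role in roles:
        # Aggregated ClusterRoles may carry rules: null, hence the "or []".
        for index, rule in enumerate(role.get("rules") or []):
            wildcard_fields = [
                field
                for field in ("apiGroups", "resources", "verbs")
                if "*" in (rule.get(field) or [])
            ]
            if wildcard_fields:
                findings.append(
                    (role["kind"], role["metadata"]["name"], index, wildcard_fields)
                )
    return findings


# Invented sample objects (illustrative only).
SAMPLE_ROLES = [
    {
        "kind": "ClusterRole",
        "metadata": {"name": "training-operator"},
        "rules": [{"apiGroups": [""], "resources": ["*"], "verbs": ["*"]}],
    },
    {
        "kind": "Role",
        "metadata": {"name": "pod-reader", "namespace": "team-a"},
        "rules": [{"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list"]}],
    },
    {"kind": "ClusterRole", "metadata": {"name": "aggregated-view"}, "rules": None},
]

if __name__ == "__main__":
    for kind, name, index, fields in find_wildcard_rules(SAMPLE_ROLES):
        print(f"{kind}/{name} rule {index}: wildcard in {', '.join(fields)}")
```

A scan like this only flags candidates for review. Some system roles use wildcards legitimately, so each finding still needs a judgment call about whether the breadth is required.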
For True Multi-Tenant Isolation
If your organization has teams with different data governance requirements operating on the same cluster, RBAC hardening is necessary but not sufficient. The architectural requirement is separate API servers per tenant boundary. Virtual clusters via k3k on ClearML give you that separation without the cost and fragmentation of separate physical clusters.
Get in touch if you would like to discuss how ClearML can support your organization’s AI security requirements.