Securing Production Model Serving with ClearML’s AI Application Gateway

April 14, 2026

By Adam Wolf and Damian Erangey

When a model moves to production, the security requirements change. You are no longer protecting a development workflow; you are protecting a live API that accepts input from the outside world. This blog covers how ClearML’s AI Application Gateway handles routing, authentication, and access control for production endpoints, and what that means for IT directors responsible for the infrastructure behind them. It accompanies our Enterprise AI Infrastructure Security YouTube series. Watch the corresponding video here.

Production Is Different

Everything covered in the previous entries in this series (identity, RBAC, vaults, service accounts, compute governance) protects your development environment. That is where teams experiment, train, and iterate. The work is valuable, but it is contained: the blast radius of a misconfiguration is limited to your internal platform.

When you serve a model, you’re exposing an API endpoint to consumers: internal applications, partner integrations, and customer-facing products. That endpoint becomes a direct attack surface. It accepts input from the outside world, and it needs to answer for it.

For IT directors, production serving raises a specific set of questions: Who can access our endpoints? How do we authenticate API consumers? What happens if someone deploys an untested model? How do we know exactly what’s running in production and how it got there? These are not just security questions; they are also audit and compliance questions.

The AI Application Gateway

The AI Application Gateway is ClearML’s secure front door for production services. It sits between your deployed models and the outside world, and it handles several things that would otherwise require separate systems to build, configure, and maintain:

  • Routing: the gateway detects tasks that have registered for routing on your compute nodes and creates network routes to them. Both HTTPS and raw TCP routing are supported.
  • Authentication: every request requires a valid token. Invalid or expired tokens are rejected before they reach your model. There is no anonymous access through the gateway.
  • RBAC: access is controlled at the group level. Users not in the authorized group are rejected even if their token is valid.

Without the gateway, making a model externally accessible means manually configuring networking, SSL certificates, load balancers, and Kubernetes YAML, and that is just the networking. Authentication, RBAC, and observability are each a separate system on top. The gateway replaces all of it with ClearML’s security controls built in.
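To make the order of those checks concrete, here is a minimal sketch of the decision a gateway of this kind makes for each request. It is an illustrative model only, not ClearML code: the Token and Route types, the authorize function, and the HTTP-style status codes are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Token:
    user: str
    groups: set[str]       # groups the token's user belongs to
    expires_at: datetime
    revoked: bool = False


@dataclass
class Route:
    target: str               # hypothetical address of the serving task
    allowed_groups: set[str]  # groups permitted to reach this route


def authorize(token: Optional[Token], route: Optional[Route],
              now: datetime) -> tuple[int, str]:
    """Return an (HTTP-style status, reason) pair for one request."""
    # Authentication: no anonymous access, no expired or revoked tokens.
    if token is None or token.revoked or now >= token.expires_at:
        return 401, "invalid or expired token"
    # Routing: the request must match a route the gateway knows about.
    if route is None:
        return 404, "no route"
    # RBAC: a valid token is not enough without group membership.
    if not (token.groups & route.allowed_groups):
        return 403, "user not in an authorized group"
    return 200, f"forward to {route.target}"


now = datetime(2026, 4, 14, tzinfo=timezone.utc)
later = datetime(2026, 5, 14, tzinfo=timezone.utc)
route = Route("10.0.0.5:8000", {"ml-prod"})
print(authorize(Token("alice", {"ml-prod"}, later), route, now))
print(authorize(Token("bob", {"ml-dev"}, later), route, now))
```

The point of the ordering is that a rejected request never reaches the model: the token is checked first, and group membership is only evaluated for tokens that are already known to be valid.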

Deployment

The gateway is a separate component deployed alongside your ClearML Server, via Helm on Kubernetes or Docker Compose. Full installation details are in the deployment documentation. It’s namespace-scoped; it can only detect and route to tasks within its namespace, which matters for environment separation.

In practice, this means your production serving environment and your development environment each get their own gateway instance, with their own routing, their own static routes, and their own access tokens. A misconfiguration or token leak in the dev namespace cannot expose endpoints in production. The isolation is enforced at the infrastructure level.

The gateway authenticates back to the ClearML server using admin-level API credentials. Those credentials should be created with a dedicated service account, clearly labelled, and not shared with anything else. Once deployed, you can verify the gateway is functioning from Settings → Application Gateway → Routers: running a connectivity test launches a small task and confirms that the gateway can detect it and create a route to it.

Ephemeral Routes vs. Static Routes

The gateway supports two types of routes. Understanding the difference is important for deciding how to structure production access.

Ephemeral routes are created automatically when a deployment launches and the AI Gateway Route field is left blank. They are secured by token authentication and scoped at the project level, meaning access rules for that project apply. For development and testing, this is adequate. For production, they lack the per-endpoint RBAC, stable URLs, and load balancing that static routes provide.

Static routes are administrator-defined, persistent endpoints that decouple the external URL from the specific model instance behind it. They are configured in Settings → Application Gateway → Static Routes.

Static routes give you four things ephemeral routes do not:

  • A stable URL: the endpoint address doesn’t change when you redeploy, swap models, or scale up. Consumers point to one address regardless of what’s behind it.
  • Per-endpoint RBAC: you specify which ClearML groups can access this endpoint. Authenticated users outside those groups are rejected even with a valid token.
  • Lifecycle independence: the route persists even when the model is temporarily down for maintenance or redeployment. Consumers don’t need to update anything.
  • Load balancing with session affinity: when running multiple instances behind a route, the gateway distributes traffic across them while maintaining session context. Once a consumer’s first request is routed to a specific instance, all subsequent requests from that consumer continue going to the same instance.

Session affinity matters specifically for LLM serving. Inference engines like vLLM and SGLang maintain a KV cache of previous tokens, so routing the same consumer to the same instance keeps that cache warm and response times consistent. It also gives you predictable behavior for debugging: if a consumer reports an issue, you know which instance was serving them.
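The idea behind session affinity can be sketched in a few lines. This is a simplified illustration, not the gateway’s implementation: the StickyBalancer class is hypothetical, and both the use of round-robin for new consumers and the use of a plain consumer ID as the affinity key are assumptions, since the post does not say how the gateway makes either choice.

```python
from itertools import cycle


class StickyBalancer:
    """Round-robin for new consumers, pinned instance for returning ones."""

    def __init__(self, instances: list[str]) -> None:
        if not instances:
            raise ValueError("at least one instance is required")
        self._next_instance = cycle(instances)  # hands out instances in turn
        self._assigned: dict[str, str] = {}     # consumer id -> pinned instance

    def pick(self, consumer_id: str) -> str:
        # A consumer seen before keeps its instance, so that instance's
        # cache for the consumer stays warm.
        if consumer_id not in self._assigned:
            self._assigned[consumer_id] = next(self._next_instance)
        return self._assigned[consumer_id]


balancer = StickyBalancer(["instance-a", "instance-b"])
for consumer in ["app-1", "app-2", "app-1", "app-3", "app-2"]:
    print(consumer, "->", balancer.pick(consumer))
```

A real balancer would also need to handle an instance going away, for example by reassigning its consumers, which this sketch leaves out.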

Both URL-path and subdomain routes are supported. A URL-path route places your endpoint under a path on your gateway domain (e.g. /llm-inference). A subdomain route gives it a dedicated address (e.g. llm-inference.gateway.yourcompany.com). The choice depends on how you want to structure your external URLs.
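As a small illustration of the two addressing styles, the helper below builds the external URL for each. The function name route_url and the "path" and "subdomain" labels are made up for this example; the domain and route name are the placeholders used above.

```python
def route_url(gateway_domain: str, route_name: str, kind: str) -> str:
    """Build the external URL for a static route of the given kind."""
    if kind == "path":
        # URL-path route: the endpoint lives under a path on the gateway domain.
        return f"https://{gateway_domain}/{route_name}"
    if kind == "subdomain":
        # Subdomain route: the endpoint gets its own host name.
        return f"https://{route_name}.{gateway_domain}"
    raise ValueError(f"unknown route kind: {kind!r}")


print(route_url("gateway.yourcompany.com", "llm-inference", "path"))
print(route_url("gateway.yourcompany.com", "llm-inference", "subdomain"))
```

One practical difference worth noting: a subdomain route generally needs DNS and certificate coverage for the extra host name (for example a wildcard record), whereas a URL-path route reuses the gateway’s existing host.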

Internal vs. External Serving

Not all serving scenarios carry the same risk, and the security model should reflect the difference.

For internal serving, where consumers are ClearML users, you get defense in depth. The gateway checks that the token is valid and that the user belongs to an authorized group. That is two checks, authentication and then group-based authorization, before anything reaches the model. Project-level access rules also apply: only users with access to the deployment’s project can see or manage it.

For external serving, where consumers are applications, partner integrations, or customer-facing products, your consumers are not ClearML users. They don’t have group membership. The token becomes the sole gatekeeper.

That means token hygiene becomes critical for external endpoints: shorter expiration windows, tighter rotation schedules, and immediate revocation when a partner is offboarded or an integration is decommissioned. The infrastructure is identical in both cases. The same gateway, the same static routes, the same compute governance. What changes is the access model.

You can run both side by side: an internal endpoint for your data science team to test against, and an external endpoint for your production application, both behind governed routes on the same platform.

Token-Based Authentication

Access tokens are managed in Settings → Application Gateway → Access Tokens. Generating a token requires a label and an expiration period in days. The token is shown only once, at generation, and cannot be retrieved afterwards. The token table shows each token’s label, creation date, expiration date, the user or service account it grants access as, and who created it. Revoking a token is immediate: hover over the row and click revoke. Access is cut off instantly; the model does not need to restart, and the endpoint does not need to go down.
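The lifecycle described above (shown once, expires, revocable at any time) can be modeled with a small in-memory table. This is a sketch under stated assumptions, not a description of how ClearML stores tokens: the TokenTable class is hypothetical, and keeping only a SHA-256 digest of each token is simply one common way to make a token unrecoverable after it has been displayed.

```python
import hashlib
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TokenRecord:
    label: str
    created_at: datetime
    expires_at: datetime
    revoked: bool = False


class TokenTable:
    def __init__(self) -> None:
        # Keyed by digest, so the plain token is never stored (assumption).
        self._records: dict[str, TokenRecord] = {}

    @staticmethod
    def _digest(token: str) -> str:
        return hashlib.sha256(token.encode()).hexdigest()

    def generate(self, label: str, days: int, now: datetime) -> str:
        if days <= 0:
            raise ValueError("expiration period must be at least one day")
        token = secrets.token_urlsafe(32)
        self._records[self._digest(token)] = TokenRecord(
            label, now, now + timedelta(days=days))
        return token  # the only moment the plain token is available

    def revoke(self, label: str) -> None:
        # Revocation is a flag flip; nothing is restarted or redeployed.
        for record in self._records.values():
            if record.label == label:
                record.revoked = True

    def is_valid(self, token: str, now: datetime) -> bool:
        record = self._records.get(self._digest(token))
        return (record is not None and not record.revoked
                and now < record.expires_at)


now = datetime(2026, 4, 14, tzinfo=timezone.utc)
table = TokenTable()
token = table.generate("partner-a", days=30, now=now)
print(table.is_valid(token, now))   # valid right after generation
table.revoke("partner-a")
print(table.is_valid(token, now))   # rejected on the very next check
```

Because validity is evaluated on every request, revocation takes effect without touching the model or the endpoint, which is the property the token table in the UI relies on.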

Token expiration is a fundamental control. Token leakage is a matter of when, not if: at some point a token will be accidentally committed to a repository, shared in a support ticket, or left in a config file on a decommissioned machine. Expiration bounds the window of exposure. A token that expires in 30 days has a 30-day blast radius. A token that never expires has an unlimited one.

Practical expiration guidelines based on consumer type:

  • Internal applications: 30–90 days
  • External partners: 30 days
  • Testing and development: 7 days
  • One-time demos: 24 hours

Never create a token without an expiration date. Compliance frameworks including SOC 2 typically require credential expiration, so this is also a prerequisite for audit readiness, not just good practice.
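These guidelines are easy to encode as a policy check that runs before a token is requested. The sketch below is one possible way to do that; the consumer-type keys and the expiry_for helper are hypothetical names, and the limits are the upper bounds from the list above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Upper bounds taken from the guidelines above.
MAX_LIFETIME = {
    "internal_application": timedelta(days=90),
    "external_partner": timedelta(days=30),
    "testing": timedelta(days=7),
    "demo": timedelta(hours=24),
}


def expiry_for(consumer_type: str, created_at: datetime,
               requested: Optional[timedelta] = None) -> datetime:
    """Return the expiry time, capping any request at the policy limit."""
    limit = MAX_LIFETIME[consumer_type]  # unknown types raise KeyError
    lifetime = limit if requested is None else min(requested, limit)
    if lifetime <= timedelta(0):
        raise ValueError("a token must have a positive lifetime")
    return created_at + lifetime


created = datetime(2026, 4, 14, tzinfo=timezone.utc)
# A partner asks for a one-year token; the policy caps it at 30 days.
print(expiry_for("external_partner", created, timedelta(days=365)))
```

There is deliberately no "never expires" option in the table: every code path returns a concrete expiry time, which is the property an auditor will look for.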

Deploying a Model

ClearML provides purpose-built deployment applications for serving LLMs and other model types: vLLM Model Deployment, SGLang Model Deployment, Llama.cpp, Embedding Model Deployment, and NVIDIA NIM, all available under the Enterprise plan. Each deployment app presents the same set of security-relevant fields at launch time:

  • Service Project: ties the deployment to a ClearML project. Project-level access rules apply: only users with access to that project can see or manage the deployment.
  • Queue: determines which compute the deployment runs on, governed by the resource policies configured in the previous post. The deployment inherits the compute governance layer from queue assignment.
  • AI Gateway Route: selecting a pre-created static route connects the deployment to a governed, RBAC-controlled, stable endpoint. Leaving this field blank creates an ephemeral route with project-level scoping only.

From a security perspective, these three fields are the ones that matter most before the model even starts: project access, compute governance, and gateway routing. The model configuration follows, but the security posture is set by those selections.
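A team could capture those three selections in a simple pre-launch review. The sketch below is hypothetical: DeploymentRequest and review are not part of ClearML, and the field names only mirror the labels in the deployment form. It flags the one choice that silently changes the security posture, a blank AI Gateway Route.

```python
from dataclasses import dataclass


@dataclass
class DeploymentRequest:
    service_project: str
    queue: str
    ai_gateway_route: str = ""  # blank means an ephemeral route is created


def review(request: DeploymentRequest, production: bool) -> list[str]:
    """Return a list of findings; an empty list means nothing to flag."""
    findings = []
    if not request.service_project.strip():
        findings.append("BLOCK: no service project, so no project-level access rules")
    if not request.queue.strip():
        findings.append("BLOCK: no queue, so no compute governance")
    if not request.ai_gateway_route.strip():
        # Acceptable for development and testing, not for production.
        severity = "BLOCK" if production else "NOTE"
        findings.append(f"{severity}: blank AI Gateway Route, "
                        "ephemeral route with project-level scoping only")
    return findings


request = DeploymentRequest(service_project="prod-serving", queue="gpu-prod")
for finding in review(request, production=True):
    print(finding)
```

The project, queue, and route names used here are placeholders. The useful part is the rule itself: in production, a deployment without a pre-created static route should not launch.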

The Gateway Covers More Than Models

The gateway isn’t only for model endpoints. Every ClearML application that needs external access goes through the same gateway: the same token authentication, the same RBAC, the same SSL-secured routing, the same static routes, regardless of what’s behind the endpoint.

This includes:

  • Remote development environments: VS Code, JupyterLab, SSH Sessions
  • Model UIs: Gradio Launcher, Streamlit Launcher, LLM UI
  • Containerized Application Launcher for custom containers
  • NVIDIA NIM containers
  • Vector database sessions: Milvus, Qdrant

The security model is the same for all of them. What changes is the workload, not the governance. That means an organization does not need to build separate access control systems for each application type: the same token management, RBAC configuration, and static routing apply across the board.

Observability: The Model Endpoints Dashboard

The Model Endpoints dashboard provides a live view of every active endpoint: uptime, total request count, average requests per minute, and average latency. Clicking into an endpoint opens the Monitor tab, which shows request volume over time broken down by token, so you can see not just that an endpoint is live, but how much traffic it is handling and which consumers are responsible for it.

This is the visibility layer that makes the security controls actionable. If a token is generating unusual traffic patterns, you’ll see it here and can revoke immediately. If an endpoint is idle and shouldn’t be, you’ll see that too. Governance without observability is incomplete; you need to be able to see what’s happening in order to act on it.
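As an example of acting on per-token traffic data, the sketch below flags tokens whose latest request count is far above their own recent baseline. It assumes you have exported per-token request counts per interval from your monitoring; the unusual_tokens function, the sample data, and the threshold factor of 3 are all illustrative choices, not something the dashboard provides.

```python
from statistics import median


def unusual_tokens(history: dict[str, list[int]],
                   factor: float = 3.0) -> list[str]:
    """Flag tokens whose latest count exceeds factor x their median baseline."""
    flagged = []
    for label, counts in history.items():
        if len(counts) < 2:
            continue  # not enough data to form a baseline
        *past, latest = counts
        # Floor the baseline at 1 so mostly idle tokens are not flagged
        # for a handful of requests.
        baseline = max(median(past), 1)
        if latest > factor * baseline:
            flagged.append(label)
    return sorted(flagged)


# Hypothetical requests-per-interval by token label, oldest first.
history = {
    "partner-a": [100, 110, 95, 105, 900],
    "internal-dashboard": [50, 55, 60, 52, 58],
}
print(unusual_tokens(history))
```

A flagged token is a prompt to investigate, not proof of compromise. The combination that matters is the one described above: per-token visibility plus immediate revocation.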

Five Layers, One Endpoint

Production model serving is where the security investment made across this series converges. At the moment a consumer hits your endpoint, every layer is active:

And then the gateway: token authentication, group-based RBAC, a stable URL, SSL, and full request observability. No single layer does everything, but together they give you a governed endpoint that is hard to access without authorization, visible when it is, and auditable after the fact.

Closing

The AI Application Gateway is the point where ClearML’s security model meets the outside world. Static routes give you stable, governed endpoints. Token expiration bounds exposure when things go wrong. RBAC ensures that even valid tokens can’t reach endpoints they’re not authorized for. And the Model Endpoints dashboard keeps you informed about what’s running and how it’s being used.

In the next entry in this series, we will look at audit trails and monitoring: how you know all of this is working, and how you answer the auditor’s question of what was running, who had access to it, and what they did with it.

Learn More

Find the full Enterprise AI Security video series on YouTube. Get in touch if you would like to discuss how ClearML can support your organization’s model serving and production AI security requirements.
