ClearML Announces AI Infrastructure Control Plane

August 14, 2024

We are excited to announce the launch of our AI Infrastructure Control Plane, designed as a universal operating system for AI infrastructure. With this launch, we make it easier for IT and DevOps teams to gain ultimate control over their AI infrastructure, manage complex environments, maximize compute utilization, and deliver an optimized self-serve experience for their AI Builders. The AI Infrastructure Control Plane improves cluster management, supports high-performance computing, and provides management centers and dashboards that give you greater visibility into utilization. Customers can now fully utilize GPUs, down to a fraction of a GPU, while closely controlling performance and costs.

Scale and Enterprise customers can now benefit from greater control, transparency, and efficiency when managing, scheduling, and orchestrating GPU compute resources. It does not matter whether your environments are on-prem (even air-gapped), in a single cloud, across multiple clouds, or hybrid – we support them all. The AI Infrastructure Control Plane is part of ClearML’s AI Platform for streamlining AI adoption throughout the entire development lifecycle, from exploration to production. The AI Platform offers unmatched flexibility and control for teams building, training, and deploying models at every scale on any AI infrastructure.

Get More Value out of the Compute you Already Have

AI Builders and IT leaders use ClearML’s AI Infrastructure Control Plane to simplify the management of AI infrastructure through a single system, regardless of how large or complex it is, or where the compute resides (on-prem, in the cloud, or hybrid), running Kubernetes or bare metal. Secure multi-tenancy is one way our AI Infrastructure Control Plane helps large organizations increase utilization of shared compute. By creating independent, siloed networks for individual AI stakeholders, compute allocations can be closely managed behind the scenes, without causing downtime or disrupting running AI workloads. You can easily shift compute allocations between tenants based on demand, improving overall utilization and reducing idle time.
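As a rough illustration of this kind of demand-based rebalancing, the following Python sketch shifts idle quota from under-utilized tenants to tenants with pending demand. It is hypothetical logic for illustration only, not ClearML's implementation; the function name and data shapes are assumptions.

```python
# Illustrative sketch (not the ClearML API): rebalance GPU quota between
# isolated tenants based on their current demand.

def rebalance(quotas, demand, total_gpus):
    """Shift unused quota from under-utilized tenants to tenants with
    pending demand, without exceeding the shared GPU pool."""
    # Start each tenant at min(quota, demand) so idle GPUs are freed up.
    alloc = {t: min(quotas[t], demand.get(t, 0)) for t in quotas}
    spare = max(0, total_gpus - sum(alloc.values()))
    # Hand spare GPUs to the tenants with the largest unmet demand first.
    for t in sorted(quotas, key=lambda x: demand.get(x, 0) - alloc[x], reverse=True):
        extra = min(spare, demand.get(t, 0) - alloc[t])
        alloc[t] += extra
        spare -= extra
    return alloc
```

With two tenants sharing 8 GPUs, a tenant using only 1 of its 4 guaranteed GPUs effectively lends the rest to a busier neighbor until demand shifts back.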

Your teams can now also enforce robust resource allocation policies that control user and tenant access to individual resources, with rules and logic supporting hierarchies, quotas, and over-quota usage, so expensive machines don’t sit idle. The compute itself can be split using dynamic fractional GPUs, allowing multiple right-sized workloads to run on a single chip. Our customers have successfully used these levers to drive AI throughput up to 10x.
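To make the fractional-GPU idea concrete, here is a minimal first-fit packing sketch in Python. It is an illustrative model only, not ClearML internals: each workload requests a fraction of a GPU, and the scheduler places it on the first physical GPU with enough free capacity.

```python
# Illustrative sketch (assumptions, not ClearML internals): first-fit
# packing of fractional-GPU workloads onto physical GPUs.

def pack(workloads, gpu_count):
    """Place each workload (a GPU fraction, e.g. 0.25) on the first GPU
    with enough free capacity. Returns (workload, gpu) pairs, or None
    if some workload cannot fit anywhere."""
    free = [1.0] * gpu_count          # each GPU starts fully available
    placement = []
    for i, frac in enumerate(workloads):
        for g in range(gpu_count):
            if free[g] + 1e-9 >= frac:  # tolerance for float rounding
                free[g] -= frac
                placement.append((i, g))
                break
        else:
            return None               # no GPU can fit this workload
    return placement
```

For example, four right-sized workloads of 0.5, 0.5, 0.25, and 0.75 of a GPU fit on just two physical GPUs with zero waste.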

As an integral part of the ClearML AI Platform, the AI Infrastructure Control Plane enables AI Builders to self-serve compute and access their approved resources for scheduling and running workloads. This also works for organizations using HPC tools such as Slurm and PBS, which benefit not only from running more workloads on their clusters, but also from the additional visibility ClearML provides into queues and job statuses through our orchestration dashboards.

ClearML is truly software- and hardware-agnostic. While your infrastructure today may be all on-prem, or within a single cloud, or using only NVIDIA chips, ClearML gives your teams the flexibility to change what they buy tomorrow. 

Control Cloud Costs with Efficient Cloud Management

Cloud computing is already expensive, and increasing demand will only drive up hourly rates. The good news: ClearML’s AI Infrastructure Control Plane helps you stretch budgets by using your cloud compute more efficiently.

As an agnostic, end-to-end AI platform, ClearML gives customers the ability to use AWS, Azure, or GCP (or all three), with or without on-prem machines. It doesn’t matter where the compute lives. As long as it is configured correctly, AI Builders on ClearML have the same frictionless experience and the same instant access. AI teams looking to economize can stretch cloud budgets with support for spot instances, the ability to choose less-expensive zones based on availability, and autoscalers that spin up instances only when needed and tear them down after a set idle period.

Organizations that also have on-prem compute can prioritize those machines for running AI workloads and use cloud resources only as spillover, minimizing cloud usage.
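A minimal sketch of this spillover policy, as a hypothetical routing helper (not a ClearML API):

```python
# Illustrative spillover routing (hypothetical helper, not ClearML's
# scheduler): prefer free on-prem GPUs, fall back to cloud only when
# on-prem capacity is exhausted.

def route(job_gpus, onprem_free, cloud_enabled=True):
    """Return where a job should run under an on-prem-first policy."""
    if job_gpus <= onprem_free:
        return "on-prem"              # cheapest option: owned hardware
    if cloud_enabled:
        return "cloud"                # spillover only when on-prem is full
    return "queued"                   # wait for on-prem capacity to free up
```

The design choice here is simply ordering: cloud capacity is treated as overflow, never as the default, which is what keeps hourly cloud spend at the minimum the workload allows.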

Gain Greater Visibility and Improved Governance

For organizations with complex computing infrastructure, visualizing all of your resources at the same time can be challenging. ClearML’s AI Infrastructure Control Plane makes overall governance much easier by serving as a single pane of glass for infrastructure leaders and administrators to view all resources, their statuses, and their utilization. ClearML provides dashboards (such as our Orchestration Dashboard) and reporting to monitor resources on a macro level and also enables teams to drill down into what’s happening at the queue level.

Data oversight is critical to governance, and ClearML’s endpoint monitoring service makes it easy for teams to see which models are in use and actively serving data. Your AI teams can quickly assess the number of live instances for a particular model endpoint as well as performance metrics such as uptime, request counts, and latency. Teams can also implement secure multi-tenancy for maximum protection of data and compute between tenants.
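As a sketch of how such endpoint metrics can be summarized, the Python below computes uptime, request count, and latency percentiles for one endpoint. Field names and the nearest-rank percentile method are assumptions for illustration, not ClearML's monitoring API.

```python
# Illustrative metric summary for a model endpoint (assumed data shapes,
# not ClearML's monitoring service). Assumes at least one request.

def endpoint_stats(requests, window_seconds, downtime_seconds):
    """Summarize one endpoint: uptime %, request count, and p50/p95
    latency in milliseconds (nearest-rank percentiles)."""
    latencies = sorted(r["latency_ms"] for r in requests)

    def pct(p):
        # Nearest-rank percentile, clamped to the last sample.
        return latencies[min(len(latencies) - 1, int(p * len(latencies)))]

    return {
        "uptime_pct": 100.0 * (window_seconds - downtime_seconds) / window_seconds,
        "requests": len(latencies),
        "p50_ms": pct(0.50),
        "p95_ms": pct(0.95),
    }
```

A dashboard row per endpoint is then just one such summary per monitoring window.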

Next Steps

Generative AI (or any AI) comes with lofty revenue goals and a significant reality check on costs. Organizations will continue to scrutinize their AI investments to understand if those dollars are delivering the innovation they require. By enabling visibility and improving utilization across your infrastructure, ClearML’s AI Infrastructure Control Plane makes this ROI calculation easier. Learn more on our website or request a demo with our sales team.
