Cloud vendor’s MLOps or Open source? – guest blogpost

April 14, 2022

Originally published by – republished with the author’s approval.

For MLOps Tools, Think Hard Before Saying “I Do” to Your Cloud Provider


If someone had told my 15-years-ago self that I’d become a DevOps engineer, I’d have scratched my head and asked them to repeat that. Back then, of course, applications were either maintained on a dedicated server or (sigh!) installed on end-user machines with little control or flexibility.

Today, these paradigms are essentially obsolete; cloud computing is ubiquitous and successful. Even non-technical employees manage SaaS products efficiently and confidently – but the people who ensure that it all hums along smoothly behind the scenes? That’s us, the DevOps folks.

This same paradigm shift is now happening, as we all know, in the ML space … with MLOps. 

And it’s an evolution: When I started my job at an ML company, I was just a DevOps guy helping the data science team organize their work and deploy it. I leveraged my practical knowledge and best practices for managing infrastructure (mainly Kubernetes).

We cranked along, and one day I woke up like Gregor Samsa to discover I’d turned into an MLOps engineer.


Build or Buy vs. Marrying Your Cloud Provider

We all know the old “Build or Buy” conundrum for tech in general, and by now, we’ve all heard the practical advice: “don’t build MLOps tools — just buy best-of-breed and focus on what you are truly good at.” As time goes by, I’ve watched this general approach gain broad acceptance for MLOps tools.

But now the focus shifts to choosing the right product for you … and here’s where it gets interesting. 

When choosing an MLOps provider, we can either:

  1. Choose an integrated solution from our existing cloud vendor
  2. Choose a dedicated solution from an “external” vendor

This decision can then be further broken down into choosing an end-to-end tool that handles most (or even all) aspects of your MLOps workflow, or selecting a best-of-breed solution that does, in fact, require some integration.

At first glance, the first option seems obvious. After all, we already use a specific cloud vendor’s services. We know how to use their APIs, and we are already paying them (or spending credits), so adding another service isn’t that big a deal. We’d assume their solution would integrate well with the compute resources we are probably already using, so it’s an easy choice. Right?

Perhaps. But let’s take a step back and think it through. 


The Dark Side of a Cloud Vendor’s MLOps

Yes, it’s really easy to start using a cloud vendor’s offering. But when thinking long-term, we have to take specific considerations into account regarding the impact of this decision. 

I’m not going to go down the rabbit hole of specific technical details and features because, for both startups and larger companies, each organization’s needs, priorities, and existing workflows/infrastructure will probably be different than mine. Instead, I’ll simply outline the factors to consider for your specific development scenario.

  1. The dreaded vendor lock-in – I don’t think I need to say much more. We rely on our cloud providers so much that it’s almost inevitable: One day (when prices jump, offerings change, etc.), we’ll have no choice but to pay more – or pay the costs (in time, more than money) of migrating to another vendor. In the case of MLOps, this eventuality is truly painful because a lot of experimentation history is stored with the provider, and moving it to another tool can land anywhere between “We can’t do it!” and “OMG, it costs that much?”
  2. Base Cost – The initial price quote might not look like a significant issue at first. You usually don’t start big, and when starting with MLOps – like any new field – quick wins are very important. And yes, cloud infrastructure gives you the flexibility to scale as you need. But long term, for medium-to-heavy workloads, the cloud is simply more expensive. Buying a GPU rig has a high upfront cost, but it pays for itself. And that’s before counting managed services (managed Kubernetes clusters, managed training machines, and the like) that add a premium on top of the bare hardware cost. In my experience, the break-even point is about two years, so this is not a clear financial win.
  3. Hybridity – If you have local machines, the cloud vendor’s offering just doesn’t cut it. If you try this approach, you have to build and maintain the bridge between local machine management and the cloud-based MLOps solution yourself. Yes, it’s doable, but far from ideal. After all, MLOps is here to save us work, not to create more.
  4. Speed – When cloud providers expand their services, they make a conference out of it 🙂 It’s a long, heavy, gradual process as they develop and roll out new capabilities and features. But when it comes to fixing routine, often “tiny” but ridiculously annoying UI problems? That’s rarely a priority, and yeah, it might never happen. Small vendors, at least in my experience, can be much more agile. And heck: if you choose an OSS solution (like the one I use, ClearML), you can modify the code yourself, and even get help from the support team doing so!
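To make the cost point (item 2) concrete, here’s a back-of-envelope break-even sketch in Python. Every number in it is a made-up assumption for illustration (rig price, monthly upkeep, cloud instance rate), not a real quote from any vendor; plug in your own figures.

```python
# Back-of-envelope break-even estimate for on-prem GPU hardware vs. cloud.
# All prices below are illustrative assumptions, not real vendor quotes.

def breakeven_months(rig_upfront: float, rig_monthly: float,
                     cloud_monthly: float) -> float:
    """Months until owning a GPU rig costs less than renting cloud GPUs."""
    if cloud_monthly <= rig_monthly:
        return float("inf")  # the cloud never becomes the more expensive option
    return rig_upfront / (cloud_monthly - rig_monthly)

# Hypothetical numbers: a $30k rig with $400/month power and upkeep,
# versus $1,700/month for a comparable cloud GPU instance.
months = breakeven_months(30_000, 400, 1_700)
print(f"Break-even after ~{months:.0f} months")  # ~23 months, roughly two years
```

With these (assumed) inputs the rig pays for itself in about two years, which is in the same ballpark as the experience described above; heavier or lighter usage will shift the answer substantially.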

In short, the cloud has a long, long list of benefits for every stakeholder, from the B2C end user to the mega-corp CTO. But for this one specific dilemma, your cloud vendor may not be the be-all and end-all answer to a technical challenge.