Cloud Instance Autoscaling and saving a buck…and a headache

August 2, 2021


What can we say: Research is non-linear, there are tests, and adjustments, and more tests, and more adjustments, and then we add more data, and test some more, and… you know the story.

With that, we also know there's no other way: doing deep learning effectively requires GPU machines. Oh yes, those are just great, but they're also pretty darn expensive. We start small with just a few machines, but when we need to go big (or meet some crazy deadline) we turn to our trusted cloud provider to help us scale. But we aren't too fond of managing cloud machines, are we? I mean, who likes managing all those Docker containers for multiple scripts? And who remembers to turn those machines off? Ahh… it's a headache, to say the least, and one that could end up costing a small fortune.

Side note: I personally know someone who wasted $20,000 in a weekend because they forgot to turn off their cloud instance. And don’t play like you don’t know someone like that as well 😉

But instead of chatting it up about wasted money or the burden of managing our cloud machines, let’s turn it over to…

How to save money with GPU autoscaling

Wouldn’t it be magical if someone could automatically know when I need a machine, spin it up for me, and even spin it down when it’s no longer needed?

Well, I've personally met this magical creature, and it's called an Autoscaler.

Still, let’s not get ahead of ourselves.

While the Autoscaler makes sure we don’t waste good money on unused (and very expensive) instances, we also need to make sure every instance has its correct environment.
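Conceptually, that's all an autoscaler does: watch a queue, compare the pending work against the machines already running, and spin instances up or down to match. Here's a minimal, purely illustrative sketch in Python; the class and method names are invented for this example and are not the ClearML API:

```python
from dataclasses import dataclass


@dataclass
class AutoscalerSketch:
    """Toy scale-up/scale-down decision logic (illustrative only, not the ClearML API)."""

    max_instances: int  # budget cap: never run more machines than this
    running: int = 0    # instances currently up

    def decide(self, pending_jobs: int) -> int:
        """Return how many instances to add (positive) or shut down (negative)."""
        if pending_jobs > self.running:
            # More queued experiments than machines: scale up, capped by budget.
            target = min(pending_jobs, self.max_instances)
        elif pending_jobs == 0:
            # Nothing queued: shut everything down so no one pays for idle GPUs.
            target = 0
        else:
            # Enough machines already working through the queue: hold steady.
            target = self.running
        delta = target - self.running
        self.running = target
        return delta
```

Run it through a typical day: three experiments get queued, the queue drains, and everything shuts off by itself; no one has to remember to turn the machines off.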

Every single experiment can have a different Python version, or PyTorch version, or CUDA version, or NumPy version, or… you know what I'm talking about. Or maybe manual management of Docker containers, where you're left guessing whether 'latest_docker_python3.7_pytorch1.8.1' is the right one. Sound familiar?

Saying this is a tedious process would be a true understatement.

So how do we…

Automatically recreate our needed environment

For this second, equally important task, ClearML Agent comes to the rescue and recreates that much-needed environment for you. And if you're not too fond of writing Dockerfiles yourself, it will even build a Docker container for you (but you must say please ;)).

…and simply ensures your code…just…runs.

And all this comes for FREE with the Autoscaler! Just choose a cloud instance image, and the Autoscaler will make sure ClearML Agent is up and ready to execute experiments automatically!
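For a sense of what "up and ready" means, getting an agent running on a fresh instance looks roughly like this (a sketch based on the ClearML Agent CLI at the time of writing; the queue name `default` is just an example, and in the autoscaled setup these steps are baked into the cloud image for you):

```shell
# Install the agent on the instance
pip install clearml-agent

# Point the agent at your ClearML server (interactive; writes the config file)
clearml-agent init

# Start pulling experiments from a queue, running each inside its own Docker container
clearml-agent daemon --queue default --docker
```

From that moment on, any experiment enqueued to `default` gets picked up, its environment rebuilt, and its code executed, with no one SSH-ing into anything.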

In Summary

Research means lots of tests, changes, more data, and rinse-and-repeat, which can get really expensive on GPUs in the cloud, whether by choice or by sheer accident.

But the Autoscaler and ClearML Agent work really well together: you can save a ton of money and make sure the right environment is recreated for you every time.

The Autoscaler comes as deploy-it-yourself code or as a hosted application.


And for those whose interest ClearML Agent has piqued, here's a little bit more:

  • Complete reproduction of the execution environment.
  • Clone and modify code without committing EVERY LITTLE CHANGE!
  • On-the-fly parameter changes without any code change required.
  • Uniform interface for On-Prem, Cloud, and Hybrid configurations.
  • Supports job prioritization.
  • Bonus: builds a Docker container from an experiment.
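The "clone and modify" and "on-the-fly parameter changes" bullets look roughly like this with the ClearML SDK. This is an illustrative fragment rather than a runnable script: it needs a configured ClearML server, `SOURCE_TASK_ID` is a placeholder for one of your existing experiments, the parameter name `General/learning_rate` is just an example, and `default` is whatever queue your agents listen on:

```python
from clearml import Task

SOURCE_TASK_ID = "replace-with-your-task-id"  # placeholder: an existing experiment

# Clone an existing experiment into a new draft task (no commit required)
cloned = Task.clone(source_task=SOURCE_TASK_ID, name="same experiment, new LR")

# Change a hyperparameter on the draft, no code change required
# (example parameter name; use whatever your experiment actually logs)
cloned.set_parameter("General/learning_rate", 0.001)

# Enqueue it; the next available agent rebuilds the environment and runs it
Task.enqueue(cloned, queue_name="default")
```

The point is that the modified copy never touches your git history: the agent checks out the original code, applies the parameter override, and runs.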

…or go the whole nine yards with ClearML Agent.