ClearML-Data Lemonade: getting local datasets quickly and easily

October 21, 2021

Congratulations on creating a clean(ish) dataset to use for training!

Now while the dataset is stored where it’s accessible to everyone, the distribution itself is a hassle!

Local workstations, local GPU machines, and cloud machines (that may be spun up and down without disk persistence) all need their own copy of the data.

…and to say it is annoying is an understatement!

Of course, sometimes dev teams will play along, have time to spare, and are more than happy to help a Data Scientist out; but more often it isn’t as rosy, and you just end up scp-ing the data manually to every machine. And if the data is live and changing (it is), then keeping track becomes a tedious afternoon activity we’d love to avoid.

There must be a better way!

Shouldn’t getting a dataset be as simple as asking for it? Especially for such a common and necessary task?

ClearML-Data to the rescue

ClearML Data understands data management is all the rage and solves this in a rather elegant way (we’re completely unbiased ;)).
You simply retrieve your dataset object, and with one command, it’s there!

from clearml import Dataset
local_path = Dataset.get(dataset_id='dataset_id_').get_local_copy()

Is it in S3? Is it in Azure storage? Is it available locally? Who knows and who cares! It’s there whenever you ask for it.

In addition, ClearML Data also:

  • Caches your dataset: don’t download the same stuff twice
  • Optimizes transfers: if partial datasets are local, ClearML Data will only retrieve the delta
  • Versions your data: always access the latest version

Sounds like a feature that should be a given, a go-to, a default, right?

That’s exactly how we roll here at ClearML and why we made it so.
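
To make the "only retrieve the delta" idea a bit more concrete, here is a minimal sketch of how a delta between a remote dataset listing and a local cache could be computed. This is not ClearML Data's actual implementation: the manifest format (relative path mapped to a SHA-256 digest) and the helper names file_digest, local_manifest, and files_to_fetch are assumptions made up purely for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def local_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its content digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def files_to_fetch(remote: dict[str, str], local: dict[str, str]) -> list[str]:
    """Return paths that are missing locally or whose content differs."""
    return sorted(
        path for path, digest in remote.items() if local.get(path) != digest
    )
```

Given a (hypothetical) remote manifest and the output of local_manifest() for a cache folder, files_to_fetch() lists only the files that are new or changed, which is the part a delta-aware downloader would actually need to transfer.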

What does get_local_copy() actually mean?

For Data Scientists:

  • Getting the latest version of a dataset
  • Persistent cache download (for reuse)
  • Maxing out download rates
  • Automatic authentication

In short, a single line of code makes sure you quickly access the latest version of your dataset, no matter where you choose to run your code.
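
If you would rather not hard-code a dataset ID, the same call can look a dataset up by project and name. The sketch below wraps that in a small helper; fetch_dataset and its dataset_cls parameter are hypothetical names introduced here (the parameter only exists so the helper can be tried out with a stand-in class), and it assumes that Dataset.get() resolves a project/name pair to the most recent matching dataset version on your ClearML server.

```python
from typing import Any, Optional


def fetch_dataset(project: str, name: str, dataset_cls: Optional[Any] = None) -> str:
    """Return the path of a local, cached copy of the requested dataset.

    `dataset_cls` is a hypothetical hook for swapping in a stand-in class;
    when omitted, the real clearml.Dataset class is used.
    """
    if dataset_cls is None:
        # Imported lazily so the helper can be exercised without ClearML installed.
        from clearml import Dataset as dataset_cls

    # Look the dataset up by project and name instead of by ID.
    dataset = dataset_cls.get(dataset_project=project, dataset_name=name)

    # Downloads on first use; later calls are served from the local cache.
    return dataset.get_local_copy()
```

Calling fetch_dataset("my_project", "my_dataset") (both placeholder names) on a workstation, a GPU box, or a freshly spun-up cloud instance should then hand back a local folder you can point your training code at.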

For Machine Learning Engineers:
By using ClearML-Data and get_local_copy(), your pipelines become truly decoupled from the data, while caching combined with binary diffs makes it easy to re-run pipelines on new versions of your dataset with minimal download times.

For DevOps:
Letting Data Scientists and ML Engineers self-serve both version control and interim data transfer takes quite a bit off your plate. The only thing left to do is manage access to the storage you maintain.

Bonus: Train while downloading

And to leave you with a little teaser: ClearML Enterprise’s ClearML Hyper-Datasets also let you begin training while the data is still downloading. That’s great when your dataset is larger than a couple of GB, because your expensive GPU instances actually use the GPU instead of waiting for the download to finish!

Intrigued? Check out our documentation.

Questions, feedback, and more?

We have a live example using ClearML-Data in a pipeline, where IDs and Artifacts are created on the fly, so you don’t have to copy-paste or memorize any ID.