Data management is ALL THE RAGE!

August 9, 2021


Everyone wants to manage their data, and if it’s a feature store, even better! But good data management starts with two things: a lightweight tool with zero upfront setup cost, and getting the most utility out of it. That is where ClearML-Data comes in.

ClearML-Data mimics the lightweight feel of git, applied to data (who doesn’t know git?), and gives it a spin. It is an efficient open-source dataset management tool that reflects how we view DataOps and how it differs from git-like solutions, including:

  • ‘Better-than-git’ for data exploration: preview your data, tag it, and easily go back in time (you can even query based on custom metadata)
  • A full, differential-based solution on top of object-storage / http / NAS layer
  • The ability to abstract Data from your Code
  • CLI and Python API to easily create datasets from anywhere

In short, we believe data is not code and thus shouldn’t be stored in a git tree, since daily dataset work is usually non-linear. Keeping code and data decoupled makes it easier to use the same dataset version with different code repositories, or the latest version of your code with a different dataset.
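To make the “differential-based” idea from the list above concrete, here is a minimal sketch in plain Python. It is not ClearML-Data’s actual implementation; the function names `snapshot` and `diff_versions` are made up for illustration, and the sketch assumes files are compared by content hash.

```python
import hashlib
from pathlib import Path


def snapshot(folder):
    """Map each file's relative path to the SHA-256 hash of its content."""
    root = Path(folder)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_versions(parent, child):
    """Compare two snapshots and return (added, modified, removed) path lists."""
    added = sorted(p for p in child if p not in parent)
    modified = sorted(p for p in child if p in parent and child[p] != parent[p])
    removed = sorted(p for p in parent if p not in child)
    return added, modified, removed
```

Under this scheme, a new dataset version only needs to store the added and modified files plus a reference to its parent version; everything else is reused.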

ClearML-Data uses the same infrastructure as ClearML, integrates nicely with your pipelines, and works both as a standalone tool and as a seamless part of our end-to-end MLOps solution.

For full flexibility, ClearML-Data supports both command-line and programmatic interfaces, and the dataset object can be inspected in the ClearML UI. To seal the deal, the diff-based caching also solves data localization problems (store once, use everywhere), putting DataOps for your workflow datasets within reach!

Since ClearML enables both experiment tracking and automation, we’ve made it easy to always get the ‘latest’ dataset version as an input to your pipeline. Integration with existing code is also a breeze: data transfers are abstracted away, and data caching is super simple.
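As a rough sketch of what “get the latest version” can look like from existing code, the snippet below uses the `Dataset.get` and `get_local_copy` calls from the ClearML Python SDK. The wrapper function `fetch_latest_dataset` and its arguments are hypothetical names chosen for this example, and it assumes a configured ClearML server.

```python
def fetch_latest_dataset(name, project):
    """Return a local, cached folder path for the newest finalized dataset version.

    `name` and `project` are placeholders for your own dataset and project names.
    """
    # Imported here so ClearML is only required when the function is called.
    from clearml import Dataset

    # With no explicit dataset ID, the lookup is by project and name;
    # only_completed=True restricts it to finalized versions.
    dataset = Dataset.get(
        dataset_project=project,
        dataset_name=name,
        only_completed=True,
    )

    # Downloads the files on first use and serves them from the local cache afterwards.
    return dataset.get_local_copy()
```

Your training code then only deals with a local folder path, with no knowledge of where the data is actually stored.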

ClearML-Data in ~3 lines of code:

1. Create:

2. Add:
Per users’ requests, we’ve made adding files easy through the use of folders: you only need to point to the appropriate folder and voilà, it has been added. Of course, all such steps are recorded, so you have full reproducibility and automation available through a script.

3. List [optional]:
This provides another line of defense: view and review the added files and make sure everything is in its right place before uploading.

4. Close (finalize + upload):
Once the version is approved and final, you upload the files (depending on your ClearML version: free for up to 100 GB, or select your own storage, whether on-prem, cloud, or hybrid).

All of the above takes less than 5 minutes (well, depending on your upload speed), and for maximum effect you can add tags or metadata to make searching for datasets much easier and much faster!
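The four steps above can be sketched with the ClearML Python SDK roughly as follows. This is a minimal illustration rather than the canonical workflow: the wrapper `publish_dataset` and its arguments are hypothetical, a configured ClearML server is assumed, and the “close” step is expressed here as an upload followed by a finalize.

```python
def publish_dataset(folder, name, project):
    """Create a dataset version from a local folder, upload it, and return its ID.

    `folder`, `name`, and `project` are placeholders chosen for this sketch.
    """
    # Imported here so ClearML is only required when the function is called.
    from clearml import Dataset

    # 1. Create: open a new, empty dataset version.
    dataset = Dataset.create(dataset_name=name, dataset_project=project)

    # 2. Add: point at a folder to register the files it contains.
    dataset.add_files(path=folder)

    # 3. List (optional): review what was registered before uploading.
    for relative_path in dataset.list_files():
        print(relative_path)

    # 4. Close: upload the files, then finalize the version.
    dataset.upload()
    dataset.finalize()
    return dataset.id
```

The returned ID can be stored alongside an experiment so the exact dataset version used can be retrieved later.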

ClearML-Data questions, comments, reviews, and more?

When choosing a data management solution, starting lightweight and adjusting as you go is the way to go! Give ClearML-Data a try 🙂