More often than not, I forget how tricky something can be for people who are new to it. When I have been doing something for a while, it’s obvious. To me. An example would be roasting my own coffee. It’s obvious (again, to me) that I have to keep a detailed diary of how much I put into the roaster. What temperature. How long. Sun-dried or wet-processed. The list goes on.
Recently I was reminded that this also happens a lot in computing. I mean a LOT. People ask what seems to be a simple question, but it turns into a much deeper exposition. We usually have two or three of these satori (or “profound awakenings”) each day in the Slack channel.
Take, for example, this question-and-answer thread about clearml-data. Names have obviously been removed to protect the innocent 😉
The initial question was pretty simple. Someone was exploring ClearML to see if it fit their needs. They liked the look of the UI but didn’t see the part that handled the data. There was no place for directly specifying the data sources, etc.
This is understandable until you consider that a dataset is represented by a task (or experiment, in UI terms). A dataset is its own Task type, so it is easy to search and browse through datasets. With the addition of tags, you also get finer-grained search capabilities.
The Real Question Is …
However, this then leads to the “real” question, which was:
For example, let's say you have a basic project in which the workflow is:
You read a csv stored in your filesystem.
You transform this csv, adding some new features, scaling, and things like that.
You train a model (usually doing several experiments with different hyperparameters).
You deploy the model, and it is ready to make predictions.
How would you structure this workflow in Tasks in ClearML?
The answer given demonstrates one of the basic flows:
1. Use clearml-data to create a dataset from the local CSV file:
clearml-data sync --folder <folder-where-the-csv-file-is>
2. Write some Python that takes the csv file from the dataset and creates a new dataset of the preprocessed data:
from clearml import Dataset
# fetch a local copy of the raw dataset's files
original_csv_folder = Dataset.get(dataset_id=args.dataset).get_local_copy()
# process csv file -> generate a new csv (new_created_file)
preprocessed = Dataset.create(dataset_name="preprocessed", dataset_project="MyProject")  # illustrative names
preprocessed.add_files(new_created_file)
preprocessed.upload()
preprocessed.finalize()  # close the dataset so it can be used downstream
3. Train the model (i.e. get the dataset prepared in (2)), adding output_uri to upload the model (say, to your S3 bucket or the ClearML server).
4. Use the ClearML model repository (see the Models tab in the project experiment table) to get / download the trained model.