Skip to main content

Dataset Management with CLI and SDK

In this tutorial, we are going to manage the CIFAR dataset with clearml-data CLI, and then use ClearML's Dataset class to ingest the data.

Creating the Dataset

Downloading the Data

Before we can register the CIFAR dataset with clearml-data, we need to obtain a local copy of it.

Execute this python script to download the data

from clearml import StorageManager

manager = StorageManager()
dataset_path = manager.get_local_copy(
remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
)
# make sure to copy the printed value
print("COPY THIS DATASET PATH: {}".format(dataset_path))

Expected response:

COPY THIS DATASET PATH: ~/.clearml/cache/storage_manager/global/f2751d3a22ccb78db0e07874912b5c43.cifar-10-python_artifacts_archive_None

The script prints the path to the downloaded data. It will be needed later on.

Creating the Dataset

To create the dataset, execute the following command:

clearml-data create --project dataset_examples --name cifar_dataset

Expected response:

clearml-data - Dataset Management & Versioning CLI 
Creating a new dataset:
New dataset created id=ee1c35f60f384e65bc800f42f0aca5ec

Where ee1c35f60f384e65bc800f42f0aca5ec is the dataset ID.

Adding Files

Add the files we just downloaded to the dataset:

clearml-data add --files <dataset_path>

where dataset_path is the path that was printed earlier, which denotes the location of the downloaded dataset.

note

There's no need to specify a dataset_id, since the clearml-data session stores it.

Finalizing the Dataset

Run the close command to upload the files (it'll be uploaded to ClearML Server by default):

clearml-data close 

This command sets the dataset task's status to completed, so it will no longer be modifiable. This ensures future reproducibility.

Information about the dataset can be viewed in the WebApp, in the dataset's details panel. In the panel's CONTENT tab, you can see a table summarizing version contents, including file names, file sizes, and hashes.

Dataset content tab

Using the Dataset

Now that we have a new dataset registered, we can consume it.

The data_ingestion.py example script demonstrates using the dataset within Python code.

dataset_name = "cifar_dataset"
dataset_project = "dataset_examples"

from clearml import Dataset

dataset_path = Dataset.get(
dataset_name=dataset_name,
dataset_project=dataset_project).get_local_copy()

trainset = datasets.CIFAR10(
root=dataset_path,
train=True,
download=False,
transform=transform
)

The Dataset's get_local_copy method will return a path to the cached, downloaded dataset. Then we provide the path to Pytorch's dataset object.

The script then trains a neural network to classify images using the dataset created above.