Skip to main content

Dataset Management with CLI and SDK

In this tutorial, we are going to manage the CIFAR dataset with clearml-data CLI, and then use ClearML's Dataset class to ingest the data.

Creating the Dataset#

Downloading the Data#

Before we can register the CIFAR dataset with clearml-data, we need to obtain a local copy of it.

Execute this python script to download the data

from clearml import StorageManager
manager = StorageManager()
dataset_path = manager.get_local_copy(
remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
)
# make sure to copy the printed value
print("COPY THIS DATASET PATH: {}".format(dataset_path))

Expected response:

COPY THIS DATASET PATH: ~/.clearml/cache/storage_manager/global/f2751d3a22ccb78db0e07874912b5c43.cifar-10-python_artifacts_archive_None

The script prints the path to the downloaded data. It will be needed later on.

Creating the Dataset#

To create the dataset, execute the following command:

clearml-data create --project dataset_examples --name cifar_dataset

Expected response:

clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=ee1c35f60f384e65bc800f42f0aca5ec

Where ee1c35f60f384e65bc800f42f0aca5ec is the dataset ID.

Adding Files#

Add the files we just downloaded to the dataset:

clearml-data add --files <dataset_path>

where dataset_path is the path that was printed earlier, which denotes the location of the downloaded dataset.

note

There's no need to specify a dataset_id, since the clearml-data session stores it.

Finalizing the Dataset#

Run the close command to upload the files (it'll be uploaded to ClearML Server by default):

clearml-data close

This command sets the dataset task's status to completed, so it will no longer be modifiable. This ensures future reproducibility.

The information about the dataset, including a list of files and their sizes, can be viewed in the WebApp, in the dataset task's ARTIFACTS tab.

image

Using the Dataset#

Now that we have a new dataset registered, we can consume it.

The data_ingestion.py example script demonstrates using the dataset within Python code.

dataset_name = "cifar_dataset"
dataset_project = "dataset_examples"
from clearml import Dataset
dataset_path = Dataset.get(
dataset_name=dataset_name,
dataset_project=dataset_project
).get_local_copy()
trainset = datasets.CIFAR10(
root=dataset_path,
train=True,
download=False,
transform=transform
)

The Dataset's get_local_copy method will return a path to the cached, downloaded dataset. Then we provide the path to Pytorch's dataset object.

The script then trains a neural network to classify images using the dataset created above.