ClearML Data

In Machine Learning, you are very likely dealing with a gargantuan amount of data that you need to put in a dataset, which you then need to be able to share, reproduce, and track.

ClearML Data Management solves two important challenges:

  • Accessibility - Making data easily accessible from every machine.
  • Versioning - Linking data and experiments for better traceability.

We believe data is not code. It should not be stored in a git tree, because progress on datasets is not always linear. Moreover, it can be difficult and inefficient to locate, in a git tree, the commit associated with a particular version of a dataset.

A clearml-data dataset is a collection of files, stored on a central storage location (S3 / GS / Azure / Network Storage). Datasets can be set up to inherit from other datasets, so data lineages can be created, and users can track when and how their data changes.

Dataset changes are stored using differentiable storage, meaning a version will store the change-set from its previous dataset parents.

Local copies of datasets are always cached, so the same data never needs to be downloaded twice. When a dataset is pulled it will automatically pull all parent datasets and merge them into one output folder for you to work with.

ClearML Data offers two interfaces:

  • clearml-data - CLI utility for creating, uploading, and managing datasets.
  • clearml.Dataset - A python interface for creating, retrieving, managing, and using datasets.

Setup#

clearml-data comes built-in with our clearml python package! Just check out the getting started guide for more info!

Workflow#

Below is an example of a workflow using ClearML Data's command line tool to create a dataset and integrating the dataset into code using ClearML Data's python interface.

Creating a Dataset#

Using the clearml-data CLI, users can create datasets using the following commands:

clearml-data create --project dataset_example --name initial_version
clearml-data add --files data_folder
clearml-data close

The commands will do the following:

  1. Start a Data Processing Task called "initial_version" in the "dataset_example" project.

  2. The CLI will return a unique ID for the dataset.

  3. Add all the files from the "data_folder" folder to the dataset and upload them, by default to the ClearML server.

  4. Finalize the dataset, making it immutable and ready to be consumed.
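
The same workflow can also be sketched with the python interface. This is a minimal sketch, assuming a configured ClearML environment and a local data_folder directory; the project and dataset names mirror the CLI example above:

```python
from clearml import Dataset

# Create a new dataset version (equivalent to `clearml-data create`)
ds = Dataset.create(dataset_project="dataset_example",
                    dataset_name="initial_version")

# Add a local folder's files (equivalent to `clearml-data add`)
ds.add_files(path="data_folder")

# Upload the files and finalize (equivalent to `clearml-data close`)
ds.upload()
ds.finalize()

print(ds.id)  # the dataset's unique ID
```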

note

clearml-data is stateful and remembers the last created dataset, so there's no need to pass a dataset ID unless you want to work on another dataset.

Using a Dataset#

Now in our python code, we can access and use the created dataset from anywhere:

from clearml import Dataset
local_path = Dataset.get(dataset_id='dataset_id_from_previous_command').get_local_copy()

All our files are now available under local_path, in the same folder structure. It is that simple!

The next step is to set the dataset_id as a parameter for our code and voilà! We can now train on any dataset we have in the system.
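
For example, the dataset ID can be exposed as a command-line argument. This is a sketch using only the standard library; the load_dataset helper name is illustrative, and the placeholder ID is not filled in:

```python
import argparse

def parse_args(argv=None):
    # Expose the dataset ID as a script parameter so the same training
    # code can run against any dataset version in the system.
    parser = argparse.ArgumentParser(description="Train on a ClearML dataset")
    parser.add_argument("--dataset-id", required=True,
                        help="ID of the ClearML dataset to train on")
    return parser.parse_args(argv)

def load_dataset(dataset_id):
    # Fetch a cached local copy of the dataset (import kept local so
    # argument parsing stays usable without clearml installed)
    from clearml import Dataset
    return Dataset.get(dataset_id=dataset_id).get_local_copy()

# In a real script, call parse_args() with no argument to read sys.argv
args = parse_args(["--dataset-id", "<dataset_id_from_previous_command>"])
```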

CLI Options#

It's possible to manage datasets (create / modify / upload / delete) with the clearml-data command line tool.

Creating a Dataset#

clearml-data create --project <project_name> --name <dataset_name> --parents <existing_dataset_id>

Creates a new dataset.

Parameters

|Name|Description|Optional|
|---|---|---|
|name|Dataset's name|No|
|project|Dataset's project|No|
|parents|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered|Yes|
|tags|Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets|Yes|

important

clearml-data works in a stateful mode, so once a new dataset is created, the following commands do not require the --id flag.


Add Files#

clearml-data add --id <dataset_id> --files <filenames/folders_to_add>

It's possible to add individual files or complete folders.

Parameters

|Name|Description|Optional|
|---|---|---|
|id|Dataset's ID. Default: previously created / accessed dataset|Yes|
|files|Files / folders to add. Wildcard selection is supported, for example: ~/data/*.jpg ~/data/json|No|
|dataset-folder|Dataset base folder to add the files to in the dataset. Default: dataset root|Yes|
|non-recursive|Disable recursive scan of files|Yes|
|verbose|Verbose reporting|Yes|

Remove Files#

clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_to_remove>

Parameters

|Name|Description|Optional|
|---|---|---|
|id|Dataset's ID. Default: previously created / accessed dataset|Yes|
|files|Files / folders to remove (wildcard selection is supported, for example: ~/data/*.jpg ~/data/json). Notice: file path is the path within the dataset, not the local path|No|
|non-recursive|Disable recursive scan of files|Yes|
|verbose|Verbose reporting|Yes|

Upload Dataset Content#

clearml-data upload [--id <dataset_id>] [--storage <upload_destination>]

Uploads added files to ClearML Server by default. It's possible to specify a different storage medium by entering an upload destination, such as s3://bucket, gs://, azure://, /mnt/shared/.

Parameters

|Name|Description|Optional|
|---|---|---|
|id|Dataset's ID. Default: previously created / accessed dataset|Yes|
|storage|Remote storage to use for the dataset files. Default: files_server|Yes|
|verbose|Verbose reporting|Yes|

Finalize Dataset#

clearml-data close --id <dataset_id>

Finalizes the dataset and makes it ready to be consumed. It automatically uploads all files that were not previously uploaded. Once a dataset is finalized, it can no longer be modified.

Parameters

|Name|Description|Optional|
|---|---|---|
|id|Dataset's ID. Default: previously created / accessed dataset|Yes|
|storage|Remote storage to use for the dataset files. Default: files_server|Yes|
|disable-upload|Disable automatic upload when closing the dataset|Yes|
|verbose|Verbose reporting|Yes|

Sync Local Folder#

clearml-data sync [--id <dataset_id>] --folder <folder_location> [--parents '<parent_id>']

This option syncs a folder's content with ClearML. It is useful when a user has a single source of truth (i.e. a folder) that is updated from time to time.

To reflect an update in ClearML's system, call clearml-data sync and point it at the folder: a new dataset version is created, and any changes (file additions, modifications, or removals) are recorded in ClearML.

This command also uploads the data and finalizes the dataset automatically.

Parameters

|Name|Description|Optional|
|---|---|---|
|id|Dataset's ID. Default: previously created / accessed dataset|Yes|
|folder|Local folder to sync. Wildcard selection is supported, for example: ~/data/*.jpg ~/data/json|No|
|storage|Remote storage to use for the dataset files. Default: files_server|Yes|
|parents|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|Yes|
|project|If creating a new dataset, specify the dataset's project name|Yes|
|name|If creating a new dataset, specify the dataset's name|Yes|
|tags|Dataset user tags|Yes|
|skip-close|Do not auto close dataset after syncing folders|Yes|
|verbose|Verbose reporting|Yes|
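
The python interface offers a similar sync operation. This is a sketch, assuming a configured ClearML environment; the parent dataset ID is a placeholder and the local folder name is illustrative:

```python
from clearml import Dataset

# Create a new version that inherits from an existing parent version
ds = Dataset.create(dataset_project="dataset_example",
                    dataset_name="synced_version",
                    parent_datasets=["<parent_dataset_id>"])

# sync_folder() records file additions, modifications, and removals
# in the local folder relative to the parent's content
ds.sync_folder(local_path="data_folder")

# Upload the changes and finalize the new version
ds.upload()
ds.finalize()
```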

List Dataset Content#

clearml-data list [--id <dataset_id>]

Parameters

|Name|Description|Optional|
|---|---|---|
|id|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset|Yes|
|project|Specify dataset project name (if used instead of ID, dataset name is also required)|Yes|
|name|Specify dataset name (if used instead of ID, dataset project is also required)|Yes|
|filter|Filter files based on folder / wildcard. Multiple filters are supported. Example: folder/date_*.json folder/sub-folder|Yes|
|modified|Only list file changes (add / remove / modify) introduced in this version|Yes|

Delete Dataset#

clearml-data delete [--id <dataset_id_to_delete>]

Deletes an entire dataset from ClearML. This can also be used to delete a newly created dataset.

This does not work on datasets with children.

Parameters

|Name|Description|Optional|
|---|---|---|
|id|ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet|Yes|
|force|Force dataset deletion even if other dataset versions depend on it|Yes|

Search for a Dataset#

clearml-data search [--name <name>] [--project <project_name>] [--tags <tag>]

Lists all datasets in the system that match the search request.

Datasets can be searched by project, name, ID, and tags.

Parameters

|Name|Description|Optional|
|---|---|---|
|ids|A list of dataset IDs|Yes|
|project|The project name of the datasets|Yes|
|name|A dataset name or a partial name to filter datasets by|Yes|
|tags|A list of dataset user tags|Yes|

Compare Two Datasets#

clearml-data compare [--source SOURCE] [--target TARGET]

Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:

Comparison summary: 4 files removed, 3 files modified, 0 files added

Parameters

|Name|Description|Optional|
|---|---|---|
|source|Source dataset id (used as baseline)|No|
|target|Target dataset id (compare against the source baseline dataset)|No|
|verbose|Verbose report all file changes (instead of summary)|Yes|

Merge Datasets#

clearml-data squash --name NAME --ids [IDS [IDS ...]]

Squash (merge) multiple datasets into a single dataset version.

Parameters

|Name|Description|Optional|
|---|---|---|
|name|Create squashed dataset name|No|
|ids|Source dataset IDs to squash (merge down)|No|
|storage|Remote storage to use for the dataset files. Default: files_server|Yes|
|verbose|Verbose report all file changes (instead of summary)|Yes|
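
The equivalent operation in the python interface is Dataset.squash(). This is a sketch, assuming a configured ClearML environment; the source dataset IDs are placeholders:

```python
from clearml import Dataset

# Merge several dataset versions into a single flattened,
# independent version (no parent lineage)
squashed = Dataset.squash(
    dataset_name="merged_version",
    dataset_ids=["<dataset_id_1>", "<dataset_id_2>"],
)

print(squashed.id)  # the new squashed dataset's unique ID
```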

Verify Dataset#

clearml-data verify [--id ID] [--folder FOLDER]

Verify that the dataset content matches the data from the local source.

Parameters

|Name|Description|Optional|
|---|---|---|
|id|Specify dataset ID. Default: previously created / accessed dataset|Yes|
|folder|Specify dataset local copy (if not provided the local cache folder will be verified)|Yes|
|filesize|If True, only verify file size and skip hash checks (default: false)|Yes|
|verbose|Verbose report all file changes (instead of summary)|Yes|

Get a Dataset#

clearml-data get [--id ID] [--copy COPY] [--link LINK] [--overwrite]

Get a local copy of a dataset. By default, you get a read-only cached folder; to get a mutable copy, use the --copy flag.

Parameters

|Name|Description|Optional|
|---|---|---|
|id|Specify dataset ID. Default: previously created / accessed dataset|Yes|
|copy|Get a writable copy of the dataset to a specific output folder|Yes|
|link|Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset|Yes|
|overwrite|If True, overwrite the target folder|Yes|
|verbose|Verbose report all file changes (instead of summary)|Yes|
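
In python, the same distinction exists between a cached read-only copy and a mutable copy. A sketch, assuming a configured ClearML environment; the dataset ID and target folder are placeholders:

```python
from clearml import Dataset

ds = Dataset.get(dataset_id="<dataset_id>")

# Read-only: returns a path into the shared local cache
cached_path = ds.get_local_copy()

# Mutable: copies the dataset content into a folder you own
writable_path = ds.get_mutable_local_copy(target_folder="./my_dataset",
                                          overwrite=True)
```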

Publish a Dataset#

clearml-data publish --id ID

Publish the dataset for public use. The dataset must be finalized before it is published.

Parameters

|Name|Description|Optional|
|---|---|---|
|id|The dataset task id to be published|No|

Python API#

It's also possible to manage a dataset using ClearML Data's python interface.

All API commands should be imported with:

from clearml import Dataset

See all API commands in the Dataset reference page.
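
For example, datasets can also be retrieved by project and name rather than by ID. A minimal sketch, assuming a configured ClearML environment and the dataset created earlier:

```python
from clearml import Dataset

# Retrieve the most recent version of a dataset by project / name
ds = Dataset.get(dataset_project="dataset_example",
                 dataset_name="initial_version")

print(ds.id)  # the matched dataset version's unique ID
```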

Tutorials#

Take a look at the ClearML Data tutorials: