
CLI

important

This page covers clearml-data, ClearML's file-based data management solution. See Hyper-Datasets for ClearML's advanced queryable dataset management solution.

The clearml-data utility is a CLI tool for controlling and managing your data with ClearML.

This page provides a reference to clearml-data's CLI commands.

Creating a Dataset

clearml-data create --project <project_name> --name <dataset_name> [--parents <existing_dataset_id>]

Creates a new dataset.

Parameters

| Name | Description | Optional |
|---|---|---|
| --name | Dataset's name | No |
| --project | Dataset's project | No |
| --parents | IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered | Yes |
| --tags | Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets | Yes |

important

clearml-data works in a stateful mode so once a new dataset is created, the following commands do not require the --id flag.
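
As a rough illustration of the stateful mode, the sketch below chains create, add, and close without passing --id. The project name, dataset name, and data folder are hypothetical values. The `run` helper is not part of clearml-data; it is a dry-run convenience added here that prints each command and executes it only when RUN=1 is set.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical values; replace with your own project, dataset name, and folder.
PROJECT="examples"
NAME="my_dataset"
DATA_DIR="./data"

# Dry-run helper (not part of clearml-data): print the command,
# and execute it only when RUN=1 is set in the environment.
run() {
  printf '%s\n' "$*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

workflow() {
  # Create the dataset; later commands reuse it, so --id is omitted.
  run clearml-data create --project "$PROJECT" --name "$NAME"
  # Add a whole folder to the dataset.
  run clearml-data add --files "$DATA_DIR"
  # Finalize; files that were not uploaded yet are uploaded automatically.
  run clearml-data close
}

workflow
```

Running the script as-is only prints the three commands; running it with RUN=1 executes them against the configured ClearML Server.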


Adding Files

clearml-data add --id <dataset_id> --files <filenames/folders_to_add>

It's possible to add individual files or complete folders.

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Dataset's ID. Default: previously created / accessed dataset | Yes |
| --files | Files / folders to add. Wildcard selection is supported, for example: ~/data/*.jpg ~/data/json | No |
| --dataset-folder | Dataset base folder to add the files to in the dataset. Default: dataset root | Yes |
| --non-recursive | Disable recursive scan of files | Yes |
| --verbose | Verbose reporting | Yes |

Removing Files

clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_to_remove>

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Dataset's ID. Default: previously created / accessed dataset | Yes |
| --files | Files / folders to remove (wildcard selection is supported, for example: ~/data/*.jpg ~/data/json). Note: the file path is the path within the dataset, not the local path. | No |
| --non-recursive | Disable recursive scan of files | Yes |
| --verbose | Verbose reporting | Yes |

Uploading Dataset Content

clearml-data upload [--id <dataset_id>] [--storage <upload_destination>]

Uploads added files to ClearML Server by default. It's possible to specify a different storage medium by entering an upload destination, such as s3://bucket, gs://, azure://, /mnt/shared/.

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Dataset's ID. Default: previously created / accessed dataset | Yes |
| --storage | Remote storage to use for the dataset files. Default: files_server | Yes |
| --verbose | Verbose reporting | Yes |
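
The choice between the default file server and a custom destination can be sketched as a small wrapper. This is only an illustration: the `upload` function and the `run` dry-run helper are hypothetical names introduced here, and `s3://bucket` is the placeholder destination from the description above. Commands are printed, and executed only when RUN=1 is set.

```shell
#!/usr/bin/env bash
set -eu

# Dry-run helper (not part of clearml-data): print the command,
# and execute it only when RUN=1 is set in the environment.
run() {
  printf '%s\n' "$*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Upload the current dataset; pass a destination to override the default storage.
upload() {
  if [ -n "${1:-}" ]; then
    run clearml-data upload --storage "$1"
  else
    run clearml-data upload
  fi
}

upload                 # default: the ClearML file server
upload "s3://bucket"   # placeholder bucket; replace with your own destination
```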

Finalizing a Dataset

clearml-data close --id <dataset_id>

Finalizes the dataset and makes it ready to be consumed. It automatically uploads all files that were not previously uploaded. Once a dataset is finalized, it can no longer be modified.

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Dataset's ID. Default: previously created / accessed dataset | Yes |
| --storage | Remote storage to use for the dataset files. Default: files_server | Yes |
| --disable-upload | Disable automatic upload when closing the dataset | Yes |
| --verbose | Verbose reporting | Yes |

Syncing Local Storage

clearml-data sync [--id <dataset_id>] --folder <folder_location> [--parents '<parent_id>']

This command syncs a folder's content with ClearML. It is useful when you have a single source of truth (i.e. a folder) that is updated from time to time.

When an update should be reflected in ClearML, call clearml-data sync with the folder to create a new dataset. The changes (file additions, modifications, and removals) are then reflected in ClearML.

This command also uploads the data and finalizes the dataset automatically.

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Dataset's ID. Default: previously created / accessed dataset | Yes |
| --folder | Local folder to sync. Wildcard selection is supported, for example: ~/data/*.jpg ~/data/json | No |
| --storage | Remote storage to use for the dataset files. Default: files_server | Yes |
| --parents | IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset | Yes |
| --project | If creating a new dataset, specify the dataset's project name | Yes |
| --name | If creating a new dataset, specify the dataset's name | Yes |
| --tags | Dataset user tags | Yes |
| --skip-close | Do not auto close dataset after syncing folders | Yes |
| --verbose | Verbose reporting | Yes |
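
A minimal sketch of two sync variants follows: one that creates, uploads, and finalizes a new dataset from a folder, and one that leaves the dataset open with --skip-close. The project name, dataset name, and folder are hypothetical, as are the `sync_new`, `sync_open`, and `run` helper names. Commands are printed, and executed only when RUN=1 is set.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical values; replace with your own.
PROJECT="examples"
NAME="synced_dataset"
FOLDER="./data"

# Dry-run helper (not part of clearml-data): print the command,
# and execute it only when RUN=1 is set in the environment.
run() {
  printf '%s\n' "$*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Sync the folder into a new dataset; it is uploaded and finalized automatically.
sync_new() {
  run clearml-data sync --project "$PROJECT" --name "$NAME" --folder "$FOLDER"
}

# Same sync, but keep the dataset open so it can still be modified afterwards.
sync_open() {
  run clearml-data sync --project "$PROJECT" --name "$NAME" --folder "$FOLDER" --skip-close
}

sync_new
sync_open
```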

Listing Dataset Content

clearml-data list [--id <dataset_id>]

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset | Yes |
| --project | Specify dataset project name (if used instead of ID, dataset name is also required) | Yes |
| --name | Specify dataset name (if used instead of ID, dataset project is also required) | Yes |
| --filter | Filter files based on folder / wildcard. Multiple filters are supported. Example: folder/date_*.json folder/sub-folder | Yes |
| --modified | Only list file changes (add / remove / modify) introduced in this version | Yes |

Deleting a Dataset

clearml-data delete [--id <dataset_id_to_delete>]

Deletes an entire dataset from ClearML. This can also be used to delete a newly created dataset.

This does not work on datasets with children.

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet | Yes |
| --force | Force dataset deletion even if other dataset versions depend on it | Yes |

Searching for a Dataset

clearml-data search [--ids <dataset_ids>] [--name <name>] [--project <project_name>] [--tags <tag>]

Lists all datasets in the system that match the search request.

Datasets can be searched by project, name, ID, and tags.

Parameters

| Name | Description | Optional |
|---|---|---|
| --ids | A list of dataset IDs | Yes |
| --project | The project name of the datasets | Yes |
| --name | A dataset name or a partial name to filter datasets by | Yes |
| --tags | A list of dataset user tags | Yes |

Comparing Two Datasets

clearml-data compare --source SOURCE --target TARGET

Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:

Comparison summary: 4 files removed, 3 files modified, 0 files added

Parameters

| Name | Description | Optional |
|---|---|---|
| --source | Source dataset id (used as baseline) | No |
| --target | Target dataset id (compare against the source baseline dataset) | No |
| --verbose | Verbose report all file changes (instead of summary) | Yes |
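
If the summary needs to be consumed by a script, its counts can be extracted with a small filter. The sketch below assumes the summary line has exactly the form shown above; the actual output may differ between versions, so treat the pattern as an assumption. `parse_summary` is a hypothetical helper name, and the sample line is the one from this page rather than live output.

```shell
#!/usr/bin/env bash
set -eu

# Read compare output on stdin and print "<removed> <modified> <added>".
# Assumes the summary line format shown in this page.
parse_summary() {
  sed -n 's/^Comparison summary: \([0-9][0-9]*\) file[s]* removed, \([0-9][0-9]*\) file[s]* modified, \([0-9][0-9]*\) file[s]* added.*$/\1 \2 \3/p'
}

# Sample line taken from the documentation above (not live output).
sample="Comparison summary: 4 files removed, 3 files modified, 0 files added"

read -r removed modified added <<EOF
$(printf '%s\n' "$sample" | parse_summary)
EOF

echo "removed=$removed modified=$modified added=$added"
# prints: removed=4 modified=3 added=0
```

With real data, the same filter would be fed from the command itself, for example `clearml-data compare --source <source_id> --target <target_id> | parse_summary`.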

Merging Datasets

clearml-data squash --name NAME --ids [IDS [IDS ...]]

Squash (merge) multiple datasets into a single dataset version.

Parameters

| Name | Description | Optional |
|---|---|---|
| --name | Create squashed dataset name | No |
| --ids | Source dataset IDs to squash (merge down) | No |
| --storage | Remote storage to use for the dataset files. Default: files_server | Yes |
| --verbose | Verbose report all file changes (instead of summary) | Yes |

Verifying a Dataset

clearml-data verify [--id ID] [--folder FOLDER]

Verify that the dataset content matches the data from the local source.

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Specify dataset ID. Default: previously created/accessed dataset | Yes |
| --folder | Specify dataset local copy (if not provided the local cache folder will be verified) | Yes |
| --filesize | If True, only verify file size and skip hash checks (default: false) | Yes |
| --verbose | Verbose report all file changes (instead of summary) | Yes |

Getting a Dataset

clearml-data get [--id ID] [--copy COPY] [--link LINK] [--overwrite]

Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the --copy flag.

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Specify dataset ID. Default: previously created / accessed dataset | Yes |
| --copy | Get a writable copy of the dataset to a specific output folder | Yes |
| --link | Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset | Yes |
| --overwrite | If True, overwrite the target folder | Yes |
| --verbose | Verbose report all file changes (instead of summary) | Yes |
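
The two ways of getting a dataset locally can be sketched as follows: a writable copy in an output folder, and a soft link to the read-only cache. `<dataset_id>` is a placeholder to fill in, the folder names are hypothetical, and `get_copy`, `get_link`, and `run` are helper names introduced only for this sketch. Commands are printed, and executed only when RUN=1 is set.

```shell
#!/usr/bin/env bash
set -eu

# Placeholder and hypothetical values; replace with your own.
DATASET_ID="<dataset_id>"
COPY_DIR="./dataset_copy"
LINK_PATH="./dataset_link"

# Dry-run helper (not part of clearml-data): print the command,
# and execute it only when RUN=1 is set in the environment.
run() {
  printf '%s\n' "$*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Writable copy in COPY_DIR, overwriting the folder if it already exists.
get_copy() {
  run clearml-data get --id "$DATASET_ID" --copy "$COPY_DIR" --overwrite
}

# Soft link to the read-only cached folder (not supported on Windows).
get_link() {
  run clearml-data get --id "$DATASET_ID" --link "$LINK_PATH"
}

get_copy
get_link
```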

Publishing a Dataset

clearml-data publish --id ID

Publish the dataset for public use. The dataset must be finalized before it is published.

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | The dataset task id to be published. | No |