ClearML Data CLI

important

This page covers clearml-data, ClearML's file-based data management solution. See Hyper-Datasets for ClearML's advanced queryable dataset management solution.

clearml-data is a data management CLI tool that comes as part of the clearml Python package. Use clearml-data to create, modify, and manage your datasets. You can upload your dataset to any storage service of your choice (S3 / GS / Azure / Network Storage) by setting the dataset's upload destination (see --storage). Once you have uploaded your dataset, you can access it from any machine.

This page provides a reference to clearml-data's CLI commands.

create

Creates a new dataset.

clearml-data create [-h] [--parents [PARENTS [PARENTS ...]]] [--project PROJECT]
--name NAME [--version VERSION] [--storage STORAGE]
[--tags [TAGS [TAGS ...]]]

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --name | Dataset's name | Yes |
| --project | Dataset's project | Yes |
| --version | Dataset version. Use the semantic versioning scheme. If not specified, a version is assigned automatically | No |
| --parents | IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, and they are merged in the order they were entered | No |
| --storage | Network storage target to upload the dataset files and associated information (default: files_server). For example: a shared folder (/mnt/share/folder); S3 (s3://bucket/folder); non-AWS S3-like services such as MinIO (s3://host_addr:port/bucket, where the port specification is required); Google Cloud Storage (gs://bucket-name/folder); Azure Storage (azure://<account name>.blob.core.windows.net/path/to/file) | No |
| --tags | Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets | No |

Dataset ID
  • For datasets created with clearml v1.6 or newer on ClearML Server v1.6 or newer, find the ID in the dataset version's info panel in the Dataset UI.
    For datasets created with earlier versions of clearml, or if using an earlier version of ClearML Server, find the ID in the task header of the dataset task's info panel.
  • clearml-data works in a stateful mode so once a new dataset is created, the following commands do not require the --id flag.
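
As a minimal sketch of the create command, the following creates a tagged dataset version whose files go to an S3 bucket. The project name, dataset name, tag values, and bucket path are hypothetical placeholders, and the run wrapper is an illustrative helper (not part of clearml-data) that only prints each command unless DRY_RUN=0 is set.

```shell
# Illustrative dry-run wrapper (not part of clearml-data):
# prints the command by default; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Create version 1.0.0 of a dataset and direct its files to S3.
# All names and the bucket path below are placeholders.
run clearml-data create \
  --project demo-project \
  --name demo-dataset \
  --version 1.0.0 \
  --storage s3://bucket/folder \
  --tags raw images
```

Because clearml-data is stateful, later commands in the same workflow can omit --id and act on the dataset created here.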

add

Add individual files or complete folders to the dataset.

clearml-data add [-h] [--id ID] [--dataset-folder DATASET_FOLDER]
[--files [FILES [FILES ...]]] [--wildcard [WILDCARD [WILDCARD ...]]]
[--links [LINKS [LINKS ...]]] [--non-recursive] [--verbose]

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --id | Dataset's ID. Default: previously created / accessed dataset | No |
| --files | Files / folders to add. Items will be uploaded to the dataset's designated storage. | No |
| --wildcard | Add a specific set of files, denoted by these wildcards. For example: ~/data/*.jpg ~/data/json. Multiple wildcards can be passed. | No |
| --links | Links to files / folders to add. Supports S3, GS, and Azure links. Example: s3://bucket/data azure://<account name>.blob.core.windows.net/path/to/file. Items remain in their original location. | No |
| --dataset-folder | Dataset base folder to add the files to in the dataset. Default: dataset root. | No |
| --non-recursive | Disable recursive scan of files | No |
| --verbose | Verbose reporting | No |
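
The difference between uploaded files and external links can be sketched as follows. The local paths and the bucket URL are hypothetical, and the run wrapper is an illustrative helper (not part of clearml-data) that prints commands unless DRY_RUN=0 is set.

```shell
# Illustrative dry-run wrapper (not part of clearml-data):
# prints the command by default; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Add a local folder; its files are uploaded to the dataset's storage
# and placed under the "images" folder inside the dataset.
run clearml-data add --files ./data/images --dataset-folder images

# Register files that already live in a bucket; they stay where they are.
run clearml-data add --links s3://bucket/data --verbose
```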

remove

Remove files/links from the dataset.

clearml-data remove [-h] [--id ID] [--files [FILES [FILES ...]]]
[--non-recursive] [--verbose]

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --id | Dataset's ID. Default: previously created / accessed dataset | No |
| --files | Files / folders to remove (wildcard selection is supported, for example: ~/data/*.jpg ~/data/json). Note: the file path is the path within the dataset, not the local path. For links, you can specify their URL (for example, s3://bucket/data) | Yes |
| --non-recursive | Disable recursive scan of files | No |
| --verbose | Verbose reporting | No |

upload

Upload the local dataset changes to the server. By default, the changes are uploaded to the ClearML file server. You can specify a different storage medium by entering an upload destination. For example:

  • A shared folder: /mnt/shared/folder
  • S3: s3://bucket/folder
  • Non-AWS S3-like services (such as MinIO): s3://host_addr:port/bucket. Note that port specification is required.
  • Google Cloud Storage: gs://bucket-name/folder
  • Azure Storage: azure://<account name>.blob.core.windows.net/path/to/file

clearml-data upload [-h] [--id ID] [--storage STORAGE] [--chunk-size CHUNK_SIZE]
[--verbose]

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --id | Dataset's ID. Default: previously created / accessed dataset | No |
| --storage | Remote storage to use for the dataset files. Default: files_server | No |
| --chunk-size | Dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). For example, with 512 the dataset is split and uploaded in 512 MB chunks. | No |
| --verbose | Verbose reporting | No |
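
A short sketch of the two chunking modes described above. The shared-folder path and the 256 MB value are arbitrary choices for illustration, and the run wrapper is an illustrative helper (not part of clearml-data) that prints commands unless DRY_RUN=0 is set.

```shell
# Illustrative dry-run wrapper (not part of clearml-data):
# prints the command by default; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Upload pending changes to a shared folder in 256 MB chunks.
run clearml-data upload --storage /mnt/shared/folder --chunk-size 256

# Alternatively, upload everything as a single chunk.
run clearml-data upload --storage /mnt/shared/folder --chunk-size -1
```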

close

Finalize the dataset and make it ready to be consumed. This automatically uploads all files that were not previously uploaded. Once a dataset is finalized, it can no longer be modified.

clearml-data close [-h] [--id ID] [--storage STORAGE] [--disable-upload]
[--chunk-size CHUNK_SIZE] [--verbose]

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --id | Dataset's ID. Default: previously created / accessed dataset | No |
| --storage | Network storage target to upload the dataset files and associated information (default: files_server). For example: a shared folder (/mnt/share/folder); S3 (s3://bucket/folder); non-AWS S3-like services such as MinIO (s3://host_addr:port/bucket, where the port specification is required); Google Cloud Storage (gs://bucket-name/folder); Azure Storage (azure://<account name>.blob.core.windows.net/path/to/file) | No |
| --disable-upload | Disable automatic upload when closing the dataset | No |
| --chunk-size | Dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). For example, with 512 the dataset is split and uploaded in 512 MB chunks. | No |
| --verbose | Verbose reporting | No |
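
Putting the commands together, a typical create, add, and close sequence might look like the sketch below. It relies on the stateful behavior noted under create (no --id needed after creation) and on close uploading any pending files. The project name, dataset name, and local path are hypothetical, and the run wrapper is an illustrative helper (not part of clearml-data) that prints commands unless DRY_RUN=0 is set.

```shell
# Illustrative dry-run wrapper (not part of clearml-data):
# prints the command by default; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# 1. Create the dataset (it becomes the "current" dataset).
run clearml-data create --project demo-project --name demo-dataset

# 2. Add local files to it; no --id is required.
run clearml-data add --files ./data

# 3. Finalize: pending files are uploaded and the dataset becomes read-only.
run clearml-data close
```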

sync

Sync a folder's content with ClearML. This command is useful when you have a single source of truth (for example, a folder) that is updated from time to time.

When an update should be reflected in ClearML, call clearml-data sync and pass the folder path. The changes (file additions, modifications, and removals) are then reflected in ClearML.

This command also uploads the data and finalizes the dataset automatically.

clearml-data sync [-h] [--id ID] [--dataset-folder DATASET_FOLDER] --folder FOLDER
[--parents [PARENTS [PARENTS ...]]] [--project PROJECT] [--name NAME]
[--version VERSION] [--storage STORAGE] [--tags [TAGS [TAGS ...]]]
[--skip-close] [--chunk-size CHUNK_SIZE] [--verbose]

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --id | Dataset's ID. Default: previously created / accessed dataset | No |
| --dataset-folder | Dataset base folder to add the files to (default: dataset root) | No |
| --folder | Local folder to sync. Wildcard selection is supported, for example: ~/data/*.jpg ~/data/json | Yes |
| --storage | Network storage target to upload the dataset files and associated information (default: files_server). For example: a shared folder (/mnt/share/folder); S3 (s3://bucket/folder); non-AWS S3-like services such as MinIO (s3://host_addr:port/bucket, where the port specification is required); Google Cloud Storage (gs://bucket-name/folder); Azure Storage (azure://<account name>.blob.core.windows.net/path/to/file) | No |
| --parents | IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset | No |
| --project | If creating a new dataset, specify the dataset's project name | No |
| --name | If creating a new dataset, specify the dataset's name | No |
| --version | Specify the dataset's version using the semantic versioning scheme. Default: 1.0.0 | No |
| --tags | Dataset user tags | No |
| --skip-close | Do not auto close dataset after syncing folders | No |
| --chunk-size | Dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). For example, with 512 the dataset is split and uploaded in 512 MB chunks. | No |
| --verbose | Verbose reporting | No |
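
A minimal sketch of syncing a folder, first with the default upload-and-finalize behavior and then with the dataset left open. The project name, dataset name, and folder path are hypothetical, and the run wrapper is an illustrative helper (not part of clearml-data) that prints commands unless DRY_RUN=0 is set.

```shell
# Illustrative dry-run wrapper (not part of clearml-data):
# prints the command by default; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Sync a local folder into a new dataset; by default the data is
# uploaded and the dataset is finalized in the same step.
run clearml-data sync --project demo-project --name demo-dataset --folder ./data

# Same sync, but leave the dataset open for further changes.
run clearml-data sync --project demo-project --name demo-dataset --folder ./data --skip-close
```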

list

List a dataset's contents.

clearml-data list [-h] [--id ID] [--project PROJECT] [--name NAME] [--version VERSION]
[--filter [FILTER [FILTER ...]]] [--modified]

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --id | Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset | No |
| --project | Specify dataset project name (if used instead of ID, dataset name is also required) | No |
| --name | Specify dataset name (if used instead of ID, dataset project is also required) | No |
| --version | Specify dataset version. Default: most recent version | No |
| --filter | Filter files based on folder / wildcard. Multiple filters are supported. Example: folder/date_*.json folder/subfolder | No |
| --modified | Only list file changes (add / remove / modify) introduced in this version | No |
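
For instance, the filter and modified options could be used as sketched below. The project and dataset names are hypothetical; the filter values are the ones from the table above, quoted so that the local shell does not expand the wildcard. The run wrapper is an illustrative helper (not part of clearml-data) that prints commands unless DRY_RUN=0 is set.

```shell
# Illustrative dry-run wrapper (not part of clearml-data):
# prints the command by default; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# List only files matching a wildcard or located under a subfolder.
run clearml-data list --project demo-project --name demo-dataset \
  --filter "folder/date_*.json" folder/subfolder

# List only the changes introduced in the most recent version.
run clearml-data list --project demo-project --name demo-dataset --modified
```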

set-description

Sets the description of an existing dataset.

clearml-data set-description [-h] [--id ID] [--description DESCRIPTION]

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --id | Dataset's ID | Yes |
| --description | Description to be set | Yes |
delete

Deletes dataset(s). Pass any of the attributes of the dataset(s) you want to delete. Multiple datasets matching the request will raise an exception, unless you pass --entire-dataset and --force. In this case, all matching datasets will be deleted.

If a dataset is a parent of one or more other datasets, you must pass --force to delete it.

warning

Deleting a parent dataset may cause child datasets to lose data!

clearml-data delete [-h] [--id ID] [--project PROJECT] [--name NAME]
[--version VERSION] [--force] [--entire-dataset]

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --id | ID of the dataset to delete (alternatively, use project / name combination). | No |
| --project | Specify dataset project name (if used instead of ID, dataset name is also required) | No |
| --name | Specify dataset name (if used instead of ID, dataset project is also required) | No |
| --version | Specify dataset version | No |
| --force | Force dataset deletion even if other dataset versions depend on it. Must also be used if the --entire-dataset flag is used | No |
| --entire-dataset | Delete all found datasets | No |
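
The two deletion modes can be sketched as follows. The dataset ID, project name, and dataset name are hypothetical placeholders. The run wrapper is an illustrative helper (not part of clearml-data) that prints commands unless DRY_RUN=0 is set, which is a sensible default for a destructive command.

```shell
# Illustrative dry-run wrapper (not part of clearml-data):
# prints the command by default; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

DATASET_ID="<dataset_id>"   # placeholder: replace with a real dataset ID

# Delete one specific dataset version by ID.
run clearml-data delete --id "$DATASET_ID"

# Delete every dataset matching project + name; both flags are required.
run clearml-data delete --project demo-project --name demo-dataset \
  --entire-dataset --force
```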

rename

Rename a dataset (and all of its versions).

clearml-data rename [-h] --new-name NEW_NAME --project PROJECT --name NAME

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --new-name | The new name of the dataset | Yes |
| --project | The project the dataset to be renamed belongs to | Yes |
| --name | The current name of the dataset(s) to be renamed | Yes |

move

Moves a dataset to another project.

clearml-data move [-h] --new-project NEW_PROJECT --project PROJECT --name NAME

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --new-project | The new project of the dataset | Yes |
| --project | The current project the dataset to be moved belongs to | Yes |
| --name | The name of the dataset to be moved | Yes |

search

Search datasets in the system by project, name, ID, and/or tags.

Returns a list of all datasets in the system that match the search request, sorted by creation time.

clearml-data search [-h] [--ids [IDS [IDS ...]]] [--project PROJECT]
[--name NAME] [--tags [TAGS [TAGS ...]]]

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --ids | A list of dataset IDs | No |
| --project | The project name of the datasets | No |
| --name | A dataset name or a partial name to filter datasets by | No |
| --tags | A list of dataset user tags | No |
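
As a small sketch, a search that combines a project, a partial name, and a tag might look like this. The project name, partial name, and tag are hypothetical, and the run wrapper is an illustrative helper (not part of clearml-data) that prints commands unless DRY_RUN=0 is set.

```shell
# Illustrative dry-run wrapper (not part of clearml-data):
# prints the command by default; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Find datasets in one project whose name contains "demo" and that carry the "raw" tag.
run clearml-data search --project demo-project --name demo --tags raw
```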

compare

Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:

Comparison summary: 4 files removed, 3 files modified, 0 files added

clearml-data compare [-h] --source SOURCE --target TARGET [--verbose]

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --source | Source dataset ID (used as baseline) | Yes |
| --target | Target dataset ID (compare against the source baseline dataset) | Yes |
| --verbose | Verbose report all file changes (instead of summary) | No |
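
A minimal sketch of comparing a newer dataset version against an older baseline. Both IDs are hypothetical placeholders, and the run wrapper is an illustrative helper (not part of clearml-data) that prints commands unless DRY_RUN=0 is set.

```shell
# Illustrative dry-run wrapper (not part of clearml-data):
# prints the command by default; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

SOURCE_ID="<baseline_dataset_id>"   # placeholder: the baseline dataset
TARGET_ID="<newer_dataset_id>"      # placeholder: the dataset compared against it

# Print only the comparison summary.
run clearml-data compare --source "$SOURCE_ID" --target "$TARGET_ID"

# Report every changed file instead of the summary.
run clearml-data compare --source "$SOURCE_ID" --target "$TARGET_ID" --verbose
```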

squash

Squash multiple datasets into a single dataset version (merge down).

clearml-data squash [-h] --name NAME --ids [IDS [IDS ...]] [--storage STORAGE] [--verbose]

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --name | Squashed dataset name | Yes |
| --ids | Source dataset IDs to squash (merge down) | Yes |
| --storage | Network storage target to upload the dataset files and associated information (default: files_server). For example: a shared folder (/mnt/share/folder); S3 (s3://bucket/folder); non-AWS S3-like services such as MinIO (s3://host_addr:port/bucket, where the port specification is required); Google Cloud Storage (gs://bucket-name/folder); Azure Storage (azure://<account name>.blob.core.windows.net/path/to/file) | No |
| --verbose | Verbose report all file changes (instead of summary) | No |

verify

Verify that the dataset content matches the data from the local source.

clearml-data verify [-h] [--id ID] [--folder FOLDER] [--filesize] [--verbose]

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --id | Specify dataset ID. Default: previously created / accessed dataset | No |
| --folder | Specify dataset local copy (if not provided, the local cache folder will be verified) | No |
| --filesize | If True, only verify file size and skip hash checks (default: False) | No |
| --verbose | Verbose report all file changes (instead of summary) | No |

get

Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the --copy flag.

clearml-data get [-h] [--id ID] [--copy COPY] [--link LINK] [--part PART]
[--num-parts NUM_PARTS] [--overwrite] [--verbose]

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --id | Specify dataset ID. Default: previously created / accessed dataset | No |
| --copy | Get a writable copy of the dataset to a specific output folder | No |
| --link | Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset | No |
| --part | Retrieve a partial copy of the dataset. The part number ranges from 0 to --num-parts minus 1, out of --num-parts total parts. | No |
| --num-parts | Total number of parts to divide the dataset into. Note that the minimum retrieved part is a single chunk in a dataset (or its parents). Example: Dataset gen4, with 3 parents, each with a single chunk, can be divided into 4 parts | No |
| --overwrite | If True, overwrite the target folder | No |
| --verbose | Verbose report all file changes (instead of summary) | No |
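
The part numbering can be illustrated with a loop that fetches each part in turn, for example to split a dataset across workers. This is a sketch under assumptions: the dataset ID, the choice of 4 parts, and the output folder names are hypothetical, and it assumes the dataset contains at least 4 chunks. The run wrapper is an illustrative helper (not part of clearml-data) that prints commands unless DRY_RUN=0 is set.

```shell
# Illustrative dry-run wrapper (not part of clearml-data):
# prints the command by default; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

DATASET_ID="<dataset_id>"   # placeholder: replace with a real dataset ID
NUM_PARTS=4                 # hypothetical number of parts

# Parts are numbered 0 .. NUM_PARTS-1; fetch each into its own writable folder.
part=0
while [ "$part" -lt "$NUM_PARTS" ]; do
  run clearml-data get --id "$DATASET_ID" \
    --part "$part" --num-parts "$NUM_PARTS" \
    --copy "./dataset_part_$part"
  part=$((part + 1))
done
```

In practice each worker would run only the command for its own part number rather than looping over all of them.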

publish

Publish the dataset for public use. The dataset must be finalized before it is published.

clearml-data publish [-h] --id ID

Parameters

| Name | Description | Mandatory |
|---|---|---|
| --id | The dataset task ID to be published. | Yes |