
SDK

important

This page covers clearml-data, ClearML's file-based data management solution. See Hyper-Datasets for ClearML's advanced queryable dataset management solution.

Datasets can be created, modified, and managed with ClearML Data's Python interface. You can upload your dataset to any storage service of your choice (S3 / GS / Azure / network storage) by setting the dataset’s upload destination (see the output_url parameter of the Dataset.upload method). Once you have uploaded your dataset, you can access it from any machine.

This page provides an overview of the most basic methods of the Dataset class. See the Dataset reference page for a complete list of available methods.

Import the Dataset class, and let's get started!

from clearml import Dataset

Creating Datasets

ClearML Data supports multiple ways to create datasets programmatically, covering a variety of use cases:

  • Dataset.create() - Create a new dataset. Parent datasets can be specified, from which the new dataset will inherit its data.
  • Dataset.squash() - Generate a new dataset by squashing together a set of related datasets.

Dataset.create()

Use the Dataset.create class method to create a dataset.

Creating datasets programmatically is especially helpful when preprocessing the data so that the preprocessing code and the resulting dataset are saved in a single task (see use_current_task parameter in Dataset.create).

# Preprocessing code here
dataset = Dataset.create(
    dataset_name='dataset name',
    dataset_project='dataset project',
    parent_datasets=[PARENT_DS_ID_1, PARENT_DS_ID_2],
    dataset_version="1.0",
    output_uri="gs://bucket-name/folder",
    description='my dataset description'
)

Locating Dataset ID

For datasets created with clearml v1.6 or newer on ClearML Server v1.6 or newer, find the ID in the dataset version’s info panel in the Dataset UI.
For datasets created with earlier versions of clearml, or if using an earlier version of ClearML Server, find the ID in the task header of the dataset task's info panel.

Dataset Version

Specify the dataset's version using the semantic versioning scheme (e.g. 1.0.1, 2.0). If no version is specified, the method looks for the latest dataset version with the specified dataset_name and dataset_project and auto-increments its version number.

Use the output_uri parameter to specify a network storage target for the dataset files and associated information, such as previews (e.g. s3://bucket/data, gs://bucket/data, azure://bucket/data, file:///mnt/share/data). By default, the dataset uploads to ClearML's file server. The output_url parameter of the Dataset.upload method overrides this parameter’s value.

The created dataset inherits the content of the parent_datasets. When multiple dataset parents are listed, they are merged in order of specification. Each parent overrides any overlapping files from a previous parent dataset.
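
The merge order described above can be pictured with plain dictionaries. The following is a minimal sketch of the override rule only, not ClearML's implementation: each parent is modeled as a hypothetical mapping from relative file path to a content hash, and `merge_parents` is an illustrative helper name.

```python
from typing import Dict, List


def merge_parents(parents: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge parent file listings in order; later parents win on overlap."""
    merged: Dict[str, str] = {}
    for parent in parents:
        # dict.update overwrites entries whose relative path already exists
        merged.update(parent)
    return merged


# Hypothetical file listings: relative path -> content hash
parent_1 = {"images/a.jpg": "hash-a1", "labels.csv": "hash-l1"}
parent_2 = {"labels.csv": "hash-l2", "images/b.jpg": "hash-b2"}

child_files = merge_parents([parent_1, parent_2])
print(child_files["labels.csv"])  # the entry contributed by parent_2
```

Reversing the list would keep the `labels.csv` entry from `parent_1` instead, which is why the order of `parent_datasets` matters when parents share file paths.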

Dataset.squash()

To improve deep dataset DAG storage and speed, dataset squashing was introduced. The Dataset.squash class method generates a new dataset by squashing a set of dataset versions, and merging down all changes introduced in their lineage DAG, creating a new, flat, independent version.

The datasets being squashed into a single dataset can be specified by their IDs or by project & name pairs.

# option 1 - list dataset IDs
squashed_dataset_1 = Dataset.squash(
    dataset_name='squashed dataset\'s name',
    dataset_ids=[DS1_ID, DS2_ID, DS3_ID]
)
# option 2 - list project and dataset pairs
squashed_dataset_2 = Dataset.squash(
    dataset_name='squashed dataset 2',
    dataset_project_name_pairs=[
        ('dataset1 project', 'dataset1 name'),
        ('dataset2 project', 'dataset2 name')
    ]
)

In addition, the target storage location for the squashed dataset can be specified using the output_url parameter of the Dataset.squash method.

Accessing Datasets

Once a dataset has been created and uploaded to a server, the dataset can be accessed programmatically from anywhere.

Use the Dataset.get class method to access a specific Dataset object by providing any of the following dataset attributes: dataset ID, project, name, tags, and/or version. If multiple datasets match the query, the most recent one is returned.

dataset = Dataset.get(
    dataset_id=None,
    dataset_project="Example Project",
    dataset_name="Example Dataset",
    dataset_tags=["my tag"],  # tags are passed as a list of strings
    dataset_version="1.2",
    only_completed=True,
    only_published=False,
)

Pass auto_create=True, and a dataset will be created on-the-fly with the input attributes (project name, dataset name, and tags) if no datasets match the query.

In cases where you use a dataset in a task (e.g. consuming a dataset), you can have its ID stored in the task’s hyper parameters: pass alias=<dataset_alias_string>, and the task using the dataset will store the dataset’s ID in the dataset_alias_string parameter under the Datasets hyper parameters section. This way you can easily track which dataset the task is using. If you use alias with overridable=True, you can override the dataset ID from the UI’s CONFIGURATION > HYPER PARAMETERS > Datasets section, allowing you to change the dataset used when running a task remotely.

To get a modifiable dataset, pass writable_copy=True. This returns a newly created mutable dataset with the current one as its parent.

Once a specific dataset object has been obtained, get a local copy of the dataset using one of the following options:

  • Dataset.get_local_copy() - get a read-only local copy of an entire dataset. This method returns a path to the dataset in local cache (downloading the dataset if it is not already in cache).
  • Dataset.get_mutable_local_copy() - get a writable local copy of an entire dataset. This method downloads the dataset to a specific folder (non-cached), specified with the target_folder parameter. If the specified folder already has contents, specify whether to overwrite its contents with the dataset contents, using the overwrite parameter.
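
As a rough illustration of the two options above, the sketch below fetches a dataset and returns either the cached read-only path or a writable copy in a folder of your choice. `local_dataset_path` is a hypothetical helper and the project and dataset names are placeholders. The `dataset_cls` argument is an assumption added only so the function can be exercised without a ClearML server; in normal use it falls back to `clearml.Dataset`.

```python
from typing import Any, Optional


def local_dataset_path(
    project: str,
    name: str,
    target_folder: Optional[str] = None,
    dataset_cls: Optional[Any] = None,
) -> str:
    """Return a local path holding the files of the most recent matching dataset."""
    if dataset_cls is None:
        # Imported lazily so the helper can be defined without clearml installed
        from clearml import Dataset
        dataset_cls = Dataset

    dataset = dataset_cls.get(dataset_project=project, dataset_name=name)
    if target_folder is None:
        # Read-only copy served from the local cache
        return dataset.get_local_copy()
    # Writable copy downloaded into target_folder, replacing existing contents
    return dataset.get_mutable_local_copy(target_folder=target_folder, overwrite=True)


# Typical use (requires a configured ClearML environment):
# data_dir = local_dataset_path("Example Project", "Example Dataset")
# work_dir = local_dataset_path("Example Project", "Example Dataset", target_folder="./data")
```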

ClearML supports parallel downloading of datasets. Use the max_workers parameter of the Dataset.get_local_copy or Dataset.get_mutable_local_copy methods to specify the number of threads to use when downloading the dataset. By default, it’s the number of your machine’s logical cores.

Modifying Datasets

Once a dataset has been created, its contents can be modified and replaced. When your data is changed, you can add updated files or remove unnecessary files.

add_files()

To add local files or folders into the current dataset, use the Dataset.add_files method.

If a file is already in a dataset, but it has been modified, it can be added again, and ClearML will upload the file diff.

dataset = Dataset.create(dataset_name="my dataset", dataset_project="example project")
dataset.add_files(path="path/to/folder_or_file")

To add a set of files based on wildcard matching, use the wildcard parameter, which accepts a single string or a list of strings. Specify whether to match the wildcard files recursively using the recursive parameter.

For example:

dataset.add_files(
    path="path/to/folder",
    wildcard="*.jpg",  # pattern is matched against files under `path`
    recursive=True
)
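
To get a feel for how wildcard and recursive interact, the sketch below mimics the selection with `pathlib` on a throwaway directory. It is an approximation for illustration, assuming glob-style matching relative to `path`; it is not ClearML's internal code, and `select_files` is a hypothetical helper.

```python
import tempfile
from pathlib import Path
from typing import List


def select_files(path: str, wildcard: str = "*", recursive: bool = True) -> List[str]:
    """List files under `path` matching `wildcard`, as sorted relative paths."""
    root = Path(path)
    # rglob descends into subfolders; glob only looks at the top level
    matches = root.rglob(wildcard) if recursive else root.glob(wildcard)
    return sorted(p.relative_to(root).as_posix() for p in matches if p.is_file())


with tempfile.TemporaryDirectory() as folder:
    root = Path(folder)
    (root / "nested").mkdir()
    for rel in ("a.jpg", "notes.txt", "nested/b.jpg"):
        (root / rel).write_text("placeholder")

    top_level_only = select_files(folder, "*.jpg", recursive=False)
    all_levels = select_files(folder, "*.jpg", recursive=True)

print(top_level_only)
print(all_levels)
```

With recursive=False only the top-level `a.jpg` is selected, while recursive=True also picks up `nested/b.jpg`.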

add_external_files()

To add files or folders to the current dataset, leaving them in their original location, use the Dataset.add_external_files method. Input the source_url argument, which can be a link from cloud storage (s3://, gs://, azure://) or local / network storage (file://).

dataset = Dataset.create(dataset_name="my dataset", dataset_project="example project")
dataset.add_external_files(
    source_url="s3://my/bucket/path_to_folder_or_file",
    dataset_path="/my_dataset/new_folder/"
)

To add a set of files based on wildcard matching, use the wildcard parameter, which accepts a single string or a list of wildcards. Specify whether to match the wildcard files recursively using the recursive parameter.

# Add all jpg files located under s3://my/bucket/ to the dataset:
dataset.add_external_files(
    source_url="s3://my/bucket/",
    wildcard="*.jpg",
    dataset_path="/my_dataset/new_folder/"
)

remove_files()

To remove files from a current dataset, use the Dataset.remove_files method. Input the path to the folder or file to be removed in the dataset_path parameter. The path is relative to the dataset. To remove links, specify their URL (e.g. s3://bucket/file).

You can also pass a wildcard in dataset_path to remove all files matching it. Set the recursive parameter to True to match the wildcard files recursively.

For example:

dataset.remove_files(dataset_path="*.csv", recursive=True)

Uploading Files

To upload the dataset files to network storage, use the Dataset.upload method.

Use the output_url parameter to specify the storage target, such as S3 / GS / Azure (e.g. s3://bucket/data, gs://bucket/data, azure://bucket/data, /mnt/share/data). By default, the dataset uploads to ClearML's file server. This target storage overrides the output_uri value of the Dataset.create method.

ClearML supports parallel uploading of datasets. Use the max_workers parameter to specify the number of threads to use when uploading the dataset. By default, it’s the number of your machine’s logical cores.

Dataset files must be uploaded before a dataset is finalized.

Finalizing a Dataset

Use the Dataset.finalize method to close the current dataset. This marks the dataset task as Completed, at which point, the dataset can no longer be modified.

Before closing a dataset, its files must first be uploaded.
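
Putting the create, add, upload, and finalize steps in order, a minimal end-to-end sketch might look like the following. `publish_folder` is a hypothetical helper, and the names and paths are placeholders. The `dataset_cls` argument is an assumption added only so the flow can be exercised without a server; by default the function uses `clearml.Dataset`.

```python
from typing import Any, Optional


def publish_folder(
    folder: str,
    project: str,
    name: str,
    output_url: Optional[str] = None,
    dataset_cls: Optional[Any] = None,
) -> str:
    """Create a dataset version from a local folder, upload it, and finalize it."""
    if dataset_cls is None:
        # Imported lazily so the helper can be defined without clearml installed
        from clearml import Dataset
        dataset_cls = Dataset

    dataset = dataset_cls.create(dataset_name=name, dataset_project=project)
    dataset.add_files(path=folder)
    # Upload must happen before finalize; output_url=None keeps the default target
    dataset.upload(output_url=output_url)
    # After this call the dataset version can no longer be modified
    dataset.finalize()
    return dataset.id


# Typical use (requires a configured ClearML environment):
# new_id = publish_folder("path/to/folder", "example project", "my dataset")
```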

Syncing Local Storage

Use the Dataset.sync_folder method to update a dataset according to a specific folder's content changes. Specify the folder to sync with the local_path parameter (the method includes all files within the folder, recursively).

This method is useful in the case where there's a single point of truth, either a local or network folder, that gets updated periodically. The folder changes will be reflected in a new dataset version. This method saves time since you don't have to manually update (add / remove) files in a dataset.
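
Conceptually, syncing compares the folder's current state with the files already registered in the dataset and derives the additions, modifications, and removals. The sketch below illustrates that comparison with content hashes in plain Python. It is an illustration under that assumption, not ClearML's implementation; `folder_state` and `diff_states` are hypothetical helpers.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Set, Tuple


def folder_state(folder: str) -> Dict[str, str]:
    """Map each file's relative path to the SHA-256 of its content."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }


def diff_states(old: Dict[str, str], new: Dict[str, str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (added, modified, removed) relative paths between two states."""
    added = set(new) - set(old)
    removed = set(old) - set(new)
    modified = {path for path in set(old) & set(new) if old[path] != new[path]}
    return added, modified, removed


with tempfile.TemporaryDirectory() as folder:
    root = Path(folder)
    (root / "keep.txt").write_text("same")
    (root / "edit.txt").write_text("v1")
    (root / "drop.txt").write_text("old")
    before = folder_state(folder)

    # Simulate the folder changing between two syncs
    (root / "edit.txt").write_text("v2")
    (root / "drop.txt").unlink()
    (root / "new.txt").write_text("fresh")
    added, modified, removed = diff_states(before, folder_state(folder))

print(added, modified, removed)
```

Unchanged files such as `keep.txt` appear in none of the three sets, which is why a sync only needs to touch what actually changed.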

Deleting Datasets

Delete a dataset using the Dataset.delete class method. Input any of the attributes of the dataset(s) you want to delete, including ID, project name, version, and/or dataset name. If multiple datasets match the query, an exception is raised unless you pass entire_dataset=True and force=True, in which case all matching datasets are deleted.

If a dataset is a parent of other datasets, you must pass force=True in order to delete it.

warning

Deleting a parent dataset may cause child datasets to lose data!

Dataset.delete(
    dataset_id=None,
    dataset_project="example project",
    dataset_name="example dataset",
    force=False,
    dataset_version="3.0",
    entire_dataset=False
)

Renaming Datasets

Rename a dataset using the Dataset.rename class method. All the datasets with the given dataset_project and dataset_name will be renamed.

Dataset.rename(
    new_dataset_name="New name",
    dataset_project="Example project",
    dataset_name="Example dataset",
)

Moving Datasets to Another Project

Move a dataset to another project using the Dataset.move_to_project class method. All the datasets with the given dataset_project and dataset_name will be moved to the new dataset project.

Dataset.move_to_project(
    new_dataset_project="New project",
    dataset_project="Example project",
    dataset_name="Example dataset",
)