Dataset

class Dataset()#

Do not use directly! Use Dataset.create(…) or Dataset.get(…) instead.


add_files#

add_files(path, wildcard=None, local_base_folder=None, dataset_path=None, recursive=True, verbose=False)

Add a folder/file to the current dataset. Calculate each file's hash, compare it against the parent datasets, and mark the files that need to be uploaded.

  • Parameters

    • path (Union[str, Path, _Path]) – Add a folder/file to the dataset

    • wildcard (Optional[Union[str, Sequence[str]]]) – Add only a specific set of files. Wildcard matching; can be a single string or a list of wildcards

    • local_base_folder (Optional[str]) – Files will be located based on their relative path from local_base_folder

    • dataset_path (Optional[str]) – Where in the dataset the folder/files should be located

    • recursive (bool) – If True, match all wildcard files recursively

    • verbose (bool) – If True, print to console the files added/modified

  • Return type

    ()

  • Returns

    number of files added
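
A minimal usage sketch (the project name, dataset name, and local path are illustrative):

from clearml import Dataset

# create a new dataset version and register the local JPEG files under it
dataset = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
dataset.add_files(path="/data/raw/images", wildcard="*.jpg", recursive=True, verbose=True)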


Dataset.create#

classmethod create(dataset_name=None, dataset_project=None, dataset_tags=None, parent_datasets=None, use_current_task=False)

Create a new dataset. Multiple dataset parents are supported. Merging of parent datasets is done based on their order, where each parent can override overlapping files in the previous one.

  • Parameters

    • dataset_name (Optional[str]) – Name of the new dataset

    • dataset_project (Optional[str]) – Project containing the dataset. If not specified, infer the project name from the parent datasets

    • dataset_tags (Optional[Sequence[str]]) – Optional, list of tags (strings) to attach to the newly created Dataset

    • parent_datasets (Optional[Sequence[Union[str, Dataset]]]) – Expand a parent dataset by adding/removing files

    • use_current_task (bool) – False (default), a new Dataset task is created. If True, the dataset is created on the current Task.

  • Return type

    Dataset

  • Returns

    Newly created Dataset object
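
A sketch of creating a new version on top of an existing one (the names and parent ID are placeholders):

from clearml import Dataset

# the child version starts from the files of the given parent and can override them
child = Dataset.create(
    dataset_name="my_dataset_v2",
    dataset_project="my_project",
    parent_datasets=["<parent_dataset_id>"],
)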


Dataset.delete#

classmethod delete(dataset_id=None, dataset_project=None, dataset_name=None, force=False)

Delete a dataset. Raise an exception if the dataset is used by other dataset versions. Use force=True to forcefully delete the dataset.

  • Parameters

    • dataset_id (Optional[str]) – Dataset ID to delete

    • dataset_project (Optional[str]) – Project containing the dataset

    • dataset_name (Optional[str]) – Name of the dataset to delete

    • force (bool) – If True, delete even if other datasets depend on the specified dataset version

  • Return type

    ()
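
A hedged example (project and dataset names are illustrative):

from clearml import Dataset

# delete by project/name; set force=True to remove it even if other versions depend on it
Dataset.delete(dataset_project="my_project", dataset_name="my_dataset", force=False)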


file_entries_dict#

property file_entries_dict

Notice this call returns an internal representation, do not modify!

  • Return type

    Mapping[str, FileEntry]

  • Returns

    Dict with relative file path as key, and FileEntry as value
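
For example, the entries of an existing dataset object can be inspected as follows (a sketch that relies only on the documented Mapping[str, FileEntry] return value):

# "dataset" is a Dataset instance obtained via Dataset.create(...) or Dataset.get(...)
for relative_path, file_entry in dataset.file_entries_dict.items():
    # keys are paths relative to the dataset root, values are FileEntry objects
    print(relative_path, file_entry)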


finalize#

finalize(verbose=False, raise_on_error=True)

Finalize the dataset, publishing the dataset Task. upload() must first be called to verify there are no pending uploads. If files still need to be uploaded, an exception is raised (or False is returned).

  • Parameters

    • verbose (bool) – If True print verbose progress report

    • raise_on_error (bool) – If True raise exception if dataset finalizing failed

  • Return type

    bool


Dataset.get#

classmethod get(dataset_id=None, dataset_project=None, dataset_name=None, dataset_tags=None, only_completed=False, only_published=False)

Get a specific Dataset. If only dataset_project is given, return the last Dataset in the Dataset project

  • Parameters

    • dataset_id (Optional[str]) – Requested Dataset ID

    • dataset_project (Optional[str]) – Requested Dataset project name

    • dataset_name (Optional[str]) – Requested Dataset name

    • dataset_tags (Optional[Sequence[str]]) – Requested Dataset tags (list of tag strings)

    • only_completed (bool) – Return only if the requested dataset is completed or published

    • only_published (bool) – Return only if the requested dataset is published

  • Return type

    Dataset

  • Returns

    Dataset object
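
A minimal retrieval sketch (the project/dataset names and the ID are placeholders):

from clearml import Dataset

# fetch the latest completed version by project and name
dataset = Dataset.get(dataset_project="my_project", dataset_name="my_dataset", only_completed=True)

# or fetch a specific version directly by its ID
dataset = Dataset.get(dataset_id="<dataset_id>")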


get_default_storage#

get_default_storage()

Return the default storage location of the dataset

  • Return type

    Optional[str]

  • Returns

    URL for the default storage location


get_dependency_graph#

get_dependency_graph()

Return the DAG of the dataset dependencies (all previous dataset versions and their parents).

Example:

{
'current_dataset_id': ['parent_1_id', 'parent_2_id'],
'parent_2_id': ['parent_1_id'],
'parent_1_id': [],
}
  • Returns

    Dict representing the genealogy DAG of the current dataset
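
The returned dict can be walked directly, for example:

# "dataset" is a Dataset instance obtained via Dataset.get(...)
graph = dataset.get_dependency_graph()
for dataset_id, parent_ids in graph.items():
    # each key is a dataset version ID, each value is the list of its parent IDs
    print(dataset_id, "->", parent_ids)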


get_local_copy#

get_local_copy(use_soft_links=None, part=None, num_parts=None, raise_on_error=True)

Return a base folder with a read-only (immutable) local copy of the entire dataset.

Download and copy / soft-link files from all the parent dataset versions.
  • Parameters

    • use_soft_links (Optional[bool]) – If True, use soft links. Default: False on Windows, True on POSIX systems

    • part (Optional[int]) – Optional, if provided download only the selected part (index) of the Dataset. The first part number is 0 and the last part is num_parts-1. Notice, if num_parts is not provided, the number of parts will be equal to the total number of chunks (i.e. sum over all chunks from the specified Dataset including all parent Datasets). This argument is passed to parent datasets, as well as the implicit num_parts, allowing users to get a partial copy of the entire dataset, for multi-node/step processing.

    • num_parts (Optional[int]) – Optional, if specified normalize the number of chunks stored to the requested number of parts. Notice that the actual chunks used per part are rounded down. Example: assuming a total of 8 chunks for this dataset (including parent datasets), and num_parts=5, the chunk index used per part would be: part=0 -> chunks[0,5], part=1 -> chunks[1,6], part=2 -> chunks[2,7], part=3 -> chunks[3, ]

    • raise_on_error (bool) – If True raise exception if dataset merging failed on any file

  • Return type

    str

  • Returns

    A base folder for the entire dataset
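
A sketch of fetching a partial copy for multi-node processing (the worker rank and count are illustrative):

# "dataset" is a Dataset instance; each of 4 workers downloads only its own share of the chunks
worker_rank, worker_count = 0, 4
local_folder = dataset.get_local_copy(part=worker_rank, num_parts=worker_count)
print("read-only copy available under", local_folder)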


get_logger#

get_logger()

Return a Logger object for the Dataset, allowing users to report statistics metrics and debug samples on the Dataset itself

  • Return type

    Logger

  • Returns

    Logger object


get_mutable_local_copy#

get_mutable_local_copy(target_folder, overwrite=False, part=None, num_parts=None, raise_on_error=True)

Return a base folder with a writable (mutable) local copy of the entire dataset.

Download and copy / soft-link files from all the parent dataset versions.
  • Parameters

    • target_folder (Union[Path, _Path, str]) – Target folder for the writable copy

    • overwrite (bool) – If True, recursively delete the target folder before creating a copy. If False (default) and the target folder contains files, raise an exception or return None

    • part (Optional[int]) – Optional, if provided download only the selected part (index) of the Dataset. The first part number is 0 and the last part is num_parts-1. Notice, if num_parts is not provided, the number of parts will be equal to the total number of chunks (i.e. sum over all chunks from the specified Dataset including all parent Datasets). This argument is passed to parent datasets, as well as the implicit num_parts, allowing users to get a partial copy of the entire dataset, for multi-node/step processing.

    • num_parts (Optional[int]) – Optional, if specified normalize the number of chunks stored to the requested number of parts. Notice that the actual chunks used per part are rounded down. Example: assuming a total of 8 chunks for this dataset (including parent datasets), and num_parts=5, the chunk index used per part would be: part=0 -> chunks[0,5], part=1 -> chunks[1,6], part=2 -> chunks[2,7], part=3 -> chunks[3, ]

    • raise_on_error (bool) – If True raise exception if dataset merging failed on any file

  • Return type

    Optional[str]

  • Returns

    The target folder containing the entire dataset
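
A hedged example (the target folder path is illustrative):

# copy the dataset into a writable folder; with overwrite=False and raise_on_error=False,
# a non-empty target folder makes the call return None instead of raising
target = dataset.get_mutable_local_copy(
    target_folder="/tmp/my_dataset_copy", overwrite=False, raise_on_error=False
)
if target is None:
    print("target folder already contains files")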


get_num_chunks#

get_num_chunks(include_parents=True)

Return the number of chunks stored on this dataset (it does not imply the number of chunks stored by parent versions).

  • Parameters

    include_parents (bool) – If True (default), return the total number of chunks from this version and all parent versions. If False, return only the number of chunks stored on this specific version.

  • Return type

    int

  • Returns

    Number of chunks stored on the dataset.


is_dirty#

is_dirty()

Return True if the dataset has pending uploads (i.e. we cannot finalize it)

  • Return type

    bool

  • Returns

    True means the dataset has pending uploads; call upload() to start the upload process.


is_final#

is_final()

Return True if the dataset was finalized and cannot be changed any more.

  • Return type

    bool

  • Returns

    True if the dataset is final


list_added_files#

list_added_files(dataset_id=None)

Return a list of files added when comparing to a specific dataset version.

  • Parameters

    dataset_id (Optional[str]) – Dataset ID (str) to compare against. If None is given, compare against the parent datasets

  • Return type

    List[str]

  • Returns

    List of files with relative path (files might not be available locally until get_local_copy() is called)


Dataset.list_datasets#

classmethod list_datasets(dataset_project=None, partial_name=None, tags=None, ids=None, only_completed=True)

Query the list of datasets in the system.

  • Parameters

    • dataset_project (Optional[str]) – Specify dataset project name

    • partial_name (Optional[str]) – Specify partial match to a dataset name

    • tags (Optional[Sequence[str]]) – Specify user tags

    • ids (Optional[Sequence[str]]) – List specific dataset based on IDs list

    • only_completed (bool) – If False, also return datasets that are still in progress (uploading/being edited, etc.)

  • Return type

    List[dict]

  • Returns

    List of dictionaries with dataset information. Example: [{'name': name, 'project': project name, 'id': dataset_id, 'created': date_created},]
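
A query sketch (the project name and the name fragment are illustrative):

from clearml import Dataset

datasets = Dataset.list_datasets(dataset_project="my_project", partial_name="images", only_completed=True)
for entry in datasets:
    print(entry["id"], entry["name"], entry["created"])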


list_files#

list_files(dataset_path=None, recursive=True, dataset_id=None)

Return a list of files in the current dataset. If dataset_id is provided, return a list of files that remained unchanged since the specified dataset version.

  • Parameters

    • dataset_path (Optional[str]) – Only match files matching the dataset_path (including wildcards). Example: 'folder/sub/*.json'

    • recursive (bool) – If True (default), match dataset_path recursively

    • dataset_id (Optional[str]) – Filter list based on the dataset id containing the latest version of the file. Default: None, do not filter files based on parent dataset.

  • Return type

    List[str]

  • Returns

    List of files with relative path (files might not be available locally until get_local_copy() is called)
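
For example, to list only the JSON files under a sub-folder (the path pattern is illustrative):

# "dataset" is a Dataset instance obtained via Dataset.get(...)
json_files = dataset.list_files(dataset_path="folder/sub/*.json", recursive=True)
for relative_path in json_files:
    print(relative_path)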


list_modified_files#

list_modified_files(dataset_id=None)

Return a list of files modified when comparing to a specific dataset version.

  • Parameters

    dataset_id (Optional[str]) – Dataset ID (str) to compare against. If None is given, compare against the parent datasets

  • Return type

    List[str]

  • Returns

    List of files with relative path (files might not be available locally until get_local_copy() is called)


list_removed_files#

list_removed_files(dataset_id=None)

Return a list of files removed when comparing to a specific dataset version.

  • Parameters

    dataset_id (Optional[str]) – Dataset ID (str) to compare against. If None is given, compare against the parent datasets

  • Return type

    List[str]

  • Returns

    List of files with relative path (files might not be available locally until get_local_copy() is called)


publish#

publish(raise_on_error=True)

Publish the dataset. If the dataset is not finalized, raise an exception.

  • Parameters

    raise_on_error (bool) – If True raise exception if dataset publishing failed

  • Return type

    bool


remove_files#

remove_files(dataset_path=None, recursive=True, verbose=False)

Remove files from the current dataset

  • Parameters

    • dataset_path (Optional[str]) – Remove files from the dataset. The path is always relative to the dataset (e.g. 'folder/file.bin')

    • recursive (bool) – If True, match all wildcard files recursively

    • verbose (bool) – If True, print to console the files removed

  • Return type

    int

  • Returns

    Number of files removed


Dataset.squash#

classmethod squash(dataset_name, dataset_ids=None, dataset_project_name_pairs=None, output_url=None)

Generate a new dataset from the squashed set of dataset versions. If a single version is given, it will be squashed to the root (i.e. create a single standalone version). If a set of versions is given, their diffs will be squashed into a single version.

  • Parameters

    • dataset_name (str) – Target name for the newly generated squashed dataset

    • dataset_ids (Optional[Sequence[Union[str, Dataset]]]) – List of dataset IDs (or objects) to squash. Notice order does matter. The versions are merged from first to last.

    • dataset_project_name_pairs (Optional[Sequence[(str, str)]]) – List of pairs (project_name, dataset_name) to squash. Notice order does matter. The versions are merged from first to last.

    • output_url (Optional[str]) – Target storage for the compressed dataset (default: file server). Examples: s3://bucket/data, gs://bucket/data, azure://bucket/data, /mnt/share/data

  • Return type

    Dataset

  • Returns

    Newly created dataset object.
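
A squash sketch (the target name and version IDs are placeholders):

from clearml import Dataset

# merge two dataset versions, in order, into a single standalone version
squashed = Dataset.squash(
    dataset_name="my_dataset_squashed",
    dataset_ids=["<older_version_id>", "<newer_version_id>"],
)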


sync_folder#

sync_folder(local_path, dataset_path=None, verbose=False)

Synchronize the dataset with a local folder. The dataset is synchronized from the relative_base_folder (default: dataset root) and deeper with the specified local path.
  • Parameters

    • local_path (Union[Path, _Path, str]) – Local folder to sync (assumes all files and recursive)

    • dataset_path (Union[Path, _Path, str]) – Target dataset path to sync with (default: the root of the dataset)

    • verbose (bool) – If True, print to console the files added/modified/removed

  • Return type

    (int, int)

  • Returns

    Number of files removed, number of files modified/added
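
A sync sketch (the local folder is illustrative):

# "dataset" is a Dataset instance; mirror a local folder into the dataset root and report what changed
removed, modified_or_added = dataset.sync_folder(local_path="/data/raw/images", verbose=True)
print(removed, "files removed,", modified_or_added, "files modified/added")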


upload#

upload(show_progress=True, verbose=False, output_url=None, compression=None, chunk_size=None)

Start uploading the files; the function returns when all files have been uploaded.

  • Parameters

    • show_progress (bool) – If True, show the upload progress bar

    • verbose (bool) – If True, print a verbose progress report

    • output_url (Optional[str]) – Target storage for the compressed dataset (default: file server). Examples: s3://bucket/data, gs://bucket/data, azure://bucket/data, /mnt/share/data

    • compression (Optional[str]) – Compression algorithm for the zipped dataset file (default: ZIP_DEFLATED)

    • chunk_size (int) – Artifact chunk size (MB) for the compressed dataset. If not provided (None), use the default chunk size (512 MB). If -1 is provided, use a single zip artifact for the entire dataset change-set (old behaviour)

  • Return type

    ()
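
A typical close-out sequence, uploading to a non-default storage target (the bucket URL is illustrative):

# "dataset" is a Dataset instance; push pending files to object storage, then finalize the version
dataset.upload(show_progress=True, output_url="s3://my-bucket/datasets")
dataset.finalize()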


verify_dataset_hash#

verify_dataset_hash(local_copy_path=None, skip_hash=False, verbose=False)

Verify the current copy of the dataset against the stored hash

  • Parameters

    • local_copy_path (Optional[str]) – Specify a local path containing a copy of the dataset. If not provided, use the cached folder

    • skip_hash (bool) – If True, skip hash checks and verify file size only

    • verbose (bool) – If True, print errors while verifying the dataset file hashes

  • Return type

    List[str]

  • Returns

    List of files with unmatched hashes
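
A verification sketch against the cached local copy:

# "dataset" is a Dataset instance; an empty list means every file matched its stored hash
bad_files = dataset.verify_dataset_hash(skip_hash=False, verbose=True)
if bad_files:
    print("files with unmatched hashes:", bad_files)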