Dataset
class Dataset()
Do not use directly! Use Dataset.create(…) or Dataset.get(…) instead.
add_external_files
add_external_files(source_url, wildcard=None, dataset_path=None, recursive=True, verbose=False, max_workers=None)
Adds external files or folders to the current dataset. External file links can be from cloud storage (s3://, gs://, azure://), local / network storage (file://) or http(s):// files. Calculates the file size for each file and compares it against the parent dataset.
A few examples:
Add file.jpg to the dataset. When retrieving a copy of the entire dataset (see Dataset.get_local_copy()), this file will be located in "./my_dataset/new_folder/file.jpg": add_external_files(source_url="s3://my_bucket/stuff/file.jpg", dataset_path="/my_dataset/new_folder/")
Add all jpg files located in the s3 bucket "my_bucket" to the dataset: add_external_files(source_url="s3://my/bucket/", wildcard="*.jpg", dataset_path="/my_dataset/new_folder/")
Add the entire content of "remote_folder" to the dataset: add_external_files(source_url="s3://bucket/remote_folder/", dataset_path="/my_dataset/new_folder/")
Add the local file "/folder/local_file.jpg" to the dataset: add_external_files(source_url="file:///folder/local_file.jpg", dataset_path="/my_dataset/new_folder/")
Parameters
source_url (Union[str, Sequence[str]]) – Source URL link (e.g. s3://bucket/folder/path) or list/tuple of links to add to the dataset (e.g. [s3://bucket/folder/file.csv, http://web.com/file.txt])
wildcard (Union[str, Sequence[str], None]) – Add only a specific set of files. Wildcard matching; can be a single string or a list of wildcards.
dataset_path (Union[str, Sequence[str], None]) – The location in the dataset where the file will be downloaded into, or a list/tuple of locations (if a list/tuple, it must be the same length as source_url). E.g. for source_url='s3://bucket/remote_folder/image.jpg' and dataset_path='s3_files', 'image.jpg' will be downloaded to 's3_files/image.jpg' (relative path to the dataset). For source_url=['s3://bucket/remote_folder/image.jpg', 's3://bucket/remote_folder/image2.jpg'] and dataset_path=['s3_files', 's3_files_2'], 'image.jpg' will be downloaded to 's3_files/image.jpg' and 'image2.jpg' will be downloaded to 's3_files_2/image2.jpg' (relative path to the dataset).
recursive (bool) – If True, match all wildcard files recursively
verbose (bool) – If True, print to console files added/modified
max_workers (Optional[int]) – The number of threads used to add the external files. Useful when source_url is a sequence. Defaults to the number of logical cores
Return type
int
Returns
Number of file links added
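For instance, a minimal sketch of registering external links might look like this (the bucket, paths, and project/dataset names are placeholders):
from clearml import Dataset

ds = Dataset.create(dataset_name="my_dataset", dataset_project="examples")

# Register every .jpg under the bucket prefix; files are linked, not copied
num_links = ds.add_external_files(
    source_url="s3://my_bucket/stuff/",
    wildcard="*.jpg",
    dataset_path="/my_dataset/new_folder/",
)
print(num_links, "file links added")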
add_files
add_files(path, wildcard=None, local_base_folder=None, dataset_path=None, recursive=True, verbose=False, max_workers=None)
Add a folder (or file) to the current dataset. Calculate the file hash for each file, compare it against the parent dataset, and mark files to be uploaded.
Parameters
path (Union[str, Path, _Path]) – Add a folder/file to the dataset
wildcard (Optional[Union[str, Sequence[str]]]) – Add only a specific set of files. Wildcard matching; can be a single string or a list of wildcards.
local_base_folder (Optional[str]) – Files will be located based on their relative path from local_base_folder
dataset_path (Optional[str]) – Where in the dataset the folder/files should be located
recursive (bool) – If True, match all wildcard files recursively
verbose (bool) – If True, print to console files added/modified
max_workers (Optional[int]) – The number of threads used to add the files. Defaults to the number of logical cores
Return type
()
Returns
number of files added
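A typical create/add/upload/finalize flow, sketched with placeholder names and paths:
from clearml import Dataset

ds = Dataset.create(dataset_name="images-v1", dataset_project="examples")

# Stage all PNG files under ./data; paths inside the dataset are taken
# relative to local_base_folder
ds.add_files(
    path="./data",
    wildcard="*.png",
    local_base_folder="./data",
    recursive=True,
)

ds.upload()    # push the staged files (see upload() below)
ds.finalize()  # lock this version; no further changes are allowed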
add_tags
add_tags(tags)
Add Tags to this dataset. Old tags are not deleted. When executing a Task (experiment) remotely, this method has no effect.
Parameters
tags (Union[Sequence[str], str]) – A list of tags which describe the Task to add.
Return type
None
Dataset.create
classmethod create(dataset_name=None, dataset_project=None, dataset_tags=None, parent_datasets=None, use_current_task=False, dataset_version=None, output_uri=None, description=None)
Create a new dataset. Multiple dataset parents are supported. Merging of parent datasets is done based on the order, where each one can override overlapping files in the previous parent
Parameters
dataset_name (Optional[str]) – Name of the new dataset
dataset_project (Optional[str]) – Project containing the dataset. If not specified, infer the project name from the parent datasets
dataset_tags (Optional[Sequence[str]]) – Optional, list of tags (strings) to attach to the newly created Dataset
parent_datasets (Optional[Sequence[Union[str, Dataset]]]) – Expand a parent dataset by adding/removing files
use_current_task (bool) – False (default), a new Dataset task is created. If True, the dataset is created on the current Task.
dataset_version (Optional[str]) – Version of the new dataset. If not set, try to find the latest version of the dataset with the given dataset_name and dataset_project, and auto-increment it.
output_uri (Optional[str]) – Location to upload the dataset files to, including preview samples. The following are examples of output_uri values for the supported locations:
A shared folder: /mnt/share/folder
S3: s3://bucket/folder
Google Cloud Storage: gs://bucket-name/folder
Azure Storage: azure://company.blob.core.windows.net/folder/
Default file server: None
description (Optional[str]) – Description of the dataset
Return type
Dataset
Returns
Newly created Dataset object
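A sketch of creating a child version from an existing parent (the ID, names, and bucket below are placeholders):
from clearml import Dataset

child = Dataset.create(
    dataset_name="my_dataset",
    dataset_project="examples",
    parent_datasets=["<parent_dataset_id>"],   # later parents override earlier ones
    dataset_version="2.0.0",                   # omit to auto-increment
    output_uri="s3://bucket/folder",           # where uploaded files are stored
    description="second iteration of the dataset",
)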
Dataset.delete
classmethod delete(dataset_id=None, dataset_project=None, dataset_name=None, force=False, dataset_version=None, entire_dataset=False, shallow_search=False, delete_files=True, delete_external_files=False)
Delete the dataset(s). If multiple datasets match the parameters, an exception is raised, unless entire_dataset is True and force is True, in which case the entire matching set is deleted.
Parameters
dataset_id – The ID of the dataset(s) to be deleted
dataset_project – The project the dataset(s) to be deleted belong(s) to
dataset_name – The name of the dataset(s) to be deleted
force – If True, delete the dataset(s) even when in use. Must also be set to True when entire_dataset is set.
dataset_version – The version of the dataset(s) to be deleted
entire_dataset – If True, delete all datasets that match the given dataset_project, dataset_name, dataset_version. Note that force has to be True if this parameter is True
shallow_search – If True, search only the first 500 results (first page)
delete_files – Delete all local files in the dataset (from the ClearML file server), as well as all artifacts related to the dataset.
delete_external_files – Delete all external files in the dataset (from their external storage)
Return type
()
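For example, deleting one specific version (project, name, and version are placeholders):
from clearml import Dataset

Dataset.delete(
    dataset_project="examples",
    dataset_name="my_dataset",
    dataset_version="1.0.0",
    delete_files=True,            # also remove files from the ClearML file server
    delete_external_files=False,  # leave externally stored files untouched
)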
file_entries_dict
property file_entries_dict
Notice this call returns an internal representation, do not modify!
Return type
Mapping[str, FileEntry]
Returns
dict with relative file path as key, and FileEntry as value
finalize
finalize(verbose=False, raise_on_error=True, auto_upload=False)
Finalize the dataset and publish the dataset Task. upload() must first be called to verify that there are no pending uploads. If files still need to be uploaded, this method raises an exception (or returns False).
Parameters
verbose (bool) – If True, print verbose progress report
raise_on_error (bool) – If True, raise exception if dataset finalizing failed
auto_upload (bool) – Automatically upload dataset if not called yet, will upload to default location.
Return type
bool
Dataset.get
classmethod get(dataset_id=None, dataset_project=None, dataset_name=None, dataset_tags=None, only_completed=False, only_published=False, include_archived=False, auto_create=False, writable_copy=False, dataset_version=None, alias=None, overridable=False, shallow_search=False, **kwargs)
Get a specific Dataset. If multiple datasets are found, the dataset with the highest semantic version is returned. If no semantic version is found, the most recently updated dataset is returned. This function raises an exception if no dataset can be found and the auto_create=True flag is not set.
Parameters
dataset_id (Optional[str]) – Requested dataset ID
dataset_project (Optional[str]) – Requested dataset project name
dataset_name (Optional[str]) – Requested dataset name
dataset_tags (Optional[Sequence[str]]) – Requested dataset tags (list of tag strings)
only_completed (bool) – Return only if the requested dataset is completed or published
only_published (bool) – Return only if the requested dataset is published
include_archived (bool) – Include archived tasks and datasets as well
auto_create (bool) – Create a new dataset if it does not exist yet
writable_copy (bool) – Get a newly created mutable dataset with the current one as its parent, so new files can be added to the instance.
dataset_version (Optional[str]) – Requested version of the Dataset
alias (Optional[str]) – Alias of the dataset. If set, the 'alias : dataset ID' key-value pair will be set under the hyperparameters section 'Datasets'
overridable (bool) – If True, allow overriding the dataset ID with a given alias in the hyperparameters section. Useful when one wants to change the dataset used when running a task remotely. If the alias parameter is not set, this parameter has no effect
shallow_search (bool) – If True, search only the first 500 results (first page)
Return type
Dataset
Returns
Dataset object
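Two common retrieval patterns, sketched with placeholder names:
from clearml import Dataset

# Latest completed version; alias= records the resolved ID under the
# task's 'Datasets' hyperparameters section
ds = Dataset.get(
    dataset_project="examples",
    dataset_name="my_dataset",
    only_completed=True,
    alias="training_data",
)

# A mutable child version based on the latest one, ready for new files
new_version = Dataset.get(
    dataset_project="examples",
    dataset_name="my_dataset",
    writable_copy=True,
)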
get_default_storage
get_default_storage()
Return the default storage location of the dataset
Return type
Optional[str]
Returns
URL for the default storage location
get_dependency_graph
get_dependency_graph()
Return the DAG of the dataset dependencies (all previous dataset versions and their parents).
Example:
{
'current_dataset_id': ['parent_1_id', 'parent_2_id'],
'parent_2_id': ['parent_1_id'],
'parent_1_id': [],
}
Returns
dict representing the genealogy DAG of the current dataset
get_local_copy
get_local_copy(use_soft_links=None, part=None, num_parts=None, raise_on_error=True, max_workers=None)
Return a base folder with a read-only (immutable) local copy of the entire dataset; download and copy / soft-link files from all the parent dataset versions. The dataset must be finalized.
Parameters
use_soft_links (Optional[bool]) – If True, use soft links. Default: False on Windows, True on POSIX systems
part (Optional[int]) – Optional, if provided, only download the selected part (index) of the Dataset. The first part number is 0 and the last part is num_parts-1. Notice, if num_parts is not provided, the number of parts will be equal to the total number of chunks (i.e. sum over all chunks from the specified Dataset including all parent Datasets). This argument is passed to parent datasets, as well as the implicit num_parts, allowing users to get a partial copy of the entire dataset, for multi node/step processing.
num_parts (Optional[int]) – Optional, if specified, normalize the number of chunks stored to the requested number of parts. Notice that the actual chunks used per part are rounded down. Example: Assuming a total of 8 chunks for this dataset (including parent datasets), and num_parts=5, the chunk index used per part would be: part=0 -> chunks[0,5], part=1 -> chunks[1,6], part=2 -> chunks[2,7], part=3 -> chunks[3,]
raise_on_error (bool) – If True, raise exception if dataset merging failed on any file
max_workers (Optional[int]) – Number of threads to be spawned when getting the dataset copy. Defaults to the number of logical cores.
Return type
str
Returns
A base folder for the entire dataset
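For multi-node processing, each worker can fetch only its share of the chunks. A sketch with a hypothetical two-worker split (names are placeholders):
from clearml import Dataset

ds = Dataset.get(dataset_project="examples", dataset_name="my_dataset")

# Worker 0 of 2 downloads roughly half of the chunks (dataset must be finalized)
folder = ds.get_local_copy(part=0, num_parts=2)
print("local copy at:", folder)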
get_logger
get_logger()
Return a Logger object for the Dataset, allowing users to report statistics metrics and debug samples on the Dataset itself
Return type
Logger
Returns
Logger object
get_metadata
get_metadata(metadata_name='metadata')
Get attached metadata back in its original format. Will return None if none was found.
Return type
Optional[Union[numpy.array, pd.DataFrame, dict, str, bool]]
Parameters
metadata_name (str ) –
get_mutable_local_copy
get_mutable_local_copy(target_folder, overwrite=False, part=None, num_parts=None, raise_on_error=True, max_workers=None)
Return a base folder with a writable (mutable) local copy of the entire dataset; download and copy / soft-link files from all the parent dataset versions.
Parameters
target_folder (Union[Path, _Path, str]) – Target folder for the writable copy
overwrite (bool) – If True, recursively delete the target folder before creating a copy. If False (default) and the target folder contains files, raise an exception or return None
part (Optional[int]) – Optional, if provided, only download the selected part (index) of the Dataset. The first part number is 0 and the last part is num_parts-1. Notice, if num_parts is not provided, the number of parts will be equal to the total number of chunks (i.e. sum over all chunks from the specified Dataset including all parent Datasets). This argument is passed to parent datasets, as well as the implicit num_parts, allowing users to get a partial copy of the entire dataset, for multi node/step processing.
num_parts (Optional[int]) – Optional, if specified, normalize the number of chunks stored to the requested number of parts. Notice that the actual chunks used per part are rounded down. Example: Assuming a total of 8 chunks for this dataset (including parent datasets), and num_parts=5, the chunk index used per part would be: part=0 -> chunks[0,5], part=1 -> chunks[1,6], part=2 -> chunks[2,7], part=3 -> chunks[3,]
raise_on_error (bool) – If True, raise exception if dataset merging failed on any file
max_workers (Optional[int]) – Number of threads to be spawned when getting the dataset copy. Defaults to the number of logical cores.
Return type
Optional[str]
Returns
The target folder containing the entire dataset
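A short sketch (the target path and names are placeholders); unlike get_local_copy(), the returned folder is private and safe to modify:
from clearml import Dataset

ds = Dataset.get(dataset_project="examples", dataset_name="my_dataset")

folder = ds.get_mutable_local_copy(target_folder="./work_copy", overwrite=True)
# edit files under ./work_copy freely; the dataset itself is unchanged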
get_num_chunks
get_num_chunks(include_parents=True)
Return the number of chunks stored on this dataset (this does not imply anything about the number of chunks stored by parent versions).
Parameters
include_parents (bool) – If True (default), return the total number of chunks from this version and all parent versions. If False, only return the number of chunks stored on this specific version.
Return type
int
Returns
Number of chunks stored on the dataset.
get_offline_mode_folder
get_offline_mode_folder()
Return the folder where all the dataset data is stored in the offline session.
Return type
Optional[Path]
Returns
Path object, local folder
Dataset.import_offline_session
classmethod import_offline_session(session_folder_zip, upload=True, finalize=False)
Import an offline session of a dataset. This includes repository details, installed packages, artifacts, logs, metrics, and debug samples.
Parameters
session_folder_zip (str) – Path to a folder containing the session, or a zip-file of the session folder.
upload (bool) – If True, upload the dataset's data
finalize (bool) – If True, finalize the dataset
Return type
str
Returns
The ID of the imported dataset
is_dirty
is_dirty()
Return True if the dataset has pending uploads (i.e. we cannot finalize it)
Return type
bool
Returns
True means the dataset has pending uploads; call upload() to start the upload process.
is_final
is_final()
Return True if the dataset was finalized and cannot be changed any more.
Return type
bool
Returns
True if the dataset is final
Dataset.is_offline
classmethod is_offline()
Return the offline-mode state. If in offline mode, no communication to the backend is enabled.
Return type
bool
Returns
boolean offline-mode state
link_entries_dict
property link_entries_dict
Notice this call returns an internal representation, do not modify!
Return type
Mapping[str, LinkEntry]
Returns
dict with relative file path as key, and LinkEntry as value
list_added_files
list_added_files(dataset_id=None)
Return a list of files added when comparing to a specific dataset_id.
Parameters
dataset_id (Optional[str]) – Dataset ID (str) to compare against. If None is given, compare against the parent datasets
Return type
List[str]
Returns
List of files with relative path (files might not be available locally until get_local_copy() is called)
Dataset.list_datasets
classmethod list_datasets(dataset_project=None, partial_name=None, tags=None, ids=None, only_completed=True, recursive_project_search=True, include_archived=True)
Query the list of datasets in the system.
Parameters
dataset_project (Optional[str]) – Specify dataset project name
partial_name (Optional[str]) – Specify a partial match to a dataset name. This method supports regular expressions for name matching (if you wish to match special characters and avoid any regex behaviour, use re.escape())
tags (Optional[Sequence[str]]) – Specify user tags
ids (Optional[Sequence[str]]) – List specific datasets based on an IDs list
only_completed (bool) – If False, return datasets that are still in progress (uploading/being edited, etc.)
recursive_project_search (bool) – If True and the dataset_project argument is set, search inside subprojects as well. If False, don't search inside subprojects (except for the special .datasets subproject)
include_archived (bool) – If True, include archived datasets as well.
Return type
List[dict]
Returns
List of dictionaries with dataset information. Example: [{'name': name, 'project': project name, 'id': dataset_id, 'created': date_created},]
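For example, printing matching datasets (the project name and tag are placeholders; partial_name is treated as a regular expression):
from clearml import Dataset

for info in Dataset.list_datasets(
    dataset_project="examples",
    partial_name="images",
    tags=["production"],
    only_completed=True,
):
    print(info["id"], info["name"], info["created"])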
list_files
list_files(dataset_path=None, recursive=True, dataset_id=None)
Return a list of files in the current dataset. If dataset_id is provided, return a list of files that have remained unchanged since the specified dataset_id.
Parameters
dataset_path (Optional[str]) – Only match files matching the dataset_path (including wildcards). Example: 'folder/sub/*.json'
recursive (bool) – If True (default), match dataset_path recursively
dataset_id (Optional[str]) – Filter list based on the dataset ID containing the latest version of the file. Default: None, do not filter files based on parent dataset.
Return type
List[str]
Returns
List of files with relative path (files might not be available locally until get_local_copy() is called)
list_modified_files
list_modified_files(dataset_id=None)
Return a list of files modified when comparing to a specific dataset_id.
Parameters
dataset_id (Optional[str]) – Dataset ID (str) to compare against. If None is given, compare against the parent datasets
Return type
List[str]
Returns
List of files with relative path (files might not be available locally until get_local_copy() is called)
list_removed_files
list_removed_files(dataset_id=None)
Return a list of files removed when comparing to a specific dataset_id.
Parameters
dataset_id (Optional[str]) – Dataset ID (str) to compare against. If None is given, compare against the parent datasets
Return type
List[str]
Returns
List of files with relative path (files might not be available locally until get_local_copy() is called)
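The three list_*_files methods together give a diff against the parent version(s). A brief sketch (names are placeholders):
from clearml import Dataset

ds = Dataset.get(dataset_project="examples", dataset_name="my_dataset")

# With dataset_id=None every call compares against the parent dataset(s);
# pass an explicit dataset ID to diff against any other version
print("added:   ", ds.list_added_files())
print("modified:", ds.list_modified_files())
print("removed: ", ds.list_removed_files())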
Dataset.move_to_project
classmethod move_to_project(new_dataset_project, dataset_project, dataset_name)
Move the dataset to another project.
Parameters
new_dataset_project – New project to move the dataset(s) to
dataset_project – Project of the dataset(s) to move to new project
dataset_name – Name of the dataset(s) to move to new project
Return type
()
publish
publish(raise_on_error=True)
Publish the dataset. If the dataset is not finalized, raise an exception.
Parameters
raise_on_error (bool) – If True, raise exception if dataset publishing failed
Return type
bool
remove_files
remove_files(dataset_path=None, recursive=True, verbose=False)
Remove files from the current dataset
Parameters
dataset_path (Optional[str]) – Remove files from the dataset. The path is always relative to the dataset (e.g. 'folder/file.bin'). External files can also be removed by their links (e.g. 's3://bucket/file')
recursive (bool) – If True, match all wildcard files recursively
verbose (bool) – If True, print to console files removed
Return type
int
Returns
Number of files removed
Dataset.rename
classmethod rename(new_dataset_name, dataset_project, dataset_name)
Rename the dataset.
Parameters
new_dataset_name – The new name of the datasets to be renamed
dataset_project – The project the datasets to be renamed belongs to
dataset_name – The name of the datasets (before renaming)
Return type
()
set_description
set_description(description)
Set description of the dataset
Parameters
description (str) – Description to be set
Return type
()
set_metadata
set_metadata(metadata, metadata_name='metadata', ui_visible=True)
Attach user-defined metadata to the dataset. Check Task.upload_artifact for supported types. If the type is a pandas DataFrame, it can optionally be made visible as a table in the UI.
Return type
()
Parameters
metadata (Union[numpy.array, pd.DataFrame, Dict[str, Any]]) –
metadata_name (str) –
ui_visible (bool) –
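A sketch of attaching and reading back metadata. The DataFrame content and key name are made up for illustration, and a writable copy is requested on the assumption that metadata is attached to a mutable version:
import pandas as pd
from clearml import Dataset

ds = Dataset.get(
    dataset_project="examples", dataset_name="my_dataset", writable_copy=True
)

stats = pd.DataFrame({"split": ["train", "val"], "count": [8000, 2000]})
ds.set_metadata(stats, metadata_name="split_stats")  # shown as a UI table

restored = ds.get_metadata(metadata_name="split_stats")  # None if missing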
Dataset.set_offline
classmethod set_offline(offline_mode=False)
Set offline mode, where all data and logs are stored in a local folder, for later transmission.
Parameters
offline_mode (bool) – If True, offline-mode is turned on, and no communication to the backend is enabled.
Return type
None
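A sketch of the offline round trip, assuming work starts on a machine without connectivity and finishes on one with access to the backend (paths and names are placeholders):
from clearml import Dataset

# On the disconnected machine
Dataset.set_offline(offline_mode=True)
ds = Dataset.create(dataset_name="offline_ds", dataset_project="examples")
ds.add_files(path="./data")
ds.upload()
ds.finalize()
session_folder = ds.get_offline_mode_folder()  # hand this folder (or a zip of it) over

# Later, on a connected machine
Dataset.set_offline(offline_mode=False)
dataset_id = Dataset.import_offline_session(
    session_folder_zip=str(session_folder), upload=True, finalize=True
)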
Dataset.squash
classmethod squash(dataset_name, dataset_ids=None, dataset_project_name_pairs=None, output_url=None)
Generate a new dataset from the squashed set of dataset versions. If a single version is given, it will be squashed to the root (i.e. create a single standalone version). If a set of versions is given, the versions diff will be squashed into a single version.
Parameters
dataset_name (str) – Target name for the newly generated squashed dataset
dataset_ids (Optional[Sequence[Union[str, Dataset]]]) – List of dataset IDs (or objects) to squash. Notice order does matter. The versions are merged from first to last.
dataset_project_name_pairs (Optional[Sequence[(str, str)]]) – List of pairs (project_name, dataset_name) to squash. Notice order does matter. The versions are merged from first to last.
output_url (Optional[str]) – Target storage for the compressed dataset (default: file server). Examples: s3://bucket/data, gs://bucket/data, azure://bucket/data, /mnt/share/data
Return type
“Dataset”
Returns
Newly created dataset object.
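For instance, collapsing two hypothetical version IDs into one standalone dataset:
from clearml import Dataset

squashed = Dataset.squash(
    dataset_name="my_dataset_squashed",
    dataset_ids=["<version_1_id>", "<version_2_id>"],  # merged first to last
    output_url="s3://bucket/data",  # optional; defaults to the file server
)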
sync_folder
sync_folder(local_path, dataset_path=None, verbose=False)
Synchronize the dataset with a local folder. The dataset is synchronized from the relative_base_folder (default: dataset root) and deeper with the specified local path. Note that if a remote file is identified as being modified when syncing, it will be added as a FileEntry, ready to be uploaded to the ClearML server. This version of the file is considered “newer” and it will be downloaded instead of the one stored at its remote address when calling Dataset.get_local_copy().
Parameters
local_path (Union[Path, _Path, str]) – Local folder to sync (assumes all files and recursive)
dataset_path (Union[Path, _Path, str]) – Target dataset path to sync with (default: the root of the dataset)
verbose (bool) – If True, print to console files added/modified/removed
Return type
(int, int, int)
Returns
Number of files removed, number of files modified, number of files added
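A sketch of keeping a dataset version aligned with a local working folder (names and path are placeholders; a writable copy is requested since syncing modifies the version):
from clearml import Dataset

ds = Dataset.get(
    dataset_project="examples", dataset_name="my_dataset", writable_copy=True
)

ds.sync_folder(local_path="./data", verbose=True)  # add/modify/remove to match ./data
ds.upload()
ds.finalize()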
update_changed_files
update_changed_files(num_files_added=None, num_files_modified=None, num_files_removed=None)
Update the internal state keeping track of added, modified and removed files.
Parameters
num_files_added – Number of files added when compared to the parent dataset
num_files_modified – Number of files with the same name but a different hash when compared to the parent dataset
num_files_removed – Number of files removed when compared to the parent dataset
upload
upload(show_progress=True, verbose=False, output_url=None, compression=None, chunk_size=None, max_workers=None, retries=3)
Start file uploading, the function returns when all files are uploaded.
Parameters
show_progress (bool) – If True, show upload progress bar
verbose (bool) – If True, print verbose progress report
output_url (Optional[str]) – Target storage for the compressed dataset (default: file server). Examples: s3://bucket/data, gs://bucket/data, azure://bucket/data, /mnt/share/data
compression (Optional[str]) – Compression algorithm for the zipped dataset file (default: ZIP_DEFLATED)
chunk_size (int) – Artifact chunk size (MB) for the compressed dataset. If not provided (None), use the default chunk size (512 MB). If -1 is provided, use a single zip artifact for the entire dataset change-set (old behaviour)
max_workers (Optional[int]) – Number of threads to be spawned when zipping and uploading the files. If None (default), it will be set to 1 if the upload destination is a cloud provider ('s3', 'gs', 'azure'), otherwise to the number of logical cores
retries (int) – Number of retries before failing to upload each zip. If 0, the upload is not retried.
Raises
ValueError – If the upload failed (i.e. at least one zip failed to upload)
Return type
()
verify_dataset_hash
verify_dataset_hash(local_copy_path=None, skip_hash=False, verbose=False)
Verify the current copy of the dataset against the stored hash
Parameters
local_copy_path (Optional[str]) – Specify a local path containing a copy of the dataset. If not provided, use the cached folder
skip_hash (bool) – If True, skip hash checks and verify file size only
verbose (bool) – If True, print errors while testing dataset files hash
Return type
List[str]
Returns
List of files with unmatched hashes