Dataset

class Dataset()

Do not use directly! Use Dataset.create(…) or Dataset.get(…) instead.


add_external_files

add_external_files(source_url, wildcard=None, dataset_path=None, recursive=True, verbose=False, max_workers=None)

Adds external files or folders to the current dataset. External file links can be from cloud storage (s3://, gs://, azure://), local / network storage (file://), or http(s):// files. Calculates the file size for each file and compares it against the parent.

A few examples:

  • Add file.jpg to the dataset. When retrieving a copy of the entire dataset (see dataset.get_local_copy()), this file will be located in "./my_dataset/new_folder/file.jpg". add_external_files(source_url="s3://my_bucket/stuff/file.jpg", dataset_path="/my_dataset/new_folder/")

  • Add all jpg files located in the s3 bucket called "my_bucket" to the dataset. add_external_files(source_url="s3://my_bucket/", wildcard="*.jpg", dataset_path="/my_dataset/new_folder/")

  • Add the entire content of "remote_folder" to the dataset. add_external_files(source_url="s3://bucket/remote_folder/", dataset_path="/my_dataset/new_folder/")

  • Add the local file "/folder/local_file.jpg" to the dataset. add_external_files(source_url="file:///folder/local_file.jpg", dataset_path="/my_dataset/new_folder/")

  • Parameters

    • source_url (Union[str, Sequence[str]]) – Source url link (e.g. s3://bucket/folder/path) or list/tuple of links to add to the dataset (e.g. [s3://bucket/folder/file.csv, http://web.com/file.txt])

    • wildcard (Union[str, Sequence[str], None]) – Add only a specific set of files. Wildcard matching; can be a single string or a list of wildcards.

    • dataset_path (Optional[str]) – The location in the dataset where the file will be downloaded to. For example, for source_url='s3://bucket/remote_folder/image.jpg' and dataset_path='s3_files', 'image.jpg' will be downloaded to 's3_files/image.jpg' (relative path to the dataset)

    • recursive (bool) – If True, match all wildcard files recursively

    • verbose (bool) – If True, print to console the files added/modified

    • max_workers (Optional[int]) – The number of threads to add the external files with. Useful when source_url is a sequence. Defaults to the number of logical cores

  • Return type

    int

  • Returns

    Number of file links added
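
The dataset_path example in the parameter list can be sketched in plain Python. This is only an illustration of the documented mapping for a single-file source_url, not ClearML's implementation; the helper name `link_destination` is hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def link_destination(source_url: str, dataset_path: str = "") -> str:
    """Return the dataset-relative path an external file link would occupy.

    Hypothetical helper: it mirrors the documented example only and is not
    part of ClearML.
    """
    # The last component of the URL path is the file name.
    file_name = posixpath.basename(urlparse(source_url).path)
    # dataset_path is relative to the dataset root, so drop surrounding slashes.
    folder = dataset_path.strip("/")
    return posixpath.join(folder, file_name) if folder else file_name


print(link_destination("s3://bucket/remote_folder/image.jpg", "s3_files"))
```

Folder sources and wildcards are deliberately left out of this sketch, since the text above does not spell out how sub-folder structure is preserved for them.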


add_files

add_files(path, wildcard=None, local_base_folder=None, dataset_path=None, recursive=True, verbose=False, max_workers=None)

Add a folder into the current dataset. Calculates the hash of each file, compares it against the parent, and marks files to be uploaded.

  • Parameters

    • path (Union[str, Path, _Path]) – The folder/file to add to the dataset

    • wildcard (Optional[Union[str, Sequence[str]]]) – Add only a specific set of files. Wildcard matching; can be a single string or a list of wildcards.

    • local_base_folder (Optional[str]) – Files will be located based on their relative path from local_base_folder

    • dataset_path (Optional[str]) – Where in the dataset the folder/files should be located

    • recursive (bool) – If True, match all wildcard files recursively

    • verbose (bool) – If True, print to console the files added/modified

    • max_workers (Optional[int]) – The number of threads to add the files with. Defaults to the number of logical cores

  • Return type

    int

  • Returns

    number of files added
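
The "hash, compare against parent, mark for upload" idea can be illustrated with a small standalone sketch. It assumes SHA-256 and a plain dict of parent hashes keyed by relative path; both are assumptions made for illustration, and `diff_against_parent` is a hypothetical helper rather than ClearML code.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def file_hash(path: Path) -> str:
    """Hash the file content (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def diff_against_parent(
    folder: str, parent_hashes: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (added, modified) relative paths that would need uploading."""
    root = Path(folder)
    added, modified = [], []
    for file_path in sorted(root.rglob("*")):  # recursive walk
        if not file_path.is_file():
            continue
        relative = file_path.relative_to(root).as_posix()
        digest = file_hash(file_path)
        if relative not in parent_hashes:
            added.append(relative)  # not present in the parent
        elif parent_hashes[relative] != digest:
            modified.append(relative)  # same path, different content
    return added, modified
```

Files whose hash matches the parent appear in neither list, which is the point of the comparison: unchanged files do not need to be uploaded again.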


add_tags

add_tags(tags)

Add Tags to this dataset. Old tags are not deleted. When executing a Task (experiment) remotely, this method has no effect.

  • Parameters

    tags (Union[Sequence[str], str]) – A list of tags which describe the Task to add.

  • Return type

    None


Dataset.create

classmethod create(dataset_name=None, dataset_project=None, dataset_tags=None, parent_datasets=None, use_current_task=False, dataset_version=None, output_uri=None, description=None)

Create a new dataset. Multiple dataset parents are supported. Merging of parent datasets is done based on the order, where each one can override overlapping files in the previous parent

  • Parameters

    • dataset_name (Optional[str]) – Naming the new dataset

    • dataset_project (Optional[str]) – Project containing the dataset. If not specified, infer the project name from the parent datasets

    • dataset_tags (Optional[Sequence[str]]) – Optional, list of tags (strings) to attach to the newly created Dataset

    • parent_datasets (Optional[Sequence[Union[str, Dataset]]]) – Expand a parent dataset by adding/removing files

    • use_current_task (bool) – False (default), a new Dataset task is created. If True, the dataset is created on the current Task.

    • dataset_version (Optional[str]) – Version of the new dataset. If not set, try to find the latest version of the dataset with given dataset_name and dataset_project and auto-increment it.

    • output_uri (Optional[str]) – Location to upload the datasets file to, including preview samples. The following are examples of output_uri values for the supported locations:

      * A shared folder: `/mnt/share/folder`

      * S3: `s3://bucket/folder`

      * Google Cloud Storage: `gs://bucket-name/folder`

      * Azure Storage: `azure://company.blob.core.windows.net/folder/`

      * Default file server: None

    • description (Optional[str]) – Description of the dataset

  • Return type

    Dataset

  • Returns

    Newly created Dataset object
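
The ordering rule for multiple parents (later parents override overlapping files from earlier ones) can be pictured as a sequence of dictionary updates. The sketch below is a simplified model, assuming each parent is represented as a mapping from relative path to some file identifier; `merge_parent_entries` is a hypothetical name.

```python
from typing import Dict, Sequence


def merge_parent_entries(parents: Sequence[Dict[str, str]]) -> Dict[str, str]:
    """Merge parent file listings in order; later parents win on overlap."""
    merged: Dict[str, str] = {}
    for entries in parents:
        merged.update(entries)  # overlapping paths are overridden
    return merged


parent_a = {"data/a.csv": "hash-a1", "data/b.csv": "hash-b1"}
parent_b = {"data/b.csv": "hash-b2", "data/c.csv": "hash-c2"}
print(merge_parent_entries([parent_a, parent_b]))
```

Swapping the order of the two parents would keep `hash-b1` for `data/b.csv` instead, which is why the order passed in parent_datasets matters.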


Dataset.delete

classmethod delete(dataset_id=None, dataset_project=None, dataset_name=None, force=False, dataset_version=None, entire_dataset=False, shallow_search=False)

Delete the dataset(s). If multiple datasets match the parameters, raise an Exception, or delete the entire dataset if entire_dataset is True and force is True.

  • Parameters

    • dataset_id – The ID of the dataset(s) to be deleted

    • dataset_project – The project the dataset(s) to be deleted belong(s) to

    • dataset_name – The name of the dataset(s) to be deleted

    • force – If True, delete the dataset(s) even when being used. Also required to be set to True when entire_dataset is set.

    • dataset_version – The version of the dataset(s) to be deleted

    • entire_dataset – If True, delete all datasets that match the given dataset_project, dataset_name, dataset_version. Note that force has to be True if this parameter is True

    • shallow_search – If True, search only the first 500 results (first page)

  • Return type

    ()
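
The interaction between force and entire_dataset can be summarized as a small guard function. This is a sketch of the rules as stated in this section only; it does not model the "dataset is in use" check, and `resolve_deletion_targets` is a hypothetical helper, not a ClearML function.

```python
from typing import List, Sequence


def resolve_deletion_targets(
    matching_ids: Sequence[str],
    force: bool = False,
    entire_dataset: bool = False,
) -> List[str]:
    """Return the dataset IDs that may be deleted, or raise ValueError."""
    if entire_dataset and not force:
        # Documented rule: entire_dataset=True requires force=True.
        raise ValueError("entire_dataset=True requires force=True")
    if len(matching_ids) > 1 and not entire_dataset:
        # Several matches without entire_dataset is treated as an error.
        raise ValueError("multiple datasets match; set entire_dataset and force")
    return list(matching_ids)
```

A single match passes through unchanged, while several matches are only accepted when both flags are set.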


file_entries_dict

property file_entries_dict

Notice this call returns an internal representation, do not modify!

  • Return type

    Mapping[str, FileEntry]

  • Returns

    dict with relative file path as key, and FileEntry as value


finalize

finalize(verbose=False, raise_on_error=True, auto_upload=False)

Finalize the dataset (publish the dataset Task). upload must be called first to verify that there are no pending uploads. If files still need to be uploaded, it throws an exception (or returns False).

  • Parameters

    • verbose (bool) – If True print verbose progress report

    • raise_on_error (bool) – If True raise exception if dataset finalizing failed

    • auto_upload (bool) – Automatically upload dataset if not called yet, will upload to default location.

  • Return type

    bool
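
A typical create, add, upload, finalize sequence might look like the sketch below. It uses only the methods and keyword arguments listed on this page; the helper name `build_dataset` and the choice to pass the Dataset class in as an argument (so the snippet can be exercised without a ClearML server) are assumptions made for illustration.

```python
def build_dataset(dataset_cls, folder, name, project):
    """Create a dataset version from a local folder and finalize it.

    dataset_cls is expected to behave like the Dataset class described here.
    """
    dataset = dataset_cls.create(dataset_name=name, dataset_project=project)
    dataset.add_files(path=folder)
    # finalize() requires that no uploads are pending, so upload first.
    if dataset.is_dirty():
        dataset.upload()
    dataset.finalize()
    return dataset
```

With ClearML installed and configured, this would presumably be called as `build_dataset(Dataset, "/data/images", "my dataset", "my project")` after `from clearml import Dataset`; the folder, name, and project values are placeholders.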


Dataset.get

classmethod get(dataset_id=None, dataset_project=None, dataset_name=None, dataset_tags=None, only_completed=False, only_published=False, auto_create=False, writable_copy=False, dataset_version=None, alias=None, overridable=False, shallow_search=False, **kwargs)

Get a specific Dataset. If multiple datasets are found, the dataset with the highest semantic version is returned. If no semantic version is found, the most recently updated dataset is returned. This function raises an Exception if no dataset can be found and the auto_create=True flag is not set.

  • Parameters

    • dataset_id (Optional[str]) – Requested dataset ID

    • dataset_project (Optional[str]) – Requested dataset project name

    • dataset_name (Optional[str]) – Requested dataset name

    • dataset_tags (Optional[Sequence[str]]) – Requested dataset tags (list of tag strings)

    • only_completed (bool) – Return only if the requested dataset is completed or published

    • only_published (bool) – Return only if the requested dataset is published

    • auto_create (bool) – Create a new dataset if it does not exist yet

    • writable_copy (bool) – Get a newly created mutable dataset with the current one as its parent, so new files can be added to the instance.

    • dataset_version (Optional[str]) – Requested version of the Dataset

    • alias (Optional[str]) – Alias of the dataset. If set, the ‘alias : dataset ID’ key-value pair will be set under the hyperparameters section ‘Datasets’

    • overridable (bool) – If True, allow overriding the dataset ID with a given alias in the hyperparameters section. Useful when one wants to change the dataset used when running a task remotely. If the alias parameter is not set, this parameter has no effect

    • shallow_search (bool) – If True, search only the first 500 results (first page)

  • Return type

    Dataset

  • Returns

    Dataset object
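
The selection rule described for get (highest semantic version first, otherwise most recently updated) can be sketched as follows. The candidate dictionaries with `version` and `last_update` keys are hypothetical, and the version parser only accepts plain `major.minor.patch` strings; the library's actual parsing may be more permissive.

```python
import re
from typing import Optional, Sequence, Tuple

_SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_semver(version: Optional[str]) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch) or None if the string is not semantic."""
    match = _SEMVER.match(version or "")
    return tuple(int(part) for part in match.groups()) if match else None


def pick_dataset(candidates: Sequence[dict]) -> dict:
    """Pick one candidate following the documented precedence."""
    if not candidates:
        raise ValueError("no dataset found")
    versioned = [c for c in candidates if parse_semver(c.get("version"))]
    if versioned:
        # Compare versions numerically, so 1.10.0 ranks above 1.2.0.
        return max(versioned, key=lambda c: parse_semver(c["version"]))
    # No semantic versions at all: fall back to the most recent update.
    return max(candidates, key=lambda c: c["last_update"])
```

Comparing parsed tuples rather than raw strings avoids the common pitfall where "1.10.0" sorts below "1.2.0" lexicographically.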


get_default_storage

get_default_storage()

Return the default storage location of the dataset

  • Return type

    Optional[str]

  • Returns

    URL for the default storage location


get_dependency_graph

get_dependency_graph()

Return the DAG of the dataset dependencies (all previous dataset versions and their parents).

Example:

{
    'current_dataset_id': ['parent_1_id', 'parent_2_id'],
    'parent_2_id': ['parent_1_id'],
    'parent_1_id': [],
}

  • Returns

    dict representing the genealogy dag graph of the current dataset
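
Because the returned dictionary maps each dataset ID to its parents, it can be fed directly to a topological sort to list ancestors before descendants. The sketch below uses the standard-library `graphlib` module (Python 3.9+) on the example graph from this section; `ancestors_first` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter
from typing import Dict, List


def ancestors_first(dependency_graph: Dict[str, List[str]]) -> List[str]:
    """Order dataset IDs so that every parent precedes its children."""
    # TopologicalSorter expects {node: predecessors}, which matches the
    # {dataset_id: parent_ids} layout of the dependency graph.
    return list(TopologicalSorter(dependency_graph).static_order())


graph = {
    "current_dataset_id": ["parent_1_id", "parent_2_id"],
    "parent_2_id": ["parent_1_id"],
    "parent_1_id": [],
}
print(ancestors_first(graph))
# → ['parent_1_id', 'parent_2_id', 'current_dataset_id']
```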


get_local_copy

get_local_copy(use_soft_links=None, part=None, num_parts=None, raise_on_error=True, max_workers=None)

Return a base folder with a read-only (immutable) local copy of the entire dataset. Downloads and copies / soft-links files from all the parent dataset versions. The dataset needs to be finalized.

  • Parameters

    • use_soft_links (Optional[bool]) – If True, use soft links. Defaults to False on Windows and True on POSIX systems

    • part (Optional[int]) – Optional, if provided only download the selected part (index) of the Dataset. The first part number is 0 and the last part is num_parts-1. Notice, if num_parts is not provided, the number of parts will be equal to the total number of chunks (i.e. sum over all chunks from the specified Dataset including all parent Datasets). This argument is passed to parent datasets, as well as the implicit num_parts, allowing users to get a partial copy of the entire dataset, for multi node/step processing.

    • num_parts (Optional[int]) – Optional, if specified, normalize the number of chunks stored to the requested number of parts. Notice that the actual chunks used per part are rounded down. Example: assuming a total of 8 chunks for this dataset (including parent datasets) and num_parts=5, the chunk indexes used per part would be: part=0 -> chunks[0,5], part=1 -> chunks[1,6], part=2 -> chunks[2,7], part=3 -> chunks[3]

    • raise_on_error (bool) – If True raise exception if dataset merging failed on any file

    • max_workers (Optional[int]) – Number of threads to be spawned when getting the dataset copy. Defaults to the number of logical cores.

  • Return type

    str

  • Returns

    A base folder for the entire dataset
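
The part / num_parts example is consistent with a round-robin assignment in which chunk i goes to part i modulo num_parts. The sketch below implements that reading; it is inferred from the documented example rather than taken from the library source, and `chunks_for_part` is a hypothetical helper.

```python
from typing import List, Optional


def chunks_for_part(
    total_chunks: int, part: int, num_parts: Optional[int] = None
) -> List[int]:
    """Return the chunk indexes a given part would download (round-robin)."""
    if num_parts is None:
        # Documented default: one part per chunk.
        num_parts = total_chunks
    if not 0 <= part < num_parts:
        raise ValueError("part must be in the range [0, num_parts)")
    return [index for index in range(total_chunks) if index % num_parts == part]


for part in range(5):
    print(part, chunks_for_part(total_chunks=8, part=part, num_parts=5))
```

Under this assumption, parts 0 to 3 reproduce the documented chunk lists and part 4 would receive chunk 4. In a multi-node job, each worker would pass its own rank as part and the number of workers as num_parts.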


get_logger

get_logger()

Return a Logger object for the Dataset, allowing users to report statistics metrics and debug samples on the Dataset itself

  • Return type

    Logger

  • Returns

    Logger object


get_mutable_local_copy

get_mutable_local_copy(target_folder, overwrite=False, part=None, num_parts=None, raise_on_error=True, max_workers=None)

Return a base folder with a writable (mutable) local copy of the entire dataset. Downloads and copies / soft-links files from all the parent dataset versions.

  • Parameters

    • target_folder (Union[Path, _Path, str]) – Target folder for the writable copy

    • overwrite (bool) – If True, recursively delete the target folder before creating a copy. If False (default) and target folder contains files, raise exception or return None

    • part (Optional[int]) – Optional, if provided only download the selected part (index) of the Dataset. The first part number is 0 and the last part is num_parts-1. Notice, if num_parts is not provided, the number of parts will be equal to the total number of chunks (i.e. sum over all chunks from the specified Dataset including all parent Datasets). This argument is passed to parent datasets, as well as the implicit num_parts, allowing users to get a partial copy of the entire dataset, for multi node/step processing.

    • num_parts (Optional[int]) – Optional, if specified, normalize the number of chunks stored to the requested number of parts. Notice that the actual chunks used per part are rounded down. Example: assuming a total of 8 chunks for this dataset (including parent datasets) and num_parts=5, the chunk indexes used per part would be: part=0 -> chunks[0,5], part=1 -> chunks[1,6], part=2 -> chunks[2,7], part=3 -> chunks[3]

    • raise_on_error (bool) – If True raise exception if dataset merging failed on any file

    • max_workers (Optional[int]) – Number of threads to be spawned when getting the dataset copy. Defaults to the number of logical cores.

  • Return type

    Optional[str]

  • Returns

    The target folder containing the entire dataset


get_num_chunks

get_num_chunks(include_parents=True)

Return the number of chunks stored on this dataset (this does not imply anything about the number of chunks parent versions store).

  • Parameters

    include_parents (bool) – If True (default), return the total number of chunks from this version and all parent versions. If False, only return the number of chunks stored on this specific version.

  • Return type

    int

  • Returns

    Number of chunks stored on the dataset.


is_dirty

is_dirty()

Return True if the dataset has pending uploads (i.e. we cannot finalize it)

  • Return type

    bool

  • Returns

    True means the dataset has pending uploads; call ‘upload’ to start an upload process.


is_final

is_final()

Return True if the dataset was finalized and cannot be changed any more.

  • Return type

    bool

  • Returns

    True if the dataset is final


link_entries_dict

property link_entries_dict

Notice this call returns an internal representation, do not modify!

  • Return type

    Mapping[str, LinkEntry]

  • Returns

    dict with relative file path as key, and LinkEntry as value


list_added_files

list_added_files(dataset_id=None)

Return a list of files added when comparing to a specific dataset_id

  • Parameters

    dataset_id (Optional[str]) – Dataset ID (str) to compare against. If None is given, compare against the parent datasets

  • Return type

    List[str]

  • Returns

    List of files with relative path (files might not be available locally until get_local_copy() is called)


Dataset.list_datasets

classmethod list_datasets(dataset_project=None, partial_name=None, tags=None, ids=None, only_completed=True, recursive_project_search=True)

Query the list of datasets in the system

  • Parameters

    • dataset_project (Optional[str]) – Specify dataset project name

    • partial_name (Optional[str]) – Specify partial match to a dataset name

    • tags (Optional[Sequence[str]]) – Specify user tags

    • ids (Optional[Sequence[str]]) – List specific dataset based on IDs list

    • only_completed (bool) – If False, also return datasets that are still in progress (uploading/edited etc.)

    • recursive_project_search (bool) – If True and the dataset_project argument is set, search inside subprojects as well. If False, don’t search inside subprojects (except for the special .datasets subproject)

  • Return type

    List[dict]

  • Returns

    List of dictionaries with dataset information. Example: [{'name': name, 'project': project name, 'id': dataset_id, 'created': date_created},]
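
Since list_datasets returns plain dictionaries, picking a dataset out of the result is ordinary list processing. The sketch below looks up the most recently created match; it assumes the 'created' values are mutually comparable (for example datetime objects), and `newest_dataset_id` is a hypothetical helper that receives the Dataset class as an argument.

```python
from typing import Optional


def newest_dataset_id(dataset_cls, project: str, partial_name: str) -> Optional[str]:
    """Return the ID of the most recently created matching dataset, if any."""
    datasets = dataset_cls.list_datasets(
        dataset_project=project,
        partial_name=partial_name,
        only_completed=True,
    )
    if not datasets:
        return None
    # Each entry is a dict with 'name', 'project', 'id' and 'created' keys.
    newest = max(datasets, key=lambda entry: entry["created"])
    return newest["id"]
```

The returned ID could then be passed as dataset_id to Dataset.get.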


list_files

list_files(dataset_path=None, recursive=True, dataset_id=None)

Return a list of files in the current dataset. If dataset_id is provided, return a list of files that remained unchanged since the specified dataset_id.

  • Parameters

    • dataset_path (Optional[str]) – Only match files matching the dataset_path (including wildcards). Example: ‘folder/sub/*.json’

    • recursive (bool) – If True (default), match dataset_path recursively

    • dataset_id (Optional[str]) – Filter list based on the dataset id containing the latest version of the file. Default: None, do not filter files based on parent dataset.

  • Return type

    List[str]

  • Returns

    List of files with relative path (files might not be available locally until get_local_copy() is called)


list_modified_files

list_modified_files(dataset_id=None)

Return a list of files modified when comparing to a specific dataset_id

  • Parameters

    dataset_id (Optional[str]) – Dataset ID (str) to compare against. If None is given, compare against the parent datasets

  • Return type

    List[str]

  • Returns

    List of files with relative path (files might not be available locally until get_local_copy() is called)


list_removed_files

list_removed_files(dataset_id=None)

Return a list of files removed when comparing to a specific dataset_id

  • Parameters

    dataset_id (Optional[str]) – Dataset ID (str) to compare against. If None is given, compare against the parent datasets

  • Return type

    List[str]

  • Returns

    List of files with relative path (files might not be available locally until get_local_copy() is called)
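
The three comparisons (list_added_files, list_modified_files, list_removed_files) amount to a diff between two file listings. A minimal model, assuming each dataset version is represented as a mapping from relative path to file hash, is sketched below; `diff_file_entries` is a hypothetical helper and not part of ClearML.

```python
from typing import Dict, List, Tuple


def diff_file_entries(
    current: Dict[str, str], reference: Dict[str, str]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (added, modified, removed) relative paths versus a reference."""
    added = sorted(path for path in current if path not in reference)
    removed = sorted(path for path in reference if path not in current)
    # Modified: same relative path, different hash.
    modified = sorted(
        path
        for path in current
        if path in reference and current[path] != reference[path]
    )
    return added, modified, removed


current = {"a.txt": "h1", "b.txt": "h2-new", "d.txt": "h4"}
reference = {"a.txt": "h1", "b.txt": "h2", "c.txt": "h3"}
print(diff_file_entries(current, reference))
```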


Dataset.move_to_project

classmethod move_to_project(new_dataset_project, dataset_project, dataset_name)

Move the dataset to another project.

  • Parameters

    • new_dataset_project – New project to move the dataset(s) to

    • dataset_project – Project of the dataset(s) to move to new project

    • dataset_name – Name of the dataset(s) to move to new project

  • Return type

    ()


publish

publish(raise_on_error=True)

Publish the dataset. If the dataset is not finalized, throw an exception.

  • Parameters

    raise_on_error (bool) – If True raise exception if dataset publishing failed

  • Return type

    bool


remove_files

remove_files(dataset_path=None, recursive=True, verbose=False)

Remove files from the current dataset

  • Parameters

    • dataset_path (Optional[str]) – Remove files from the dataset. The path is always relative to the dataset (e.g. ‘folder/file.bin’). External files can also be removed by their links (e.g. ‘s3://bucket/file’)

    • recursive (bool) – If True, match all wildcard files recursively

    • verbose (bool) – If True, print to console the files removed

  • Return type

    int

  • Returns

    Number of files removed


Dataset.rename

classmethod rename(new_dataset_name, dataset_project, dataset_name)

Rename the dataset.

  • Parameters

    • new_dataset_name – The new name of the dataset(s) to be renamed

    • dataset_project – The project the dataset(s) to be renamed belong(s) to

    • dataset_name – The name of the dataset(s) (before renaming)

  • Return type

    ()


set_description

set_description(description)

Set description of the dataset

  • Parameters

    description (str) – Description to be set

  • Return type

    ()


Dataset.squash

classmethod squash(dataset_name, dataset_ids=None, dataset_project_name_pairs=None, output_url=None)

Generate a new dataset from the squashed set of dataset versions. If a single version is given, it will squash to the root (i.e. create a single standalone version). If a set of versions is given, it will squash the versions' diff into a single version.

  • Parameters

    • dataset_name (str) – Target name for the newly generated squashed dataset

    • dataset_ids (Optional[Sequence[Union[str, Dataset]]]) – List of dataset IDs (or objects) to squash. Notice that order does matter. The versions are merged from first to last.

    • dataset_project_name_pairs (Optional[Sequence[(str, str)]]) – List of pairs (project_name, dataset_name) to squash. Notice that order does matter. The versions are merged from first to last.

    • output_url (Optional[str]) – Target storage for the compressed dataset (default: file server). Examples: s3://bucket/data, gs://bucket/data, azure://bucket/data, /mnt/share/data

  • Return type

    Dataset

  • Returns

    Newly created dataset object.


sync_folder

sync_folder(local_path, dataset_path=None, verbose=False)

Synchronize the dataset with a local folder. The dataset is synchronized from the relative_base_folder (default: dataset root) and deeper with the specified local path. Note that if a remote file is identified as being modified when syncing, it will be added as a FileEntry, ready to be uploaded to the ClearML server. This version of the file is considered “newer” and it will be downloaded instead of the one stored at its remote address when calling Dataset.get_local_copy().

  • Parameters

    • local_path (Union[Path, _Path, str]) – Local folder to sync (assumes all files and recursive)

    • dataset_path (Union[Path, _Path, str]) – Target dataset path to sync with (default the root of the dataset)

    • verbose (bool) – If True, print to console the files added/modified/removed

  • Return type

    (int, int)

  • Returns

    number of files removed, number of files modified/added
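
One way to use the returned pair is to upload and finalize only when the sync actually changed something. The sketch below takes an already writable dataset object and relies only on methods listed on this page; `sync_and_finalize` is a hypothetical helper, and leaving an unchanged dataset open is a design choice of this example rather than documented behavior.

```python
from typing import Optional, Tuple


def sync_and_finalize(
    dataset, local_path: str, dataset_path: Optional[str] = None
) -> Tuple[int, int]:
    """Sync a local folder into the dataset and finalize if anything changed."""
    removed, changed = dataset.sync_folder(
        local_path=local_path, dataset_path=dataset_path
    )
    if removed == 0 and changed == 0:
        # Nothing to record, so leave the dataset open for further edits.
        return removed, changed
    if dataset.is_dirty():
        dataset.upload()  # push added/modified files before finalizing
    dataset.finalize()
    return removed, changed
```

The tuple is unpacked in the documented order: number of files removed first, then number of files modified/added.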


update_changed_files

update_changed_files(num_files_added=None, num_files_modified=None, num_files_removed=None)

Update the internal state keeping track of added, modified and removed files.

  • Parameters

    • num_files_added – Number of files added when compared to the parent dataset

    • num_files_modified – Number of files with the same name but a different hash when compared to the parent dataset

    • num_files_removed – Number of files removed when compared to the parent dataset


upload

upload(show_progress=True, verbose=False, output_url=None, compression=None, chunk_size=None, max_workers=None, retries=3)

Start file uploading, the function returns when all files are uploaded.

  • Parameters

    • show_progress (bool) – If True, show upload progress bar

    • verbose (bool) – If True, print verbose progress report

    • output_url (Optional[str]) – Target storage for the compressed dataset (default: file server). Examples: s3://bucket/data, gs://bucket/data, azure://bucket/data, /mnt/share/data

    • compression (Optional[str]) – Compression algorithm for the zipped dataset file (default: ZIP_DEFLATED)

    • chunk_size (int) – Artifact chunk size (MB) for the compressed dataset. If not provided (None), use the default chunk size (512 MB). If -1 is provided, use a single zip artifact for the entire dataset change-set (old behaviour)

    • max_workers (Optional[int]) – Number of threads to be spawned when zipping and uploading the files. If None (default), it will be set to:

      - 1: if the upload destination is a cloud provider (‘s3’, ‘gs’, ‘azure’)
      - number of logical cores: otherwise

    • retries (int) – Number of retries before failing to upload each zip. If 0, the upload is not retried.

  • Raise

    If the upload failed (i.e. at least one zip failed to upload), raise a ValueError

  • Return type

    ()
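
The default for max_workers described in the parameter list can be expressed as a short function. This is a sketch of the stated rule only, assuming that a cloud destination is recognized by its URL scheme; `default_upload_workers` is a hypothetical name and not a ClearML function.

```python
import os
from typing import Optional

# URL schemes treated as cloud providers in this sketch.
CLOUD_SCHEMES = ("s3://", "gs://", "azure://")


def default_upload_workers(
    output_url: Optional[str] = None, max_workers: Optional[int] = None
) -> int:
    """Resolve the number of upload threads following the documented default."""
    if max_workers is not None:
        return max_workers  # an explicit value always wins
    if output_url and output_url.startswith(CLOUD_SCHEMES):
        return 1  # cloud destination: single worker
    # Shared folders and the default file server: one worker per logical core.
    return os.cpu_count() or 1


print(default_upload_workers("s3://bucket/data"))
```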


verify_dataset_hash

verify_dataset_hash(local_copy_path=None, skip_hash=False, verbose=False)

Verify the current copy of the dataset against the stored hash

  • Parameters

    • local_copy_path (Optional[str]) – Specify a local path containing a copy of the dataset. If not provided, use the cached folder

    • skip_hash (bool) – If True, skip hash checks and verify file size only

    • verbose (bool) – If True print errors while testing dataset files hash

  • Return type

    List[str]

  • Returns

    List of files with unmatched hashes
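
The effect of skip_hash can be illustrated with a standalone verification routine. The sketch assumes the stored metadata is a mapping from relative path to a (size, SHA-256 digest) pair, which is a simplification chosen for this example; `find_mismatched_files` is a hypothetical helper.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def find_mismatched_files(
    local_copy_path: str,
    expected: Dict[str, Tuple[int, str]],
    skip_hash: bool = False,
) -> List[str]:
    """Return relative paths whose local file does not match the metadata."""
    root = Path(local_copy_path)
    mismatched = []
    for relative, (size, digest) in sorted(expected.items()):
        file_path = root / relative
        # A missing file or a size difference is always a mismatch.
        if not file_path.is_file() or file_path.stat().st_size != size:
            mismatched.append(relative)
            continue
        # The content hash is only checked when skip_hash is False.
        if not skip_hash:
            actual = hashlib.sha256(file_path.read_bytes()).hexdigest()
            if actual != digest:
                mismatched.append(relative)
    return mismatched
```

A file with the right size but altered content is caught only when skip_hash is False, which is the trade-off between a fast size-only check and a full verification.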