Dataset

class Dataset()#

Do not use directly! Use Dataset.create(…) or Dataset.get(…) instead.


add_files#

add_files(path, wildcard=None, local_base_folder=None, dataset_path=None, recursive=True, verbose=False)

Add a folder/file to the current dataset. Calculate each file's hash, compare it against the parent datasets, and mark the files that need to be uploaded.

  • Parameters

    • path (Union[str, Path, _Path]) – Add a folder/file to the dataset

    • wildcard (Optional[Union[str, Sequence[str]]]) – Add only a specific set of files. Wildcard matching; can be a single string or a list of wildcards

    • local_base_folder (Optional[str]) – Files will be located based on their relative path from local_base_folder

    • dataset_path (Optional[str]) – Where in the dataset the folder/files should be located

    • recursive (bool) – If True, match all wildcard files recursively

    • verbose (bool) – If True, print to console the files added/modified

  • Return type

    ()

  • Returns

    number of files added
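
A minimal usage sketch (the project name, dataset name, and local path are illustrative):

from clearml import Dataset

# create a new dataset version and register the local JPEG files under it
dataset = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
dataset.add_files(path="/data/raw/images", wildcard="*.jpg", recursive=True, verbose=True)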


Dataset.create#

classmethod create(dataset_name=None, dataset_project=None, dataset_tags=None, parent_datasets=None, use_current_task=False)

Create a new dataset. Multiple dataset parents are supported. Merging of parent datasets is done based on their order, where each parent can override overlapping files in the previous one.

  • Parameters

    • dataset_name (Optional[str]) – Name of the new dataset

    • dataset_project (Optional[str]) – Project containing the dataset. If not specified, infer the project name from the parent datasets

    • dataset_tags (Optional[Sequence[str]]) – Optional, list of tags (strings) to attach to the newly created Dataset

    • parent_datasets (Optional[Sequence[Union[str, Dataset]]]) – Expand a parent dataset by adding/removing files

    • use_current_task (bool) – False (default), a new Dataset task is created. If True, the dataset is created on the current Task.

  • Return type

    Dataset

  • Returns

    Newly created Dataset object
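
A sketch of creating a new version on top of an existing one (the names and parent ID are placeholders):

from clearml import Dataset

# the child version starts from the files of the given parent and can override them
child = Dataset.create(
    dataset_name="my_dataset_v2",
    dataset_project="my_project",
    parent_datasets=["<parent_dataset_id>"],
)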


Dataset.delete#

classmethod delete(dataset_id=None, dataset_project=None, dataset_name=None, force=False)

Delete a dataset. Raise an exception if the dataset is used by other dataset versions. Use force=True to forcefully delete the dataset.

  • Parameters

    • dataset_id (Optional[str]) – Dataset ID to delete

    • dataset_project (Optional[str]) – Project containing the dataset

    • dataset_name (Optional[str]) – Name of the dataset to delete

    • force (bool) – If True, delete even if other datasets depend on the specified dataset version

  • Return type

    ()
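
A hedged example (project and dataset names are illustrative):

from clearml import Dataset

# delete by project/name; set force=True to remove it even if other versions depend on it
Dataset.delete(dataset_project="my_project", dataset_name="my_dataset", force=False)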


file_entries_dict#

property file_entries_dict

Notice this call returns an internal representation, do not modify!

  • Return type

    Mapping[str, FileEntry]

  • Returns

    Dict with relative file path as key, and FileEntry as value
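
For example, the entries of an existing dataset object can be inspected as follows (a sketch that relies only on the documented Mapping[str, FileEntry] return value):

# "dataset" is a Dataset instance obtained via Dataset.create(...) or Dataset.get(...)
for relative_path, file_entry in dataset.file_entries_dict.items():
    # keys are paths relative to the dataset root, values are FileEntry objects
    print(relative_path, file_entry)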


finalize#

finalize(verbose=False, raise_on_error=True)

Finalize the dataset, publishing the dataset Task. upload() must first be called to verify there are no pending uploads. If files still need to be uploaded, an exception is raised (or False is returned).

  • Parameters

    • verbose (bool) – If True print verbose progress report

    • raise_on_error (bool) – If True raise exception if dataset finalizing failed

  • Return type

    bool


Dataset.get#

classmethod get(dataset_id=None, dataset_project=None, dataset_name=None, dataset_tags=None, only_completed=False, only_published=False)

Get a specific Dataset. If only dataset_project is given, return the last Dataset in the Dataset project

  • Parameters

    • dataset_id (Optional[str]) – Requested Dataset ID

    • dataset_project (Optional[str]) – Requested Dataset project name

    • dataset_name (Optional[str]) – Requested Dataset name

    • dataset_tags (Optional[Sequence[str]]) – Requested Dataset tags (list of tag strings)

    • only_completed (bool) – Return only if the requested dataset is completed or published

    • only_published (bool) – Return only if the requested dataset is published

  • Return type

    Dataset

  • Returns

    Dataset object
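
A minimal retrieval sketch (the project/dataset names and the ID are placeholders):

from clearml import Dataset

# fetch the latest completed version by project and name
dataset = Dataset.get(dataset_project="my_project", dataset_name="my_dataset", only_completed=True)

# or fetch a specific version directly by its ID
dataset = Dataset.get(dataset_id="<dataset_id>")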


get_default_storage#

get_default_storage()

Return the default storage location of the dataset

  • Return type

    Optional[str]

  • Returns

    URL for the default storage location


get_dependency_graph#

get_dependency_graph()

Return the DAG of the dataset dependencies (all previous dataset versions and their parents).

Example:

{
'current_dataset_id': ['parent_1_id', 'parent_2_id'],
'parent_2_id': ['parent_1_id'],
'parent_1_id': [],
}
  • Returns

    Dict representing the genealogy DAG of the current dataset
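
The returned dict can be walked directly, for example:

# "dataset" is a Dataset instance obtained via Dataset.get(...)
graph = dataset.get_dependency_graph()
for dataset_id, parent_ids in graph.items():
    # each key is a dataset version ID, each value is the list of its parent IDs
    print(dataset_id, "->", parent_ids)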


get_local_copy#

get_local_copy(use_soft_links=None, part=None, num_parts=None, raise_on_error=True)

Return a base folder with a read-only (immutable) local copy of the entire dataset.

Download and copy / soft-link files from all the parent dataset versions.
  • Parameters

    • use_soft_links (Optional[bool]) – If True, use soft links. Default: False on Windows, True on POSIX systems

    • part (Optional[int]) – Optional, if provided download only the selected part (index) of the Dataset. The first part number is 0 and the last part is num_parts-1. Notice, if num_parts is not provided, the number of parts will be equal to the total number of chunks (i.e. sum over all chunks from the specified Dataset including all parent Datasets). This argument is passed to parent datasets, as well as the implicit num_parts, allowing users to get a partial copy of the entire dataset, for multi-node/step processing.

    • num_parts (Optional[int]) – Optional, if specified normalize the number of chunks stored to the requested number of parts. Notice that the actual chunks used per part are rounded down. Example: assuming a total of 8 chunks for this dataset (including parent datasets), and num_parts=5, the chunk index used per part would be: part=0 -> chunks[0,5], part=1 -> chunks[1,6], part=2 -> chunks[2,7], part=3 -> chunks[3, ]

    • raise_on_error (bool) – If True raise exception if dataset merging failed on any file

  • Return type

    str

  • Returns

    A base folder for the entire dataset
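
A sketch of fetching a partial copy for multi-node processing (the worker rank and count are illustrative):

# "dataset" is a Dataset instance; each of 4 workers downloads only its own share of the chunks
worker_rank, worker_count = 0, 4
local_folder = dataset.get_local_copy(part=worker_rank, num_parts=worker_count)
print("read-only copy available under", local_folder)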


get_logger#

get_logger()

Return a Logger object for the Dataset, allowing users to report statistics metrics and debug samples on the Dataset itself

  • Return type

    Logger

  • Returns

    Logger object


get_mutable_local_copy#

get_mutable_local_copy(target_folder, overwrite=False, part=None, num_parts=None, raise_on_error=True)

Return a base folder with a writable (mutable) local copy of the entire dataset.

Download and copy / soft-link files from all the parent dataset versions.
  • Parameters

    • target_folder (Union[Path, _Path, str]) – Target folder for the writable copy

    • overwrite (bool) – If True, recursively delete the target folder before creating a copy. If False (default) and the target folder contains files, raise an exception or return None

    • part (Optional[int]) – Optional, if provided download only the selected part (index) of the Dataset. The first part number is 0 and the last part is num_parts-1. Notice, if num_parts is not provided, the number of parts will be equal to the total number of chunks (i.e. sum over all chunks from the specified Dataset including all parent Datasets). This argument is passed to parent datasets, as well as the implicit num_parts, allowing users to get a partial copy of the entire dataset, for multi-node/step processing.

    • num_parts (Optional[int]) – Optional, if specified normalize the number of chunks stored to the requested number of parts. Notice that the actual chunks used per part are rounded down. Example: assuming a total of 8 chunks for this dataset (including parent datasets), and num_parts=5, the chunk index used per part would be: part=0 -> chunks[0,5], part=1 -> chunks[1,6], part=2 -> chunks[2,7], part=3 -> chunks[3, ]

    • raise_on_error (bool) – If True raise exception if dataset merging failed on any file

  • Return type

    Optional[str]

  • Returns

    The target folder containing the entire dataset
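
A hedged example (the target folder path is illustrative):

# copy the dataset into a writable folder; with overwrite=False and raise_on_error=False,
# a non-empty target folder makes the call return None instead of raising
target = dataset.get_mutable_local_copy(
    target_folder="/tmp/my_dataset_copy", overwrite=False, raise_on_error=False
)
if target is None:
    print("target folder already contains files")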


get_num_chunks#

get_num_chunks(include_parents=True)

Return the number of chunks stored on this dataset (it does not imply the number of chunks stored by parent versions).

  • Parameters

    include_parents (bool) – If True (default), return the total number of chunks from this version and all parent versions. If False, return only the number of chunks stored on this specific version.

  • Return type

    int

  • Returns

    Number of chunks stored on the dataset.


is_dirty#

is_dirty()

Return True if the dataset has pending uploads (i.e. we cannot finalize it)

  • Return type

    bool

  • Returns

    True means the dataset has pending uploads; call upload() to start the upload process.


is_final#

is_final()

Return True if the dataset was finalized and cannot be changed any more.

  • Return type

    bool

  • Returns

    True if the dataset is final


list_added_files#

list_added_files(dataset_id=None)

Return a list of files added when comparing to a specific dataset version.

  • Parameters

    dataset_id (Optional[str]) – Dataset ID (str) to compare against. If None is given, compare against the parent datasets

  • Return type

    List[str]

  • Returns

    List of files with relative path (files might not be available locally until get_local_copy() is called)


Dataset.list_datasets#

classmethod list_datasets(dataset_project=None, partial_name=None, tags=None, ids=None, only_completed=True)

Query the list of datasets in the system.

  • Parameters

    • dataset_project (Optional[str]) – Specify dataset project name

    • partial_name (Optional[str]) – Specify partial match to a dataset name

    • tags (Optional[Sequence[str]]) – Specify user tags

    • ids (Optional[Sequence[str]]) – List specific dataset based on IDs list

    • only_completed (bool) – If False, also return datasets that are still in progress (uploading/being edited, etc.)

  • Return type

    List[dict]

  • Returns

    List of dictionaries with dataset information. Example: [{'name': name, 'project': project name, 'id': dataset_id, 'created': date_created},]
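
A query sketch (the project name and the name fragment are illustrative):

from clearml import Dataset

datasets = Dataset.list_datasets(dataset_project="my_project", partial_name="images", only_completed=True)
for entry in datasets:
    print(entry["id"], entry["name"], entry["created"])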


list_files#

list_files(dataset_path=None, recursive=True, dataset_id=None)

Return a list of files in the current dataset. If dataset_id is provided, return a list of files that remained unchanged since the specified dataset version.

  • Parameters

    • dataset_path (Optional[str]) – Only match files matching the dataset_path (including wildcards). Example: 'folder/sub/*.json'

    • recursive (bool) – If True (default), match dataset_path recursively

    • dataset_id (Optional[str]) – Filter list based on the dataset id containing the latest version of the file. Default: None, do not filter files based on parent dataset.

  • Return type

    List[str]

  • Returns

    List of files with relative path (files might not be available locally until get_local_copy() is called)
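
For example, to list only the JSON files under a sub-folder (the path pattern is illustrative):

# "dataset" is a Dataset instance obtained via Dataset.get(...)
json_files = dataset.list_files(dataset_path="folder/sub/*.json", recursive=True)
for relative_path in json_files:
    print(relative_path)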


list_modified_files#

list_modified_files(dataset_id=None)

Return a list of files modified when comparing to a specific dataset version.

  • Parameters

    dataset_id (Optional[str]) – Dataset ID (str) to compare against. If None is given, compare against the parent datasets

  • Return type

    List[str]

  • Returns

    List of files with relative path (files might not be available locally until get_local_copy() is called)


list_removed_files#

list_removed_files(dataset_id=None)

Return a list of files removed when comparing to a specific dataset version.

  • Parameters

    dataset_id (Optional[str]) – Dataset ID (str) to compare against. If None is given, compare against the parent datasets

  • Return type

    List[str]

  • Returns

    List of files with relative path (files might not be available locally until get_local_copy() is called)


publish#

publish(raise_on_error=True)

Publish the dataset. If the dataset is not finalized, raise an exception.

  • Parameters

    raise_on_error (bool) – If True raise exception if dataset publishing failed

  • Return type

    bool


remove_files#

remove_files(dataset_path=None, recursive=True, verbose=False)

Remove files from the current dataset

  • Parameters

    • dataset_path (Optional[str]) – Remove files from the dataset. The path is always relative to the dataset (e.g. 'folder/file.bin')

    • recursive (bool) – If True, match all wildcard files recursively

    • verbose (bool) – If True, print to console the files removed

  • Return type

    int

  • Returns

    Number of files removed


Dataset.squash#

classmethod squash(dataset_name, dataset_ids=None, dataset_project_name_pairs=None, output_url=None)

Generate a new dataset from the squashed set of dataset versions. If a single version is given, it will be squashed to the root (i.e. create a single standalone version). If a set of versions is given, their diffs will be squashed into a single version.

  • Parameters

    • dataset_name (str) – Target name for the newly generated squashed dataset

    • dataset_ids (Optional[Sequence[Union[str, Dataset]]]) – List of dataset IDs (or objects) to squash. Notice order does matter. The versions are merged from first to last.

    • dataset_project_name_pairs (Optional[Sequence[(str, str)]]) – List of pairs (project_name, dataset_name) to squash. Notice order does matter. The versions are merged from first to last.

    • output_url (Optional[str]) – Target storage for the compressed dataset (default: file server). Examples: s3://bucket/data, gs://bucket/data, azure://bucket/data, /mnt/share/data

  • Return type

    Dataset

  • Returns

    Newly created dataset object.
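
A squash sketch (the target name and version IDs are placeholders):

from clearml import Dataset

# merge two dataset versions, in order, into a single standalone version
squashed = Dataset.squash(
    dataset_name="my_dataset_squashed",
    dataset_ids=["<older_version_id>", "<newer_version_id>"],
)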


sync_folder#

sync_folder(local_path, dataset_path=None, verbose=False)

Synchronize the dataset with a local folder. The dataset is synchronized from the relative_base_folder (default: dataset root) and deeper with the specified local path.
  • Parameters

    • local_path (Union[Path, _Path, str]) – Local folder to sync (assumes all files and recursive)

    • dataset_path (Union[Path, _Path, str]) – Target dataset path to sync with (default: the root of the dataset)

    • verbose (bool) – If True, print to console the files added/modified/removed

  • Return type

    (int, int)

  • Returns

    Number of files removed, number of files modified/added
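
A sync sketch (the local folder is illustrative):

# "dataset" is a Dataset instance; mirror a local folder into the dataset root and report what changed
removed, modified_or_added = dataset.sync_folder(local_path="/data/raw/images", verbose=True)
print(removed, "files removed,", modified_or_added, "files modified/added")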


upload#

upload(show_progress=True, verbose=False, output_url=None, compression=None, chunk_size=None)

Start uploading the files; the function returns when all files have been uploaded.

  • Parameters

    • show_progress (bool) – If True, show the upload progress bar

    • verbose (bool) – If True, print a verbose progress report

    • output_url (Optional[str]) – Target storage for the compressed dataset (default: file server). Examples: s3://bucket/data, gs://bucket/data, azure://bucket/data, /mnt/share/data

    • compression (Optional[str]) – Compression algorithm for the zipped dataset file (default: ZIP_DEFLATED)

    • chunk_size (int) – Artifact chunk size (MB) for the compressed dataset. If not provided (None), use the default chunk size (512 MB). If -1 is provided, use a single zip artifact for the entire dataset change-set (old behaviour)

  • Return type

    ()
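
A typical close-out sequence, uploading to a non-default storage target (the bucket URL is illustrative):

# "dataset" is a Dataset instance; push pending files to object storage, then finalize the version
dataset.upload(show_progress=True, output_url="s3://my-bucket/datasets")
dataset.finalize()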


verify_dataset_hash#

verify_dataset_hash(local_copy_path=None, skip_hash=False, verbose=False)

Verify the current copy of the dataset against the stored hash

  • Parameters

    • local_copy_path (Optional[str]) – Specify a local path containing a copy of the dataset. If not provided, use the cached folder

    • skip_hash (bool) – If True, skip hash checks and verify file size only

    • verbose (bool) – If True, print errors while verifying the dataset file hashes

  • Return type

    List[str]

  • Returns

    List of files with unmatched hashes
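
A verification sketch against the cached local copy:

# "dataset" is a Dataset instance; an empty list means every file matched its stored hash
bad_files = dataset.verify_dataset_hash(skip_hash=False, verbose=True)
if bad_files:
    print("files with unmatched hashes:", bad_files)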