DatasetVersion

class datasetversion.DatasetVersion()

DatasetVersion represents a specific version in a dataset.

danger

Do not instantiate directly. Use the DatasetVersion.get_version method instead.


BulkContext

class BulkContext

A context manager to modify frames (i.e. add/update/remove) in bulk.

Use DatasetVersion.get_bulk_context to obtain.

The bulk context lets you add, update, or delete frames one at a time, while the actual update requests are sent in bulk. An update request (flush) is sent every flush_threshold updates, or upon __exit__.

Create Bulk context for automatically flushing frames

  • Parameters

    • dv (DatasetVersion) – DatasetVersion object to use

    • flush_threshold (Optional[int]) – If provided, flush every flush_threshold frames

    • log (Optional[Logger]) – Optional external logger

    • refresh_version_stats (Optional[bool]) – Automatically refresh version statistics (default: True)

    • auto_upload_destination (Optional[str]) – If specified, any local file linked by a SingleFrame/FrameGroup will be automatically uploaded to the destination storage.

    • local_dataset_root_path (Union[str, Path, None]) – Required if auto_upload_destination is provided. It should point to the common folder for all local source files.

    • allow_update (bool) – If False (default), all frame operations use the “add” action and “update” is not used (i.e. even frames collected using the BulkContext.update_frame() call are added, not updated). This is an advanced setting; change it only if you understand the limitations of using update. Note that when using update, the provided frame data is merged with the existing indexed frame data, which means frame fields cannot be removed with the update operation.
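
The buffering behavior described above can be pictured with a small stand-alone model. This is only a sketch and not the library's implementation: the class name ToyBulkBuffer and its send callback are hypothetical, invented for illustration, and nothing is sent to a server.

```python
from typing import Any, Callable, List, Optional, Tuple

Operation = Tuple[str, Any]


class ToyBulkBuffer:
    """Hypothetical stand-in that mimics the documented flush behavior."""

    def __init__(self, send: Callable[[List[Operation]], None],
                 flush_threshold: Optional[int] = None) -> None:
        self._send = send                      # called once per bulk request
        self._flush_threshold = flush_threshold
        self._pending: List[Operation] = []

    def __enter__(self) -> "ToyBulkBuffer":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Leaving the context always sends whatever is still queued.
        self.flush()

    def _queue(self, action: str, frame: Any) -> None:
        self._pending.append((action, frame))
        if self._flush_threshold and len(self._pending) >= self._flush_threshold:
            self.flush()

    def add_frame(self, frame: Any) -> None:
        self._queue("add", frame)

    def update_frame(self, frame: Any) -> None:
        self._queue("update", frame)

    def delete_frame(self, frame: Any) -> None:
        self._queue("delete", frame)

    def flush(self) -> None:
        if self._pending:
            self._send(list(self._pending))
            self._pending.clear()


sent_batches: List[List[Operation]] = []
with ToyBulkBuffer(sent_batches.append, flush_threshold=2) as bulk:
    for frame_id in ("f1", "f2", "f3"):
        bulk.add_frame(frame_id)

batch_sizes = [len(batch) for batch in sent_batches]
```

With a threshold of 2 and three queued frames, the model sends one request when the threshold is reached and a second one on exit.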


add_frame

BulkContext.add_frame(frame, warn_on_duplicate_frames=False)

Add a frame to the current DatasetVersion. NOTICE! If the frame already contains a frame.id field, it will update (overwrite) the existing frame with that ID. If not provided, frame.id is generated based on the source URI. If a local file should be uploaded but has already been uploaded previously, the existing URI for that file is reused; otherwise the file is uploaded.

info

Only available if version is still in draft (writable) mode

  • Parameters

    • frame (DatasetVersion.Frame ) – The frame to add to the version.

    • warn_on_duplicate_frames (Optional[bool]) – If True, issue a warning when adding a frame with an ID that was previously added to this instance (default False)

  • Return type

    None


delete_frame

BulkContext.delete_frame(frame, delete_sources=False)

Delete a frame from the current DatasetVersion.

The frame may be represented by an ID string or a DatasetVersion.Frame object. Frames are deleted by their IDs; all other frame attributes (if any) are ignored.

info

Only available if version is still in draft (writable) mode.

  • Parameters

    • frame (Union[FrameGroup, SingleFrame, str, ForwardRef]) – The frame to delete (frame object or ID string)

    • delete_sources (bool) – Delete sources associated with the deleted frames in the dataset. Supported source locations are: s3, gs and azure. In case a connection cannot be established with the cloud provider or a source deletion failed, the operation will abort.

  • Return type

    None


flush

BulkContext.flush()

Send any outstanding version changes.

Any updates made using this BulkContext are sent to the server.

  • Return type

    None


update_frame

BulkContext.update_frame(frame)

Update an existing frame in the current DatasetVersion.

Find the frame by its ID, and change its properties to match those of the frame object passed in frame. Frames exist in a version if they were previously added (e.g. by update_frame), or if they exist in a parent version. If the frame object does not have an ID, a new frame is created.

info

Only available if version is still in draft (writable) mode.

  • Parameters

    frame (DatasetVersion.Frame ) – The frame to update.

  • Return type

    None


version_id

property version_id

Version ID string of this specific dataset/version

  • Return type

    str


version_name

property version_name

Dataset version name, not necessarily unique

  • Return type

    str


dataset_id

property dataset_id

Dataset ID string of this specific dataset

  • Return type

    str


dataset_name

property dataset_name

Dataset name, must be a unique name

  • Return type

    str


draft

property draft

Draft flag of the dataset/version: True if this version is still writable, False if it is locked and cannot be changed.

  • Return type

    bool


last_updated

property last_updated

Return the timestamp of the last updated frame in the dataset version

  • Return type

    datetime


comment

property comment

Return the string comment of the specific Dataset Version

  • Return type

    str


DatasetVersion.create_new_dataset

classmethod create_new_dataset(dataset_name=None, description=None, tags=None, raise_if_exists=False, dataset_project=None)

Create a new dataset in the system and return a Dataset object for it.

  • Parameters

    • dataset_name (str ) – The name of the new dataset.

    • description (str ) – A free text to describe the dataset.

    • tags (list ) – A list of tags (short strings) to classify the dataset.

    • raise_if_exists (bool ) – If False (the default) and a dataset named dataset_name already exists, return the existing Dataset. If True and a dataset named dataset_name already exists, raise a ValueError exception.

    • dataset_project (str ) – A project name for the newly created dataset.

  • Return type

    Dataset

  • Returns

    A new Dataset object for the newly created dataset.


DatasetVersion.get_current

classmethod get_current(dataset_id=None, dataset_name=None, auto_upload_destination=None, local_dataset_root_path=None, dataset_project=None)

Return a DatasetVersion object for the current write-enabled version of the dataset

  • Parameters

    • dataset_id (str ) – The ID of the dataset of the version to retrieve.

    • dataset_name (str ) – The name of the dataset of the version to retrieve.

    • auto_upload_destination (str ) – If specified, any local file linked by a SingleFrame/FrameGroup will be automatically uploaded to the destination storage.

    • local_dataset_root_path (Optional[Union[str, pathlib2.Path]]) – Required if auto_upload_destination is provided. It should point to the common folder for all local source files.

    • dataset_project (Optional[str]) – The project of the dataset to retrieve.

  • Return type

    DatasetVersion

  • Returns

    DatasetVersion object representing the selected version.

info

dataset_id and dataset_name are mutually exclusive. Setting both to non-None values will raise a UsageError exception.


DatasetVersion.remove_version

classmethod remove_version(dataset_id=None, dataset_name=None, version_id=None, version_name=None, force=False, dataset_project=None)

Remove a dataset’s version from the system.

info

dataset_id and dataset_name are mutually exclusive. Setting both to non-None values will raise a UsageError exception.

info

version_id and version_name are mutually exclusive. Setting both to non-None values will raise a UsageError exception.

  • Parameters

    • dataset_id (str ) – The ID of the dataset that contains the version to be removed.

    • dataset_name (str ) – The name of the dataset that contains the version to be removed.

    • version_id (str ) – The ID of the version to be removed.

    • version_name (str ) – The name of the version to be removed.

    • force (bool ) – If True, delete the version even if it is published. Default: False

    • dataset_project (str ) – The project of the dataset that contains the version to be removed.

  • Return type

    None


DatasetVersion.get_version

classmethod get_version(dataset_id=None, dataset_name=None, version_id=None, version_name=None, auto_upload_destination=None, local_dataset_root_path=None, raise_on_multiple=False, dataset_project=None)

Return a DatasetVersion object for a specific version

info

If no version name/id is provided, the current version of the dataset is returned.

info

dataset_id and dataset_name are mutually exclusive. Setting both to non-None values will raise a UsageError exception.

  • Parameters

    • dataset_id (str ) – The ID of the dataset of the version to retrieve.

    • dataset_name (str ) – The name of the dataset of the version to retrieve.

    • version_id (str ) – [optional] The ID of the version to retrieve.

    • version_name (str ) – [optional] The name of the version to retrieve.

    • auto_upload_destination (str ) – If specified, any local file linked by a SingleFrame/FrameGroup will be automatically uploaded to the destination storage.

    • local_dataset_root_path (Optional[Union[str, pathlib2.Path]]) – Required if auto_upload_destination is provided. It should point to the common folder for all local source files.

    • raise_on_multiple (bool ) – Raise an error if multiple versions are found

    • dataset_project (str ) – The project of the dataset of the version to retrieve.

  • Return type

    DatasetVersion

  • Returns

    DatasetVersion object representing the selected version.
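
Because dataset_id and dataset_name are mutually exclusive, it can help to validate the selectors before making the call. The helper below is a sketch under that assumption: version_selector is a hypothetical name, not part of the SDK, and it only builds the keyword arguments. The actual call is left as a comment because it needs the allegroai package and a configured server.

```python
from typing import Dict, Optional


def version_selector(dataset_id: Optional[str] = None,
                     dataset_name: Optional[str] = None,
                     version_id: Optional[str] = None,
                     version_name: Optional[str] = None) -> Dict[str, str]:
    """Build keyword arguments for get_version, rejecting a conflicting dataset selector early."""
    if dataset_id is not None and dataset_name is not None:
        # The documented call raises UsageError here; this local check uses ValueError.
        raise ValueError("pass either dataset_id or dataset_name, not both")
    candidates = {
        "dataset_id": dataset_id,
        "dataset_name": dataset_name,
        "version_id": version_id,
        "version_name": version_name,
    }
    # Drop unset selectors so the defaults of get_version apply.
    return {key: value for key, value in candidates.items() if value is not None}


kwargs = version_selector(dataset_name="my-dataset", version_name="v1")
# version = DatasetVersion.get_version(**kwargs)
```

Omitting both version_id and version_name leaves the version unselected, in which case the current version of the dataset is returned, as noted above.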


DatasetVersion.get_single_frame

classmethod get_single_frame(frame_id, dataset_id=None, dataset_name=None, version_id=None, version_name=None, dataset_project=None)

Return a SingleFrame / FrameGroup object with the requested frame_id (UUID) from a specific dataset version

  • Parameters

    • frame_id (str ) – The UUID of the requested frame

    • dataset_id (str ) – The ID of the dataset of the version to retrieve.

    • dataset_name (str ) – The name of the dataset of the version to retrieve.

    • version_id (str ) – The ID of the version to retrieve.

    • version_name (str ) – The name of the version to retrieve.

    • dataset_project (str ) – The project of the dataset of the version to retrieve.

  • Return type

    Union[FrameGroup, SingleFrame]

  • Returns

    SingleFrame / FrameGroup object representing the requested frame

info

dataset_id and dataset_name are mutually exclusive. Setting both to non-None values will raise a UsageError exception.


DatasetVersion.get_frames_by_source

classmethod get_frames_by_source(source_uri, dataset_id=None, dataset_name=None, version_id=None, version_name=None, dataset_project=None)

Return a list of SingleFrame / FrameGroup objects with the requested source_uri pattern from a specific dataset version

  • Parameters

    • source_uri (str ) – Source uri match pattern. Examples: ‘/home/folder/’ or ‘/folder/’ or ‘https://domain.com/folder/’ or ‘s3://bucket/folder/*’ etc.

    • dataset_id (str ) – The ID of the dataset of the version to retrieve.

    • dataset_name (str ) – The name of the dataset of the version to retrieve.

    • version_id (str ) – The ID of the version to retrieve.

    • version_name (str ) – The name of the version to retrieve.

    • dataset_project (str ) – The project of the dataset of the version to retrieve.

  • Return type

    List[Union[SingleFrame, FrameGroup]]

  • Returns

    A list of SingleFrame / FrameGroup objects representing the frames that match the requested source_uri pattern

info

dataset_id and dataset_name are mutually exclusive. Setting both to non-None values will raise a UsageError exception.


DatasetVersion.get_frames_by_ids

classmethod get_frames_by_ids(frame_ids, projection=None, dataset_id=None, dataset_name=None, version_id=None, version_name=None, dataset_project=None)

Return a list of SingleFrame / FrameGroup objects with the requested frame IDs from a specific dataset version

info

Calling DatasetVersion.get_frames_by_ids as a class method is deprecated starting with version 3.8 and will be removed by Q4 2023. Use the instance method call dataset_version.get_frames_by_ids instead.

  • Parameters

    • frame_ids (Collection[str]) – A collection of frame ID strings.

    • projection (Optional[Collection[str]]) – Used to select which parts of the frame will be returned. Each string represents a field or sub-field (using dot-separated notation). To specify a specific array element, use the array index as a field name. To specify all array elements, use ‘*’. To see the fields supported for projection, see the schema at backend_api.services.frames.Frame. If this argument is set, the returned values are dictionaries representing each frame.

      For example:

      dataset_version.get_frames_by_ids(frame_ids, projection=['id', 'dataset.id', 'sources'])
      # will return a list of dictionaries with the following fields:
      # {
      #     'id': '514504adbb6a91620eefa3e21ecfcc31',
      #     'dataset': {
      #         'id': 'df3638ec95454589bf86ba97f344f697'
      #     },
      #     'sources': [
      #         {
      #             'id': 'Frame',
      #             'uri': 'https://clearml-public.s3.amazonaws.com/datasets/food_dataset/pizza/3724187.jpg',
      #             'timestamp': 0,
      #             'preview': {
      #                 'uri': 'https://clearml-public.s3.amazonaws.com/datasets/food_dataset/pizza/3724187.jpg',
      #                 'timestamp': 0
      #             }
      #         }
      #     ]
      # }
    • dataset_id (str ) – The ID of the dataset of the version to retrieve.

    • dataset_name (str ) – The name of the dataset of the version to retrieve.

    • version_id (str ) – The ID of the version to retrieve.

    • version_name (str ) – The name of the version to retrieve.

    • dataset_project (str ) – The project of the dataset of the version to retrieve.

info

When calling this method as an instance method, dataset_id, dataset_name, dataset_project, version_id, and version_name are not required.

  • Return type

    Union[List[Union[SingleFrame, FrameGroup]], List[dict]]

  • Returns

    A list of SingleFrame / FrameGroup objects or a list of dicts representing the requested frames.

info

dataset_id and dataset_name are mutually exclusive. Setting both to non-None values will raise a UsageError exception.


DatasetVersion.create_snapshot

classmethod create_snapshot(version_name=None, version_id=None, dataset_name=None, dataset_id=None, publish_name=None, publish_comment=None, publish_metadata=None, child_name=None, child_comment=None, child_metadata=None, dataset_project=None)

Publishes the specified version and creates a draft child version

  • Parameters

    • version_name (str ) – The name of the draft version for the snapshot.

    • version_id (str ) – The ID of the draft version for the snapshot.

    • dataset_name (str ) – The name of the dataset.

    • dataset_id (str ) – The ID of the dataset to create the version in.

    • publish_name (str ) – New name for the published version. The default value is ‘snapshot <date-time>’.

    • publish_comment (str ) – New comment for the published version. The default value is ‘published at <date-time> by <user>’.

    • publish_metadata (dict ) – User-specified metadata object for the published version. Keys must not include ‘$’ or ‘.’.

    • child_name (str ) – Name for the child version. If not provided, the name of the parent version is used.

    • child_comment (str ) – Comment for the child version.

    • child_metadata (dict ) – User-specified metadata object for the child version. Keys must not include ‘$’ or ‘.’.

    • dataset_project (str ) – The project of the dataset

  • Return type

    DatasetVersion

  • Returns

    DatasetVersion object representing the new draft child version.

info

If no version_name/id is provided, the current version of the dataset is used as the snapshot version.

info

dataset_id and dataset_name are mutually exclusive. Setting both to non-None values will raise a UsageError exception.

info

version_id and version_name are mutually exclusive. Setting both to non-None values will raise a UsageError exception.


DatasetVersion.create_version

classmethod create_version(version_name, description=None, dataset_id=None, dataset_name=None, parent_version_ids=None, parent_version_names=None, raise_if_exists=False, auto_upload_destination=None, local_dataset_root_path=None, dataset_project=None)

Create a new version in a dataset with a specific name.

If a version by that name already exists and is in draft mode (i.e. writable), return that one, unless raise_if_exists is True, in which case a ValueError is raised.

  • Parameters

    • version_name (str ) – The name of the new version.

    • description (str ) – Description of the new dataset version

    • dataset_id (str ) – The ID of the dataset to create the version in.

    • dataset_name (str ) – The name of the dataset to create the version in.

    • parent_version_ids (list ) – A list of the new version’s parent IDs. All IDs must belong to existing versions in this dataset. Currently only a single parent per version is supported; this is a list for future compatibility.

    • parent_version_names (list ) – A list of the new version’s parent names. All names must belong to existing versions in this dataset. Currently only a single parent per version is supported; this is a list for future compatibility.

    • raise_if_exists (bool ) – If True and a version named version_name already exists, raise ValueError. If False and a version by that name already exists, return it.

    • auto_upload_destination (str ) – If specified, any local file linked by a SingleFrame/FrameGroup will be automatically uploaded to the destination storage.

    • local_dataset_root_path (Optional[Union[str, pathlib2.Path]]) – Required if auto_upload_destination is provided. It should point to the common folder for all local source files.

    • dataset_project (str ) – The project of the dataset to create the version in.

  • Return type

    DatasetVersion

  • Returns

    New DatasetVersion object representing the new version.

info

dataset_id and dataset_name are mutually exclusive. Setting both to non-None values will raise a UsageError exception.


DatasetVersion.get_versions

classmethod get_versions(dataset_name=None, dataset_id=None, only_published=False, only_draft=False, dataset_project=None)

Return a list of all versions in a dataset.

  • Parameters

    • dataset_name (str ) – The name of the dataset. If several datasets with this name exist, select an arbitrary one.

    • dataset_id (str ) – The ID of the dataset to list.

    • only_published (bool ) – If True, return only published versions. If False, return all versions.

    • only_draft (bool ) – If True, return only draft (write enabled) versions. If False, return all versions.

    • dataset_project (Optional[str]) – The project of the dataset to list

  • Return type

    List[DatasetVersion]

  • Returns

    A list of DatasetVersion, one for each version of the dataset. Versions are sorted by update time, from latest updated ([0]) to oldest
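
Since the returned list is ordered from the most recently updated version, picking the latest draft is a matter of taking element [0]. The following sketch assumes exactly that ordering. latest_draft_version and the stub class are hypothetical names used so the example runs without a server; in real use the DatasetVersion class itself would be passed in.

```python
from typing import Any, Optional


def latest_draft_version(dataset_version_cls: Any, dataset_name: str) -> Optional[Any]:
    """Return the most recently updated draft version, or None if there is none."""
    versions = dataset_version_cls.get_versions(dataset_name=dataset_name, only_draft=True)
    return versions[0] if versions else None


class _StubDatasetVersion:
    """Hypothetical stand-in exposing the documented get_versions signature."""

    @classmethod
    def get_versions(cls, dataset_name=None, dataset_id=None, only_published=False,
                     only_draft=False, dataset_project=None):
        # Pretend the server returned two drafts, latest updated first.
        return ["draft-updated-today", "draft-updated-last-week"] if only_draft else []


latest = latest_draft_version(_StubDatasetVersion, "my-dataset")
```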


DatasetVersion.get_datasets

classmethod get_datasets(tags=None)

Return a list of all the datasets in the system, sorted by creation time.

  • Parameters

    tags (list ) – Filter based on the requested list of tags (strings). To exclude a tag add “-” prefix to the tag. Example: ["best", "-debug"]. The default behaviour is to join all tags with a logical “OR” operator. To join all tags with a logical “AND” operator instead, use “__$all” as the first string, for example:

    ["__$all", "best", "experiment", "ever"]

    To join all tags with AND, but exclude a tag use “__$not” before the excluded tag, for example:

    ["__$all", "best", "experiment", "ever", "__$not", "internal", "__$not", "test"]

    The “OR” and “AND” operators apply to all tags that follow them until another operator is specified. The NOT operator applies only to the immediately following tag. For example:

    ["__$all", "a", "b", "c", "__$or", "d", "__$not", "e", "__$and", "__$or", "f", "g"]

    This example means (“a” AND “b” AND “c” AND (“d” OR NOT “e”) AND (“f” OR “g”)). See https://clear.ml/docs/latest/docs/clearml_sdk/task_sdk/#tag-filters for more information.

  • Return type

    List[Dataset]

  • Returns

    A list of datasets.Dataset, one for each dataset. Datasets are sorted by creation time, from the oldest to the newest
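
The tag filter lists described above can be assembled programmatically. The helper below is a minimal sketch covering only the two simplest forms (OR with “-” exclusions, and “__$all” with “__$not” exclusions). build_tag_filter is a hypothetical name and not part of the SDK.

```python
from typing import Iterable, List


def build_tag_filter(include: Iterable[str],
                     exclude: Iterable[str] = (),
                     match_all: bool = False) -> List[str]:
    """Assemble a tags list in the documented filter syntax."""
    tags: List[str] = ["__$all"] if match_all else []
    tags.extend(include)
    for tag in exclude:
        if match_all:
            # In AND mode, each excluded tag is preceded by the NOT operator.
            tags.extend(["__$not", tag])
        else:
            # In the default OR mode, exclusion uses the "-" prefix.
            tags.append("-" + tag)
    return tags


any_of = build_tag_filter(["best"], exclude=["debug"])
all_of = build_tag_filter(["best", "experiment", "ever"],
                          exclude=["internal", "test"], match_all=True)
# datasets = DatasetVersion.get_datasets(tags=all_of)
```

The two results reproduce the example lists given in the parameter description.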


get_iterator

get_iterator(projection=None)

Get an iterator for this version.

  • Parameters

    projection (Optional [ Sequence [ str ] ] ) – Used to select which parts of the frame will be returned. Each string represents a field or sub-field (using dot-separated notation). In order to specify a specific array element, use array index as a field name. To specify all array elements, use ‘*’. If this argument is set, the values the iterator returns are dictionaries representing each frame

    For example:

    version.get_iterator(projection=['id', 'dataset.id', 'sources'])
    # will return an iterator that yields dictionaries with the following fields:
    # {
    #     'id': '514504adbb6a91620eefa3e21ecfcc31',
    #     'dataset': {
    #         'id': 'df3638ec95454589bf86ba97f344f697'
    #     },
    #     'sources': [
    #         {
    #             'id': 'Frame',
    #             'uri': 'https://clearml-public.s3.amazonaws.com/datasets/food_dataset/pizza/3724187.jpg',
    #             'timestamp': 0,
    #             'preview': {
    #                 'uri': 'https://clearml-public.s3.amazonaws.com/datasets/food_dataset/pizza/3724187.jpg',
    #                 'timestamp': 0
    #             }
    #         }
    #     ]
    # }
  • Return type

    Generator[Union[“DatasetVersion.Frame”, dict]]

  • Returns

    An iterator on all the version’s frames.
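
To make the dot-separated projection notation concrete, the sketch below resolves a single projection path against a frame dictionary shaped like the example above. It is a local illustration only: pick is a hypothetical helper, it returns the selected value directly rather than the nested dictionaries the iterator yields, and it does not reflect how the server evaluates projections.

```python
from typing import Any, List


def _walk(value: Any, parts: List[str]) -> Any:
    if not parts:
        return value
    head, rest = parts[0], parts[1:]
    if isinstance(value, list):
        if head == "*":
            # '*' selects every array element.
            return [_walk(item, rest) for item in value]
        # A numeric field name selects one array element.
        return _walk(value[int(head)], rest)
    return _walk(value[head], rest)


def pick(frame: dict, path: str) -> Any:
    """Resolve one dot-separated projection path against a frame dictionary."""
    return _walk(frame, path.split("."))


frame = {
    "id": "514504adbb6a91620eefa3e21ecfcc31",
    "dataset": {"id": "df3638ec95454589bf86ba97f344f697"},
    "sources": [
        {
            "id": "Frame",
            "uri": "https://clearml-public.s3.amazonaws.com/datasets/food_dataset/pizza/3724187.jpg",
            "timestamp": 0,
        }
    ],
}

dataset_id = pick(frame, "dataset.id")
all_uris = pick(frame, "sources.*.uri")
first_timestamp = pick(frame, "sources.0.timestamp")
```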


add_frames

add_frames(frames, warn_on_duplicate_frames=False, batch_size=1000, refresh_version_stats=True, auto_upload_destination=None, local_dataset_root_path=None, force_upload=False, progress_report=1, register_on_upload_failure=False, upload_retries=5, src_to_dst_mapping=None, unregister_on_upload_fail=True)

Add frames to this DatasetVersion. NOTICE! If frames already contain frame.id field, they will update (overwrite) existing frames. If not provided, frame.id is generated based on the source URI. If a local file should be uploaded but has already been previously uploaded, the existing URI for that file will be reused, otherwise the file will be uploaded.

info

Only available if version is still in draft (writable) mode

  • Parameters

    • frames (list ) – A list of new frames to save.

    • warn_on_duplicate_frames (bool ) – If True, issue a warning when adding a frame with an ID that was previously added to this instance (default False)

    • batch_size (int) – Number of frames in a single add request (default: 1000), batch_size affects the speed of the upload, versus reliability. It does not limit the number of frames per call and in most cases there is no need to change it.

    • refresh_version_stats (Optional[bool]) – Automatically call commit_version after adding frames to refresh this version’s statistics.

    • auto_upload_destination (Optional[str]) – If specified, any local file linked by a SingleFrame/FrameGroup will be automatically uploaded to the destination storage. Examples: ‘s3://bucket/datasets/’, ‘gs://bucket/dataset’, ‘azure://bucket/dataset’, ‘http://clearml-server/bucket/dataset’

      Notes:

      1. The uploaded files will keep the same structure inside the destination storage under dataset_id/version_name.version_id/ folders
      2. If a file content hash is already registered, it will automatically link to the existing remote file instead of re-uploading the local copy
      3. Inside the dataset/version folder the files are stored in the same path as on the local storage, relative to the provided local_dataset_root_path
    • local_dataset_root_path (Union[str, Path, None]) – Required if auto_upload_destination is provided. Mutually exclusive with src_to_dst_mapping. It should point to the common folder for all local source files. This root folder is used to detect the relative path of each source file to be uploaded to the remote storage. Example: auto_upload_destination='s3://bucket/datasets/', local_dataset_root_path='/home/user/data/' will make sure a file ‘/home/user/data/images/01/1.jpg’ will be uploaded to:

      ‘s3://bucket/datasets/dataset_id/version_id/images/01/1.jpg’
    • force_upload (Optional[bool]) – If True and auto_upload_destination is provided, will force to upload the frames

    • progress_report (Optional[int]) – Report progress every progress_report frames uploaded/registered, at batch_size granularity (default: report every batch).

    • register_on_upload_failure (bool) – If True, register the frames even when they fail uploading

    • upload_retries (int) – The number of times the upload of a frame should be retried in case of failure, before marking the frame as failed on upload and continuing to upload the other frames

    • src_to_dst_mapping (Optional[Dict[str, str]]) – A dictionary mapping the source of the frames to the upload destination. Each source found in the dictionary will be uploaded to the corresponding destination. Mutually exclusive with auto_upload_destination

    • unregister_on_upload_fail (bool) – A boolean that controls whether to delete frames that failed to be uploaded.

  • Return type

    List[Dict]

  • Returns

    A list containing the frames that failed to upload or register. Each entry in the list is a dictionary with the following key-value pairs:

    - ‘frame’ - the frame that failed to be added
    - ‘error’ - a string that describes the error
    - ‘error_type’ - can be ‘upload’, ‘validation’ or ‘register’. Indicates where the error occurred
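
The returned failure list lends itself to a quick summary by error_type. The sketch below works on plain dictionaries with the three documented keys. group_failures is a hypothetical helper and the sample entries are invented for illustration, not real output.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_failures(failures: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group error messages by the stage at which each frame failed."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in failures:
        grouped[entry["error_type"]].append(entry["error"])
    return dict(grouped)


# Invented sample data shaped like the documented return value of add_frames.
sample_failures = [
    {"frame": "frame-a", "error": "connection reset", "error_type": "upload"},
    {"frame": "frame-b", "error": "missing source uri", "error_type": "validation"},
    {"frame": "frame-c", "error": "timeout", "error_type": "upload"},
]

summary = group_failures(sample_failures)
```

In real use the input would be the list returned by add_frames; an empty list means every frame was added.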

update_frames

update_frames(frames, batch_size=1000, refresh_version_stats=True, without_fields=None)

Update existing frames in this DatasetVersion.

Find each frame by its ID, and change its properties to match that of the frame object passed in frames. Frames exist in a version if they were previously added (e.g. by update_frames), or if they exist in a parent version. If the frame object does not have an ID, create a new frame.

info

Only available if version is still in draft (writable) mode

  • Parameters

    • frames (list ) – A list of frames to update.

    • batch_size (int ) – Number of frames in a single update request (default: 1000). batch_size affects the speed of the upload versus reliability. It does not limit the number of frames per call, and in most cases there is no need to change it.

    • refresh_version_stats (Optional[bool]) – Automatically call commit_version after updating frames to refresh this version’s statistics.

    • without_fields (Optional[List[str]]) – A list of fields to filter out of the frame object, when sending the update call. These fields correspond to the fields in allegroai.backend_api.services.datasets.Frame. When this list is provided, the call will generate an update operation, otherwise an add operation will be used (see add_frames). Use a non-None value (such as [] or False) in this parameter to specify an update operation without providing any fields.

      info

      When using an update operation, removed frame fields are ignored (e.g. update cannot be used to remove a field from the meta structure).

      For example, to avoid sending the metadata:

      dataset_version.update_frames(frames, without_fields=["meta"])
  • Return type

    None
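
The note that an update cannot remove fields follows from merge semantics. The toy function below illustrates the idea on plain dictionaries. It assumes a recursive merge, which is a guess at the server-side behavior, and merge_update is a hypothetical name, not an SDK function.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_update(existing: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Merge update into existing; keys absent from update are kept."""
    merged = deepcopy(existing)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_update(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


stored = {"id": "f1", "meta": {"weather": "rain", "reviewed": False}}
incoming = {"id": "f1", "meta": {"reviewed": True}}  # 'weather' left out on purpose

result = merge_update(stored, incoming)
```

Even though the incoming frame omits meta.weather, the merged result still carries it, which is why dropping a field requires the add (overwrite) operation instead.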


delete_frames

delete_frames(frames, batch_size=1000, refresh_version_stats=True, delete_sources=False)

Delete frames from this DatasetVersion.

Frames may be represented by an ID string or a DatasetVersion.Frame object. Frames are deleted by their IDs; all other frame attributes (if any) are ignored.

info

Only available if version is still in draft (writable) mode.

  • Parameters

    • frames (Sequence[Union[FrameGroup, SingleFrame, dict, ForwardRef]]) – A list of frame objects or frame IDs (strings).

    • batch_size (int ) – Number of frame IDs in a single delete request (default: 1000). batch_size affects the speed of the operation versus reliability. It does not limit the number of frames per call, and in most cases there is no need to change it.

    • refresh_version_stats (Optional[bool]) – Automatically call commit_version after deleting frames to refresh this version’s statistics.

    • delete_sources (bool) – Delete sources associated with the deleted frames in the dataset. Supported source locations are: s3, gs and azure. In case a connection cannot be established with the cloud provider or a source deletion failed, the operation will abort.

  • Return type

    None
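
The batch_size parameter only controls how a call is split into requests, not how many frames it accepts. A minimal sketch of that splitting, assuming simple consecutive slicing (batches is a hypothetical helper, not the SDK's code):

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


frame_ids = [f"frame-{i}" for i in range(2500)]
request_sizes = [len(batch) for batch in batches(frame_ids)]
```

Under this assumption, 2500 frame IDs with the default batch size are sent as three requests.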


get_bulk_context

get_bulk_context(flush_threshold=None, log=None, refresh_version_stats=True, auto_upload_destination=None, local_dataset_root_path=None, allow_update=False)

Get a context manager for bulk updates to this version.

The bulk context allows adding, editing, and removing frames on this version in bulk instead of one by one.

info

There can only be one BulkContext per DatasetVersion. A second call to get_bulk_context will return the same object.

info

Only available if version is still in draft (writable) mode.

  • Parameters

    • flush_threshold (int ) – Commit the updates to the frames every flush_threshold updates. An update is a call to one of BulkContext.add_frame, BulkContext.update_frame, or BulkContext.delete_frame.

    • log (Optional[Logger]) – Logger object for the context to log to. Defaults to the datasetversion module logger.

    • refresh_version_stats (Optional[bool]) – Automatically call commit_version to refresh this version’s statistics.

    • auto_upload_destination (Optional[str]) – If specified, any local file linked by a SingleFrame/FrameGroup will be automatically uploaded to the destination storage. Examples: ‘s3://bucket/datasets/’, ‘gs://bucket/dataset’, ‘azure://bucket/dataset’, ‘http://clearml-server/bucket/dataset’

      Notes:

      1. The uploaded files will keep the same structure inside the destination storage under dataset_id/version_name.version_id/ folders
      2. If a file content hash is already registered, it will automatically link to the existing remote file instead of re-uploading the local copy
      3. Inside the dataset/version folder the files are stored in the same path as on the local storage, relative to the provided local_dataset_root_path
    • local_dataset_root_path (Union[str, Path, None]) – Required if auto_upload_destination is provided. It should point to the common folder for all local source files. This root folder is used to detect the relative path of each source file to be uploaded to the remote storage. Example: auto_upload_destination='s3://bucket/datasets/', local_dataset_root_path='/home/user/data/' will make sure a file ‘/home/user/data/images/01/1.jpg’ will be uploaded to: ‘s3://bucket/datasets/dataset_id/version_id/images/01/1.jpg’

    • allow_update (bool) – If False (default), all frame operations use the “add” action and “update” is not used (i.e. even frames collected using the BulkContext.update_frame() call are added, not updated). This is an advanced setting; change it only if you understand the limitations of using update. Note that when using update, the provided frame data is merged with the existing indexed frame data, which means frame fields cannot be removed with the update operation.

  • Return type

    BulkContext

  • Returns

    A bulk update context manager for this DatasetVersion
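
The relationship between local_dataset_root_path and the uploaded location can be worked through by hand. The sketch below reproduces the path from the parameter example, assuming the dataset_id/version_id folder layout shown there. The notes also mention a version_name.version_id folder, so treat the exact folder names as an assumption. remote_uri is a hypothetical helper and performs no upload.

```python
from pathlib import PurePosixPath


def remote_uri(local_file: str, local_dataset_root_path: str,
               auto_upload_destination: str, dataset_id: str, version_id: str) -> str:
    """Compute where a local source file would land under the destination."""
    # relative_to raises ValueError if the file is not under the root folder.
    relative = PurePosixPath(local_file).relative_to(PurePosixPath(local_dataset_root_path))
    base = auto_upload_destination.rstrip("/")
    return f"{base}/{dataset_id}/{version_id}/{relative.as_posix()}"


uri = remote_uri(
    "/home/user/data/images/01/1.jpg",
    "/home/user/data/",
    "s3://bucket/datasets/",
    "dataset_id",
    "version_id",
)
```

A file outside the root folder has no relative path, which is why the root must be a common parent of all local source files.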


flush

flush(refresh_version_stats=True)

Send any outstanding version changes.

If a BulkContext was obtained by get_bulk_context, any updates made using it are sent to the server. If not, this is a no-op.

  • Parameters

    refresh_version_stats (Optional[bool]) – Automatically call commit_version to refresh this version’s statistics.

  • Return type

    None


commit_version

commit_version(**kwargs)

Commit this draft DatasetVersion, with all the changes made so far.

Committing a version merges the changes made to it with the parent version. Further changes to the version are still possible. This step is required before publishing the version.

danger

This is a blocking method and may take time to finish.

  • Return type

    CallResult

  • Parameters

    kwargs (Any ) –


publish_version

publish_version()

Publish this DatasetVersion.

After publishing a version it is no longer a draft version and no further changes are allowed for this version.

  • Return type

    bool

  • Returns

    True if successful, False otherwise.
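
Since committing is required before publishing, the two calls are naturally paired. The helper below is a sketch that relies only on the two method names documented here. commit_and_publish and the stub class are hypothetical, and the stub exists so the example runs without a server.

```python
from typing import Any, List


def commit_and_publish(version: Any) -> bool:
    """Commit a draft version, then publish it; returns the publish result."""
    version.commit_version()          # blocking; may take time on large versions
    return bool(version.publish_version())


class _StubVersion:
    """Hypothetical stand-in that records the order of calls."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def commit_version(self, **kwargs: Any) -> None:
        self.calls.append("commit_version")

    def publish_version(self) -> bool:
        self.calls.append("publish_version")
        return True


stub = _StubVersion()
published = commit_and_publish(stub)
```

After a successful publish the version is locked, so any remaining frame changes should be flushed before calling a helper like this.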


get_stats

get_stats()

Returns this version’s statistics

  • Return type

    Statistics


get_parent

get_parent()

Returns the ID of this version’s parent version

  • Return type

    str


get_metadata

get_metadata()

  • Return type

    dict

  • Returns

    The metadata (dict) of user-defined values stored for this Dataset Version


set_metadata

set_metadata(metadata)

Store metadata (dict) of user-defined values for this Dataset Version

  • Parameters

    metadata (dict ) – key/value dictionary (with support for nested dictionaries)

  • Return type

    bool

  • Returns

    True if successful (locked/published versions cannot change version metadata)


set_masks_labels

set_masks_labels(mask_value_label_mapping)

Store a global (dataset version wide) lookup for per pixel mask values to labels. For example:

{
    (0, 0, 0): ["background"],
    (1, 1, 1): ["person", "sitting"],
    (2, 2, 2): ["cat"],
}

Pixel masks label lookup is stored as a property on the dataset version metadata. Specifically: dataset.get_metadata()['mask_labels'] = {…}

  • Parameters

    mask_value_label_mapping (dict ) – key/value dictionary. Key is a tuple of integers, and value is a list/tuple of strings

  • Return type

    bool

  • Returns

    True if successful (locked/published versions cannot change version metadata)


get_masks_labels

get_masks_labels()

Get the global (dataset version wide) lookup for per pixel mask values to labels. For example:

{
    (0, 0, 0): ["background"],
    (1, 1, 1): ["person", "sitting"],
    (2, 2, 2): ["cat"],
}

Pixel masks label lookup is stored as a property on the dataset version metadata. Specifically: dataset.get_metadata()['mask_labels'] = {…}

  • Return type

    Dict[tuple, tuple]

  • Returns

    key/value dictionary. key is a tuple of integers, and value is a list/tuple of strings
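
Once retrieved, the lookup can be used to translate a pixel value into its labels. The sketch below operates on a plain dictionary shaped like the example above. labels_for_pixel is a hypothetical helper, and returning an empty tuple for unknown pixel values is a choice made here, not documented behavior.

```python
from typing import Dict, Sequence, Tuple

MaskLabels = Dict[Tuple[int, ...], Sequence[str]]


def labels_for_pixel(mask_labels: MaskLabels, pixel: Sequence[int]) -> Tuple[str, ...]:
    """Return the labels mapped to a pixel value, or an empty tuple if unmapped."""
    # Keys are tuples, so a list or array row is converted before the lookup.
    return tuple(mask_labels.get(tuple(pixel), ()))


mask_labels: MaskLabels = {
    (0, 0, 0): ["background"],
    (1, 1, 1): ["person", "sitting"],
    (2, 2, 2): ["cat"],
}

person = labels_for_pixel(mask_labels, (1, 1, 1))
unknown = labels_for_pixel(mask_labels, [9, 9, 9])
```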