
DatasetVersion

class datasetversion.DatasetVersion()#

DatasetVersion represents a specific version in a dataset.

warning

Do not instantiate directly. Use DatasetVersion.get_version method instead.


BulkContext#

class BulkContext

A context manager for modifying frames (i.e., adding, updating, or removing them) in bulk.

Use DatasetVersion.get_bulk_context to obtain.

The bulk context lets you modify the version by adding, updating, or deleting frames one at a time, while the actual update request is sent in bulk. The update request (flush) happens every flush_threshold updates (see DatasetVersion.get_bulk_context), or upon __exit__.

Create a bulk context that flushes frames automatically.

  • Parameters

    • dv (DatasetVersion) – The DatasetVersion object to use.

    • flush_threshold (Optional[int]) – If provided, flush every flush_threshold frames.

    • log (Optional[Logger]) – An optional external logger.

    • refresh_version_stats (Optional[bool]) – Automatically refresh version statistics (default: True).

    • auto_upload_destination (Optional[str]) – If specified, any local file linked by a SingleFrame/FrameGroup is automatically uploaded to the destination storage.

    • local_dataset_root_path (Union[str, Path, None]) – Required if auto_upload_destination is provided. It should point to the common folder for all local source files.
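
The buffering behavior described above (flush every flush_threshold updates, and once more on __exit__) can be pictured with a small stand-in class. This is only an illustrative sketch: `BufferedUpdates` and its `send` callback are hypothetical names, not part of the library, and the real BulkContext may differ in detail.

```python
class BufferedUpdates:
    """Hypothetical stand-in that mimics flush-on-threshold and flush-on-exit."""

    def __init__(self, send, flush_threshold=None):
        self._send = send                      # callable receiving a list of pending updates
        self._flush_threshold = flush_threshold
        self._pending = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Assumption of this sketch: leaving the context always flushes.
        self.flush()
        return False                           # never swallow exceptions

    def add(self, update):
        self._pending.append(update)
        if self._flush_threshold and len(self._pending) >= self._flush_threshold:
            self.flush()

    def flush(self):
        if self._pending:
            self._send(list(self._pending))    # one bulk request per flush
            self._pending.clear()


sent_batches = []
with BufferedUpdates(sent_batches.append, flush_threshold=3) as ctx:
    for i in range(7):
        ctx.add(("add", "frame-%d" % i))
```

With seven updates and a threshold of three, the stand-in sends two full batches inside the loop and the single remaining update when the `with` block exits.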


add_frame#

BulkContext.add_frame(frame, warn_on_duplicate_frames=False)

Note: If the frame already has a frame.id field, it updates (overwrites) the existing frame with that ID. If no frame with that ID exists in the dataset version, the frame is added.

info

Only available if version is still in draft (writable) mode

  • Parameters

    • frame (DatasetVersion.Frame ) – The frame to add to the version.

    • warn_on_duplicate_frames (Optional[bool]) – If True, issue a warning when adding a frame with an ID that was previously added to this instance (default False)

  • Return type

    None
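
The overwrite-or-add rule for frame IDs can be modeled with a plain dictionary keyed by ID. The sketch below is not the library's implementation: `upsert_frame` is a hypothetical helper, frames are represented as plain dicts, and assigning a fresh ID to a frame that has none is an assumption.

```python
import uuid


def upsert_frame(store, frame):
    """Insert or overwrite `frame` (a dict) in `store`, a dict keyed by frame ID."""
    frame_id = frame.get("id")
    if frame_id is None:
        # Assumption: a frame without an ID receives a newly generated one.
        frame_id = uuid.uuid4().hex
        frame = dict(frame, id=frame_id)
    store[frame_id] = frame  # overwrites an existing entry, or adds a new one
    return frame_id


store = {"a1": {"id": "a1", "source": "old.jpg"}}
upsert_frame(store, {"id": "a1", "source": "new.jpg"})  # existing ID: overwritten
upsert_frame(store, {"id": "b2", "source": "b.jpg"})    # unknown ID: added
```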


delete_frame#

BulkContext.delete_frame(frame)

Delete a frame from the current DatasetVersion.

The frame may be given as an ID string or as a DatasetVersion.Frame object. Frames are deleted by ID; all other frame attributes (if any) are ignored.

info

Only available if version is still in draft (writable) mode.

  • Parameters

    frame (Union[FrameGroup, SingleFrame, str, ForwardRef]) – The frame to delete (frame object or ID string)

  • Return type

    None


flush#

BulkContext.flush()

Send any outstanding version changes.

Any updates made using this BulkContext are sent to the server.

  • Return type

    None


update_frame#

BulkContext.update_frame(frame)

Update an existing frame in the current DatasetVersion.

Find the frame by its ID and change its properties to match those of the frame object passed in frame. A frame exists in a version if it was previously added (e.g., by update_frame) or if it exists in a parent version. If the frame object does not have an ID, a new frame is created.

info

Only available if version is still in draft (writable) mode.

  • Parameters

    frame (DatasetVersion.Frame ) – The frame to update.

  • Return type

    None


version_id#

property version_id

Version id string of this specific dataset/version

  • Return type

    str


version_name#

property version_name

Dataset version name, not necessarily unique

  • Return type

    str


dataset_id#

property dataset_id

Dataset id string of this specific dataset

  • Return type

    str


dataset_name#

property dataset_name

Dataset name, must be a unique name

  • Return type

    str


draft#

property draft

Draft flag of the dataset/version, i.e. is this version still writable or is it locked and cannot be changed.

  • Return type

    bool


last_updated#

property last_updated

Return the timestamp of the last updated frame in the dataset version

  • Return type

    datetime


comment#

property comment

Return the string comment of the specific Dataset Version

  • Return type

    str


DatasetVersion.create_new_dataset#

classmethod create_new_dataset(dataset_name=None, description=None, tags=None, raise_if_exists=False)

Create a new dataset in the system and return a Dataset object for it.

  • Parameters

    • dataset_name (str ) – The name of the new dataset.

    • description (str ) – A free text to describe the dataset.

    • tags (list ) – A list of tags (short strings) to classify the dataset.

    • raise_if_exists (bool ) – If False (the default) and a dataset named dataset_name already exists, return the existing Dataset. If True and a dataset named dataset_name already exists, raise a ValueError exception.

  • Return type

    Dataset

  • Returns

    A new Dataset object for the newly created dataset.


DatasetVersion.get_current#

classmethod get_current(dataset_id=None, dataset_name=None, auto_upload_destination=None, local_dataset_root_path=None)

Return a DatasetVersion object for the current write-enabled version of the dataset

  • Parameters

    • dataset_id (str ) – The id of the dataset of the version to retrieve.

    • dataset_name (str ) – The name of the dataset of the version to retrieve.

    • auto_upload_destination (str ) – If specified, any local file linked by a SingleFrame/FrameGroup is automatically uploaded to the destination storage.

    • local_dataset_root_path (Optional[Union[str, pathlib2.Path]]) – Required if auto_upload_destination is provided. It should point to the common folder for all local source files.

  • Return type

    DatasetVersion

  • Returns

    DatasetVersion object representing the selected version.

info

dataset_id and dataset_name are mutually exclusive; setting both to non-None values raises a UsageError exception.
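
The mutual-exclusion rule that recurs throughout this class can be sketched as a small argument check. Everything here is illustrative: `resolve_dataset` is a hypothetical helper, and `UsageError` is redefined locally as a stand-in so that the snippet runs without the library installed.

```python
class UsageError(Exception):
    """Local stand-in for the library's UsageError exception."""


def resolve_dataset(dataset_id=None, dataset_name=None):
    """Return which selector to use, rejecting calls that pass both."""
    if dataset_id is not None and dataset_name is not None:
        raise UsageError("dataset_id and dataset_name are mutually exclusive")
    if dataset_id is not None:
        return ("id", dataset_id)
    # The case where both are None is not covered by this sketch.
    return ("name", dataset_name)


selector = resolve_dataset(dataset_name="my-dataset")
```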


DatasetVersion.remove_version#

classmethod remove_version(dataset_id=None, dataset_name=None, version_id=None, version_name=None, force=False)

Remove a dataset’s version from the system.

  • Parameters

    • dataset_id (Optional[str]) – The ID of the dataset that contains the version to remove.

    • dataset_name (Optional[str]) – The name of the dataset that contains the version to remove.

    • version_id (Optional[str]) – The ID of the version to remove.

    • version_name (Optional[str]) – The name of the version to remove.

    • force (bool) – If True, delete the version even if it is published. Default: False.

  • Return type

    None

info

dataset_id and dataset_name are mutually exclusive; setting both to non-None values raises a UsageError exception.

info

version_id and version_name are mutually exclusive; setting both to non-None values raises a UsageError exception.



DatasetVersion.get_version#

classmethod get_version(dataset_id=None, dataset_name=None, version_id=None, version_name=None, auto_upload_destination=None, local_dataset_root_path=None)

Return a DatasetVersion object for a specific version

  • Parameters

    • dataset_id (str ) – The id of the dataset of the version to retrieve.

    • dataset_name (str ) – The name of the dataset of the version to retrieve.

    • version_id (str ) – [optional] The id of the version to retrieve.

    • version_name (str ) – [optional] The name of the version to retrieve.

    • auto_upload_destination (str ) – If specified, any local file linked by a SingleFrame/FrameGroup is automatically uploaded to the destination storage.

    • local_dataset_root_path (Optional[Union[str, pathlib2.Path]]) – Required if auto_upload_destination is provided. It should point to the common folder for all local source files.

  • Return type

    DatasetVersion

  • Returns

    DatasetVersion object representing the selected version.

info

If no version name/id is provided, the current version of the dataset is returned.

info

dataset_id and dataset_name are mutually exclusive; setting both to non-None values raises a UsageError exception.


DatasetVersion.get_single_frame#

classmethod get_single_frame(frame_id, dataset_id=None, dataset_name=None, version_id=None, version_name=None)

Return a SingleFrame / FrameGroup object with the requested frame_id (UUID) from a specific dataset version

  • Parameters

    • frame_id (str ) – The UUID of the requested frame id

    • dataset_id (str ) – The id of the dataset of the version to retrieve.

    • dataset_name (str ) – The name of the dataset of the version to retrieve.

    • version_id (str ) – The id of the version to retrieve.

    • version_name (str ) – The name of the version to retrieve.

  • Return type

    Union[FrameGroup, SingleFrame]

  • Returns

    SingleFrame / FrameGroup object representing the requested frame

info

dataset_id and dataset_name are mutually exclusive; setting both to non-None values raises a UsageError exception.


DatasetVersion.get_frames_by_source#

classmethod get_frames_by_source(source_uri, dataset_id=None, dataset_name=None, version_id=None, version_name=None)

Return a list of SingleFrame / FrameGroup objects with the requested source_uri pattern from a specific dataset version

  • Parameters

    • source_uri (str ) – Source uri match pattern. Examples: ‘/home/folder/’ or ‘/folder/’ or ‘https://domain.com/folder/’ or ‘s3://bucket/folder/*’ etc.

    • dataset_id (str ) – The id of the dataset of the version to retrieve.

    • dataset_name (str ) – The name of the dataset of the version to retrieve.

    • version_id (str ) – The id of the version to retrieve.

    • version_name (str ) – The name of the version to retrieve.

  • Return type

    List[Union[SingleFrame, FrameGroup]]

  • Returns

    list of SingleFrame / FrameGroup object representing the requested frame

info

dataset_id and dataset_name are mutually exclusive; setting both to non-None values raises a UsageError exception.


DatasetVersion.create_snapshot#

classmethod create_snapshot(version_name=None, version_id=None, dataset_name=None, dataset_id=None, publish_name=None, publish_comment=None, publish_metadata=None, child_name=None, child_comment=None, child_metadata=None)

Publish the specified version and create a draft child version.

  • Parameters

    • version_name (Optional[str]) – The name of the draft version for the snapshot.

    • version_id (Optional[str]) – The ID of the draft version for the snapshot.

    • dataset_name (Optional[str]) – The name of the dataset.

    • dataset_id (Optional[str]) – The ID of the dataset to create the version in.

    • publish_name (Optional[str]) – New name for the published version. The default value is ‘snapshot <date-time>’.

    • publish_comment (Optional[str]) – New comment for the published version. The default value is ‘published at <date-time> by <user>’.

    • publish_metadata (dict ) – User-specified metadata object for the published version. Keys must not include ‘$’ or ‘.’.

    • child_name (str ) – Name for the child version. If not provided, the name of the parent version is used.

    • child_comment (str ) – Comment for the child version.

    • child_metadata (dict ) – User-specified metadata object for the child version. Keys must not include ‘$’ or ‘.’.

  • Return type

    DatasetVersion

  • Returns

    DatasetVersion object representing the new draft child version.

info

If no version_name/id is provided, the current version of the dataset is the snapshot version.

info

dataset_id and dataset_name are mutually exclusive; setting both to non-None values raises a UsageError exception.

info

version_id and version_name are mutually exclusive; setting both to non-None values raises a UsageError exception.


DatasetVersion.create_version#

classmethod create_version(version_name, description=None, dataset_id=None, dataset_name=None, parent_version_ids=None, parent_version_names=None, raise_if_exists=False, auto_upload_destination=None, local_dataset_root_path=None)

Create a new version in a dataset with a specific name.

If a version with that name already exists and is in draft mode (i.e., writable), return that version, unless raise_if_exists is True, in which case raise ValueError.

  • Parameters

    • version_name (str ) – The name of the new version.

    • description (str ) – Description of the new dataset version

    • dataset_id (str ) – The ID of the dataset to create the version in.

    • dataset_name (str ) – The name of the dataset to create the version in.

    • parent_version_ids (list ) – A list of the new version's parent IDs. All IDs must belong to existing versions in this dataset. Currently, only a single parent per version is supported; this is a list for future compatibility.

    • parent_version_names (list ) – A list of the new version's parent names. All names must belong to existing versions in this dataset. Currently, only a single parent per version is supported; this is a list for future compatibility.

    • raise_if_exists (bool ) – If True and a version named version_name already exists, raise ValueError. If False and a version with that name already exists, return it.

    • auto_upload_destination (str ) – If specified, any local file linked by a SingleFrame/FrameGroup is automatically uploaded to the destination storage.

    • local_dataset_root_path (Optional[Union[str, pathlib2.Path]]) – Required if auto_upload_destination is provided. It should point to the common folder for all local source files.

  • Return type

    DatasetVersion

  • Returns

    New DatasetVersion object representing the new version.

info

dataset_id and dataset_name are mutually exclusive; setting both to non-None values raises a UsageError exception.


DatasetVersion.get_versions#

classmethod get_versions(dataset_name=None, dataset_id=None, only_published=False, only_draft=False)

Return a list of all versions in a dataset.

  • Parameters

    • dataset_name (str ) – The name of the dataset. If several datasets with this name exist, an arbitrary one is selected.

    • dataset_id (str ) – The ID of the dataset to list.

    • only_published (bool ) – If True, return only published versions. If False, return all versions.

    • only_draft (bool ) – If True, return only draft (write enabled) versions. If False, return all versions.

  • Return type

    List[DatasetVersion]

  • Returns

    A list of DatasetVersion, one for each version of the dataset. Versions are sorted by update time, from latest updated ([0]) to oldest
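
Because the returned list is ordered from the most recently updated version to the oldest, picking the latest writable version is a single scan. The sketch below is a hedged illustration: `latest_draft` is a hypothetical helper, and `SimpleNamespace` objects stand in for DatasetVersion instances, exposing only the documented `draft` and `version_name` properties.

```python
from types import SimpleNamespace


def latest_draft(versions):
    """Return the first draft version in a latest-first list, or None."""
    return next((v for v in versions if v.draft), None)


# Stand-ins for the list returned by get_versions (latest updated first).
versions = [
    SimpleNamespace(version_name="v3", draft=False),
    SimpleNamespace(version_name="v2", draft=True),
    SimpleNamespace(version_name="v1", draft=True),
]
candidate = latest_draft(versions)
```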


DatasetVersion.get_datasets#

classmethod get_datasets()

Return a list of all the datasets in the system, sorted by creation time.

  • Return type

    List[Dataset]

  • Returns

    A list of datasets.Dataset, one for each dataset. Datasets are sorted by created time, from the oldest to the newest


get_iterator#

get_iterator()

Get an iterator for this version.

  • Return type

    Generator[“DatasetVersion.Frame”]

  • Returns

    An iterator on all the version’s frames.


add_frames#

add_frames(frames, warn_on_duplicate_frames=False, batch_size=1000, refresh_version_stats=True, auto_upload_destination=None, local_dataset_root_path=None, force_upload=False, progress_report=1)

Add frames to this DatasetVersion. Note: A frame that already has a frame.id field updates (overwrites) the existing frame with that ID. If no frame with that ID exists in the dataset version, the frame is added.

info

Only available if version is still in draft (writable) mode

  • Parameters

    • frames (list ) – A list of new frames to save.

    • warn_on_duplicate_frames (bool ) – If True, issue a warning when adding a frame with an ID that was previously added to this instance (default False)

    • batch_size (int) – Number of frames in a single add request (default: 1000). batch_size trades upload speed against reliability. It does not limit the number of frames per call, and in most cases there is no need to change it.

    • refresh_version_stats (Optional[bool]) – Automatically call commit_version after adding frames to refresh this version’s statistics.

    • auto_upload_destination (Optional[str]) – If specified, any local file linked by a SingleFrame/FrameGroup is automatically uploaded to the destination storage. Examples: ‘s3://bucket/datasets/’, ‘gs://bucket/dataset’, ‘azure://bucket/dataset’, ‘http://clearml-server/bucket/dataset’. Notes:

      1. The uploaded files keep the same structure inside the destination storage, under dataset_id/version_name.version_id/ folders.

      2. If a file content hash is already registered, the frame automatically links to the existing remote file instead of re-uploading the local copy.

      3. Inside the dataset/version folder, files are stored in the same path as on the local storage, relative to the provided local_dataset_root_path.

    • local_dataset_root_path (Union[str, Path, None]) – Required if auto_upload_destination is provided. It should point to the common folder for all local source files. This root folder is used to detect the relative path of each source file to be uploaded to the remote storage. Example: auto_upload_destination=‘s3://bucket/datasets/’, local_dataset_root_path=‘/home/user/data/’ ensures that the file ‘/home/user/data/images/01/1.jpg’ is uploaded to ‘s3://bucket/datasets/dataset_id/version_id/images/01/1.jpg’.

    • force_upload (Optional[bool]) – If True and auto_upload_destination is provided, force uploading the frames.

    • progress_report (Optional[int]) – Report frame uploaded every progress_report frames uploaded/registered, at batch_size granularity. (default: report every batch)

  • Return type

    None
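
The path mapping in the local_dataset_root_path example can be reproduced with a few lines of path arithmetic. This is a sketch of the documented example only, not the library's upload code: `remote_path` is a hypothetical helper, and the dataset_id/version_id folder layout is taken from that example.

```python
from pathlib import PurePosixPath


def remote_path(auto_upload_destination, dataset_id, version_id,
                local_dataset_root_path, local_file):
    """Map a local file to its remote location, following the documented example."""
    # relative_to raises ValueError if the file is not under the root folder.
    relative = PurePosixPath(local_file).relative_to(
        PurePosixPath(local_dataset_root_path)
    )
    return "/".join([
        auto_upload_destination.rstrip("/"),
        dataset_id,
        version_id,
        relative.as_posix(),
    ])


target = remote_path(
    "s3://bucket/datasets/",
    "dataset_id",
    "version_id",
    "/home/user/data/",
    "/home/user/data/images/01/1.jpg",
)
```

A file outside local_dataset_root_path has no relative path, which is why the sketch lets `relative_to` raise instead of guessing a location.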


update_frames#

update_frames(frames, batch_size=1000, refresh_version_stats=True)

Update existing frames in this DatasetVersion.

Find each frame by its ID and change its properties to match those of the frame object passed in frames. A frame exists in a version if it was previously added (e.g., by update_frames) or if it exists in a parent version. If a frame object does not have an ID, a new frame is created.

info

Only available if version is still in draft (writable) mode

  • Parameters

    • frames (list ) – A list of frames to update.

    • batch_size (int ) – Number of frames in a single update request (default: 1000). batch_size trades upload speed against reliability. It does not limit the number of frames per call, and in most cases there is no need to change it.

    • refresh_version_stats (Optional[bool]) – Automatically call commit_version after updating frames to refresh this version’s statistics.

  • Return type

    None
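
The role of batch_size is easiest to see as simple list chunking: every frame is still processed, but each request carries at most batch_size frames. The helper below is a hypothetical illustration of that idea, not the library's request logic.

```python
def iter_batches(items, batch_size=1000):
    """Yield consecutive slices of `items`, each holding at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


frame_ids = ["frame-%d" % i for i in range(2500)]
batch_sizes = [len(batch) for batch in iter_batches(frame_ids)]
```

Here 2500 frames with the default batch size are split into three requests, the last one partially filled.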


delete_frames#

delete_frames(frames, batch_size=1000, refresh_version_stats=True)

Delete frames from this DatasetVersion.

Frames may be given as ID strings or as DatasetVersion.Frame objects. Frames are deleted by ID; all other frame attributes (if any) are ignored.

info

Only available if version is still in draft (writable) mode.

  • Parameters

    • frames (Sequence[Union[FrameGroup, SingleFrame, dict, ForwardRef]]) – A list of frame objects or frame IDs (strings).

    • batch_size (int ) – Number of frame IDs in a single delete request (default: 1000). batch_size trades speed against reliability. It does not limit the number of frames per call, and in most cases there is no need to change it.

    • refresh_version_stats (Optional[bool]) – Automatically call commit_version after deleting frames to refresh this version’s statistics.

  • Return type

    None


get_bulk_context#

get_bulk_context(flush_threshold=None, log=None, refresh_version_stats=True, auto_upload_destination=None, local_dataset_root_path=None)

Get a context manager for bulk updates to this version.

The bulk context lets you add, edit, or remove data frames on this version in bulk instead of one by one.

info

There can only be one BulkContext per DatasetVersion. A second call to get_bulk_context will return the same object.

info

Only available if version is still in draft (writable) mode.

  • Parameters

    • flush_threshold (int ) – Commit the updates to the frames every flush_threshold updates. An update is a call to one of BulkContext.add_frame, BulkContext.update_frame, or BulkContext.delete_frame.

    • log (Optional[Logger]) – Logger object for the context to log to. Defaults to the datasetversion module logger.

    • refresh_version_stats (Optional[bool]) – Automatically call commit_version after deleting frames to refresh this version’s statistics.

    • auto_upload_destination (Optional[str]) – If specified, any local file linked by a SingleFrame/FrameGroup is automatically uploaded to the destination storage. Examples: ‘s3://bucket/datasets/’, ‘gs://bucket/dataset’, ‘azure://bucket/dataset’, ‘http://clearml-server/bucket/dataset’. Notes:

      1. The uploaded files keep the same structure inside the destination storage, under dataset_id/version_name.version_id/ folders.

      2. If a file content hash is already registered, the frame automatically links to the existing remote file instead of re-uploading the local copy.

      3. Inside the dataset/version folder, files are stored in the same path as on the local storage, relative to the provided local_dataset_root_path.

    • local_dataset_root_path (Union[str, Path, None]) – Required if auto_upload_destination is provided. It should point to the common folder for all local source files. This root folder is used to detect the relative path of each source file to be uploaded to the remote storage. Example: auto_upload_destination=‘s3://bucket/datasets/’, local_dataset_root_path=‘/home/user/data/’ ensures that the file ‘/home/user/data/images/01/1.jpg’ is uploaded to ‘s3://bucket/datasets/dataset_id/version_id/images/01/1.jpg’.

  • Return type

    BulkContext

  • Returns

    A bulk update context manager for this DatasetVersion


flush#

flush(refresh_version_stats=True)

Send any outstanding version changes.

If a BulkContext was obtained by get_bulk_context, any updates made using it are sent to the server. If not, this is a no-op.

  • Parameters

    refresh_version_stats (Optional[bool]) – Automatically call commit_version to refresh this version’s statistics.

  • Return type

    None


commit_version#

commit_version(**kwargs)

Commit this draft DatasetVersion, with all the changes made so far.

Committing a version merges the changes made to it with the parent version. Further changes to the version are still possible. This step is required before publishing the version.

warning

This is a blocking method and may take time to finish.

  • Return type

    CallResult

  • Parameters

    kwargs (Any ) –


publish_version#

publish_version()

Publish this DatasetVersion.

After publishing a version it is no longer a draft version and no further changes are allowed for this version.

  • Return type

    bool

  • Returns

    True if successful, False otherwise.
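
Taken together, the documented methods suggest an add, commit, publish sequence for a draft version. The sketch below shows that order under stated assumptions: `add_commit_publish` is a hypothetical helper, and `_RecordingVersion` is a test double that exposes only the members used here (draft, add_frames, commit_version, publish_version), so the snippet runs without the library. With a real DatasetVersion, add_frames already refreshes statistics by default, so the explicit commit is shown mainly to make the required step visible.

```python
def add_commit_publish(version, frames):
    """Add frames to a draft version, commit it, then publish it."""
    if not version.draft:
        raise ValueError("version is not a draft and cannot be modified")
    version.add_frames(frames)
    version.commit_version()          # required before publishing
    return version.publish_version()  # documented to return True on success


class _RecordingVersion:
    """Hypothetical test double; not the library's DatasetVersion."""

    def __init__(self):
        self.draft = True
        self.calls = []

    def add_frames(self, frames):
        self.calls.append(("add_frames", len(frames)))

    def commit_version(self):
        self.calls.append(("commit_version",))

    def publish_version(self):
        self.calls.append(("publish_version",))
        self.draft = False            # a published version is no longer writable
        return True


demo = _RecordingVersion()
published = add_commit_publish(demo, ["frame-1", "frame-2"])
```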


get_stats#

get_stats()

Returns this version’s statistics

  • Return type

    None


get_parent#

get_parent()

Returns the ID of this version’s parent version

  • Return type

    str


get_metadata#

get_metadata()

  • Return type

    dict

  • Returns

    return metadata (dict) of user defined values stored for the specific Dataset Version


set_metadata#

set_metadata(metadata)

Store metadata (a dict of user-defined values) for this Dataset Version.

  • Parameters

    metadata (dict ) – Key/value dictionary (nested dictionaries are supported).

  • Return type

    bool

  • Returns

    True if successful (the metadata of locked/published versions cannot be changed).


set_masks_labels#

set_masks_labels(mask_value_label_mapping)

Store a global (dataset-version-wide) lookup from per-pixel mask values to labels. For example:

    {
        (0, 0, 0): ["background"],
        (1, 1, 1): ["person", "sitting"],
        (2, 2, 2): ["cat"],
    }

The pixel mask label lookup is stored as a property in the dataset version metadata. Specifically: dataset.get_metadata()[‘mask_labels’] = {…}

  • Parameters

    mask_value_label_mapping (dict ) – Key/value dictionary. Each key is a tuple of integers, and each value is a list/tuple of strings.

  • Return type

    bool

  • Returns

    True if successful (the metadata of locked/published versions cannot be changed).


get_masks_labels#

get_masks_labels()

Get the global (dataset-version-wide) lookup from per-pixel mask values to labels. For example:

    {
        (0, 0, 0): ["background"],
        (1, 1, 1): ["person", "sitting"],
        (2, 2, 2): ["cat"],
    }

The pixel mask label lookup is stored as a property in the dataset version metadata. Specifically: dataset.get_metadata()[‘mask_labels’] = {…}

  • Return type

    Dict[tuple, tuple]

  • Returns

    key/value dictionary. key is a tuple of integers, and value is a list/tuple of strings
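
A mapping of this shape can be used directly to translate a pixel value into its labels. The snippet below is a minimal sketch on a plain Python dict with the same structure as the example above; `labels_for_pixel` is a hypothetical helper and does not call the library.

```python
mask_labels = {
    (0, 0, 0): ["background"],
    (1, 1, 1): ["person", "sitting"],
    (2, 2, 2): ["cat"],
}


def labels_for_pixel(mapping, pixel):
    """Return the labels registered for a pixel value, or an empty list."""
    # tuple() lets callers pass a list or any other sequence of integers.
    return list(mapping.get(tuple(pixel), ()))


person_labels = labels_for_pixel(mask_labels, [1, 1, 1])
unknown_labels = labels_for_pixel(mask_labels, (9, 9, 9))
```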