StorageManager

class StorageManager()

StorageManager is a helper interface for downloading and uploading files to supported remote storage. Supported remote servers: http(s)/S3/GS/Azure/file-system folder. Caching is enabled by default for all downloaded remote URLs/files.


StorageManager.download_file

classmethod download_file(remote_url, local_folder=None, overwrite=False, skip_zero_size_check=False, silence_errors=False)

Download remote file to the local machine, maintaining the sub folder structure from the remote storage.

info

If we have a remote file s3://bucket/sub/file.ext, then StorageManager.download_file('s3://bucket/sub/file.ext', '~/folder/') will create ~/folder/sub/file.ext

  • Parameters

    • remote_url (str) – Source remote storage location; the path of remote_url will be created under the target local_folder. Supports S3/GS/Azure and shared filesystem. Example: 's3://bucket/data/'

    • overwrite (bool) – If False, do not download files that already exist in the target. If True, always download the remote files. Default: False.

    • skip_zero_size_check (bool ) – If True, no error will be raised for files with zero bytes size.

    • silence_errors (bool ) – If True, silence errors that might pop up when trying to download files stored remotely. Default False

    • local_folder (Optional[str]) – Local target folder under which to recreate the path of remote_url. If None, use the cache folder. (Default: use cache folder)

  • Return type

    Optional[str]

  • Returns

    Path to downloaded file or None on error
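The sub-folder mapping described in the note above can be sketched with plain path logic. This is an illustration of the documented behavior only, not ClearML's internal code, and the `expected_local_path` helper is a name introduced here for the sketch:

```python
from urllib.parse import urlparse
import os

def expected_local_path(remote_url: str, local_folder: str) -> str:
    """Mirror the documented mapping: the path component of remote_url
    (without the bucket) is recreated under local_folder."""
    parsed = urlparse(remote_url)
    # e.g. 's3://bucket/sub/file.ext' -> netloc='bucket', path='/sub/file.ext'
    rel = parsed.path.lstrip("/")
    return os.path.join(local_folder, rel)

print(expected_local_path("s3://bucket/sub/file.ext", "/tmp/folder"))
# -> /tmp/folder/sub/file.ext, matching the note above
```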


StorageManager.download_folder

classmethod download_folder(remote_url, local_folder=None, match_wildcard=None, overwrite=False, skip_zero_size_check=False, silence_errors=False, max_workers=None)

Download remote folder recursively to the local machine, maintaining the sub folder structure from the remote storage.

info

If we have a remote file s3://bucket/sub/file.ext, then StorageManager.download_folder('s3://bucket/', '~/folder/') will create ~/folder/sub/file.ext

  • Parameters

    • remote_url (str) – Source remote storage location; the tree structure of remote_url will be created under the target local_folder. Supports S3/GS/Azure and shared filesystem. Example: 's3://bucket/data/'

    • local_folder (str ) – Local target folder to create the full tree from remote_url. If None, use the cache folder. (Default: use cache folder)

    • match_wildcard (Optional[str]) – If specified, only download files matching match_wildcard. Example: *.json

    • overwrite (bool ) – If False, and target files exist do not download. If True, always download the remote files. Default False.

    • skip_zero_size_check (bool ) – If True, no error will be raised for files with zero bytes size.

    • silence_errors (bool ) – If True, silence errors that might pop up when trying to download files stored remotely. Default False

    • max_workers (int ) – If value is set to a number, it will spawn the specified number of worker threads to download the contents of the folder in parallel. Otherwise, if set to None, it will internally use as many threads as there are logical CPU cores in the system (this is default Python behavior). Default None

  • Return type

    Optional[str]

  • Returns

    Target local folder
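`match_wildcard` presumably follows shell-style glob patterns; a sketch with Python's standard `fnmatch` shows which files a pattern such as `*.json` would select. This is an illustration of the filtering idea, not ClearML's implementation:

```python
from fnmatch import fnmatch

# Hypothetical listing of a remote folder's contents
remote_files = ["sub/config.json", "sub/model.bin", "data/stats.json"]

# Keep only the files matching the wildcard, as match_wildcard='*.json' would
selected = [f for f in remote_files if fnmatch(f, "*.json")]
print(selected)  # ['sub/config.json', 'data/stats.json']
```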


StorageManager.exists_file

classmethod exists_file(remote_url)

Check if remote file exists. Note that this function will return False for directories.

  • Parameters

    remote_url (str) – The url where the file is stored. E.g. 's3://bucket/some_file.txt', 'file://local/file'

  • Return type

    bool

  • Returns

    True if the remote_url stores a file, False otherwise
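For local paths, the file-vs-directory distinction described above matches `os.path.isfile`: an existing directory yields False because it is not a file. A local sketch of that semantics (not a call to ClearML):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "some_file.txt")
    open(path, "w").close()
    is_file = os.path.isfile(path)      # True: a regular file
    dir_as_file = os.path.isfile(d)     # False: exists, but is a directory
    print(is_file, dir_as_file)
```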


StorageManager.get_file_size_bytes

classmethod get_file_size_bytes(remote_url, silence_errors=False)

Get size of the remote file in bytes.

  • Parameters

    • remote_url (str) – The url where the file is stored. E.g. 's3://bucket/some_file.txt', 'file://local/file'

    • silence_errors (bool ) – Silence errors that might occur when fetching the size of the file. Default: False

  • Return type

    Optional[int]

  • Returns

    The size of the file in bytes. None if the file could not be found or an error occurred.


StorageManager.get_local_copy

classmethod get_local_copy(remote_url, cache_context=None, extract_archive=True, name=None, force_download=False)

Get a local copy of the remote file. If the remote URL is a direct file access, the returned link is the same; otherwise a link to a local copy of the file is returned. Caching is enabled by default, and the cache is limited by the number of stored files per cache context. The oldest-accessed files are deleted when the cache is full. This function can also be used to prevent the deletion of a cached file, as the respective file will have its timestamp refreshed.

  • Parameters

    • remote_url (str ) – remote url link (string)

    • cache_context (str) – Optional caching context identifier (string), default context 'global'

    • extract_archive (bool) – If True, the returned path will be a cached folder containing the archive's content. Currently only zip files are supported.

    • name (str ) – name of the target file

    • force_download (bool ) – download file from remote even if exists in local cache

  • Return type

    Optional[str]

  • Returns

    Full path to the local copy of the requested url. Returns None on error.
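The eviction policy described above (a per-context file limit, with the oldest-accessed files removed first) can be sketched as follows. This illustrates the policy only; it is not ClearML's implementation, and `evict_oldest` is a name invented for the sketch:

```python
def evict_oldest(cache_entries: dict, file_limit: int) -> dict:
    """cache_entries maps file path -> last access timestamp.
    Keep only the file_limit most recently accessed entries."""
    kept = sorted(cache_entries.items(), key=lambda kv: kv[1], reverse=True)[:file_limit]
    return dict(kept)

# 'a.bin' has the oldest access time, so it is evicted first
cache = {"a.bin": 100, "b.bin": 300, "c.bin": 200}
print(evict_oldest(cache, 2))  # keeps 'b.bin' and 'c.bin'
```

This is also why calling get_local_copy on a cached file protects it: refreshing its access timestamp moves it away from the eviction front.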


StorageManager.get_metadata

classmethod get_metadata(remote_url, return_full_path=False)

Get the metadata of the remote object. The metadata is a dict containing the following keys: name, size.

  • Parameters

    • remote_url (str) – Remote storage location of the object. Supports S3/GS/Azure, shared filesystem and http(s). Example: 's3://bucket/data/'

    • return_full_path (bool) – True for returning a full path (with the base url)

  • Return type

    Optional[dict]

  • Returns

    A dict containing the metadata of the remote object. In case of an error, None is returned
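A local sketch of the documented metadata shape (a dict with `name` and `size` keys), using a temporary file in place of a remote object. The `local_metadata` helper is introduced here for illustration only:

```python
import os
import tempfile

def local_metadata(path: str, return_full_path: bool = False) -> dict:
    # Mirrors the documented dict shape: {'name': ..., 'size': ...}
    name = path if return_full_path else os.path.basename(path)
    return {"name": name, "size": os.path.getsize(path)}

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"12345")
meta = local_metadata(f.name)
print(meta)  # name is the base name; size is 5 bytes
os.remove(f.name)
```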


StorageManager.list

classmethod list(remote_url, return_full_path=False, with_metadata=False)

Return a list of object names inside the base path or dictionaries containing the corresponding objects’ metadata (in case with_metadata is True)

  • Parameters

    • remote_url (str) – The base path. For Google Storage, Azure and S3 it is the bucket of the path; for local files it is the root directory. For example, for AWS S3, s3://bucket/folder will list all the files stored under that prefix. The same behaviour applies to Google Storage (gs://bucket/folder), Azure blob storage (azure://bucket/folder) and file-system listing (/mnt/share/folder)

    • return_full_path (bool ) – If True, return a list of full object paths, otherwise return a list of relative object paths (default False).

    • with_metadata (bool) – Instead of returning just the names of the objects, return a list of dictionaries containing the name and metadata of the remote file. Thus, each dictionary will contain the following keys: name, size. return_full_path will modify the name of each dictionary entry to the full path.

  • Return type

    Optional[List[Union[str, dict]]]

  • Returns

    The paths of all the objects in the storage base path under the prefix, or the dictionaries containing the objects' metadata, relative to the base path. None if the list operation is not supported (for example, for the http and https protocols)
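For the local-filesystem case, the `return_full_path` behavior can be sketched with `os.walk`, under the assumption that relative paths are taken against the base folder. This is an illustration, not ClearML's code:

```python
import os
import tempfile

def list_local(base: str, return_full_path: bool = False) -> list:
    out = []
    for root, _dirs, files in os.walk(base):
        for name in files:
            full = os.path.join(root, name)
            out.append(full if return_full_path else os.path.relpath(full, base))
    return sorted(out)

with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "sub"))
    open(os.path.join(d, "sub", "x.txt"), "w").close()
    relative = list_local(d)                        # relative object paths
    full = list_local(d, return_full_path=True)     # full paths, with base
print(relative)
```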


StorageManager.set_cache_file_limit

classmethod set_cache_file_limit(cache_file_limit, cache_context=None)

Set the cache context file limit. File limit is the maximum number of files the specific cache context holds. Notice, there is no limit on the size of these files, only the total number of cached files.

  • Parameters

    • cache_file_limit (int ) – New maximum number of cached files

    • cache_context (str ) – Optional cache context identifier, default global context

  • Return type

    int

  • Returns

    The new cache context file limit.


StorageManager.set_report_download_chunk_size

classmethod set_report_download_chunk_size(chunk_size_mb)

Set the download progress report chunk size (in MB). The chunk size determines how often progress reports are logged: every time a chunk of data larger than chunk_size_mb is downloaded, a report is logged. This function overwrites the sdk.storage.log.report_download_chunk_size_mb config entry.

  • Parameters

    chunk_size_mb (int ) – The chunk size, in megabytes

  • Return type

    None


StorageManager.set_report_upload_chunk_size

classmethod set_report_upload_chunk_size(chunk_size_mb)

Set the upload progress report chunk size (in MB). The chunk size determines how often progress reports are logged: every time a chunk of data larger than chunk_size_mb is uploaded, a report is logged. This function overwrites the sdk.storage.log.report_upload_chunk_size_mb config entry.

  • Parameters

    chunk_size_mb (int ) – The chunk size, in megabytes

  • Return type

    None
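The chunked reporting described above amounts to logging once every time another chunk_size_mb worth of data has been transferred. A plain-Python sketch of that counting logic (an illustration, not the SDK's code):

```python
def progress_reports(total_bytes: int, piece_size: int, chunk_size_mb: int) -> int:
    """Count how many progress reports would be logged while transferring
    total_bytes in piece_size-sized pieces, reporting once per chunk_size_mb."""
    chunk = chunk_size_mb * 1024 * 1024
    reports, since_last = 0, 0
    for _ in range(0, total_bytes, piece_size):
        since_last += piece_size
        if since_last >= chunk:
            reports += 1
            since_last = 0
    return reports

# 10 MB transferred in 1 MB pieces, reporting every 4 MB -> 2 reports
print(progress_reports(10 * 1024**2, 1024**2, 4))
```

A larger chunk_size_mb therefore means fewer, coarser progress log lines.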


StorageManager.upload_file

classmethod upload_file(local_file, remote_url, wait_for_upload=True, retries=None)

Upload a local file to a remote location. The remote URL is the final destination of the uploaded file.

Examples:

upload_file('/tmp/artifact.yaml', 'http://localhost:8081/manual_artifacts/my_artifact.yaml')
upload_file('/tmp/artifact.yaml', 's3://a_bucket/artifacts/my_artifact.yaml')
upload_file('/tmp/artifact.yaml', '/mnt/share/folder/artifacts/my_artifact.yaml')
  • Parameters

    • local_file (str ) – Full path of a local file to be uploaded

    • remote_url (str ) – Full path or remote url to upload to (including file name)

    • wait_for_upload (bool ) – If False, return immediately and upload in the background. Default True.

    • retries (int ) – Number of retries before failing to upload file.

  • Return type

    str

  • Returns

    Newly uploaded remote URL.


StorageManager.upload_folder

classmethod upload_folder(local_folder, remote_url, match_wildcard=None)

Upload local folder recursively to a remote storage, maintaining the sub folder structure in the remote storage.

info

If we have a local file ~/folder/sub/file.ext, then StorageManager.upload_folder('~/folder/', 's3://bucket/') will create s3://bucket/sub/file.ext

  • Parameters

    • local_folder (str ) – Local folder to recursively upload

    • remote_url (str) – Target remote storage location; the tree structure of local_folder will be created under the target remote_url. Supports http(s)/S3/GS/Azure and shared filesystem. Example: 's3://bucket/data/'

    • match_wildcard (str) – If specified, only upload files matching match_wildcard. Example: *.json. Notice: target file size/date are not checked, and by default all files are uploaded. Notice: if uploading to http, the target is always overwritten.

  • Return type

    Optional[str]

  • Returns

    Newly uploaded remote URL or None on error.
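The upload mapping is the mirror of the download one: a local file's path relative to local_folder becomes its key under remote_url. A sketch of that path logic (pure Python, not a ClearML call; `expected_remote_url` is a name invented for the sketch):

```python
import os
from posixpath import join as url_join  # remote keys always use '/'

def expected_remote_url(local_file: str, local_folder: str, remote_url: str) -> str:
    """Mirror the documented mapping: the path of local_file relative to
    local_folder is recreated under remote_url."""
    rel = os.path.relpath(local_file, local_folder).replace(os.sep, "/")
    return url_join(remote_url.rstrip("/"), rel)

print(expected_remote_url("/home/user/folder/sub/file.ext",
                          "/home/user/folder", "s3://bucket"))
# -> s3://bucket/sub/file.ext, matching the note above
```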