StorageManager
class StorageManager()
StorageManager is a helper interface for downloading and uploading files to supported remote storage. Supported remote servers: http(s), S3, GS, Azure, and file-system folders. Caching is enabled by default for all downloaded remote URLs/files.
StorageManager.download_file
classmethod download_file(remote_url, local_folder=None, overwrite=False, skip_zero_size_check=False, silence_errors=False)
Download remote file to the local machine, maintaining the sub folder structure from the remote storage.
If we have a remote file s3://bucket/sub/file.ext, then StorageManager.download_file('s3://bucket/sub/file.ext', '~/folder/') will create ~/folder/sub/file.ext
Parameters
remote_url (str) – Source remote storage location; the path of remote_url will be created under the target local_folder. Supports S3/GS/Azure and shared filesystem. Example: 's3://bucket/data/'
local_folder (Optional[str]) – Local target folder under which the remote path is created. If None, use the cache folder. (Default: use cache folder)
overwrite (bool) – If False, do not download files that already exist locally. If True, always download the remote files. Default: False
skip_zero_size_check (bool) – If True, no error is raised for files with a size of zero bytes. Default: False
silence_errors (bool) – If True, silence errors that might occur when trying to download files stored remotely. Default: False
Return type
Optional[str]
Returns
Path to the downloaded file, or None on error
StorageManager.download_folder
classmethod download_folder(remote_url, local_folder=None, match_wildcard=None, overwrite=False, skip_zero_size_check=False, silence_errors=False, max_workers=None)
Download remote folder recursively to the local machine, maintaining the sub folder structure from the remote storage.
If we have a remote file s3://bucket/sub/file.ext, then StorageManager.download_folder('s3://bucket/', '~/folder/') will create ~/folder/sub/file.ext
Parameters
remote_url (str) – Source remote storage location; the tree structure of remote_url will be created under the target local_folder. Supports S3/GS/Azure and shared filesystem. Example: 's3://bucket/data/'
local_folder (str ) – Local target folder to create the full tree from remote_url. If None, use the cache folder. (Default: use cache folder)
match_wildcard (Optional[str]) – If specified, only download files matching match_wildcard. Example: *.json
overwrite (bool) – If False, do not download files that already exist locally. If True, always download the remote files. Default: False
skip_zero_size_check (bool) – If True, no error is raised for files with a size of zero bytes. Default: False
silence_errors (bool) – If True, silence errors that might occur when trying to download files stored remotely. Default: False
max_workers (int) – If set to a number, spawn that many worker threads to download the folder contents in parallel. If None, use as many threads as there are logical CPU cores in the system (the default Python behavior). Default: None
Return type
Optional[str]
Returns
Path to the target local folder
StorageManager.exists_file
classmethod exists_file(remote_url)
Check if remote file exists. Note that this function will return False for directories.
Parameters
remote_url (str) – The URL where the file is stored. E.g. 's3://bucket/some_file.txt', 'file://local/file'
Return type
bool
Returns
True if remote_url stores a file, False otherwise
StorageManager.get_file_size_bytes
classmethod get_file_size_bytes(remote_url, silence_errors=False)
Get size of the remote file in bytes.
Parameters
remote_url (str) – The URL where the file is stored. E.g. 's3://bucket/some_file.txt', 'file://local/file'
silence_errors (bool ) – Silence errors that might occur when fetching the size of the file. Default: False
Return type
Optional[int]
Returns
The size of the file in bytes. None if the file could not be found or an error occurred.
StorageManager.get_local_copy
classmethod get_local_copy(remote_url, cache_context=None, extract_archive=True, name=None, force_download=False)
Get a local copy of the remote file. If the remote URL is a direct file access, the returned link is the same; otherwise a link to a local copy of the file is returned. Caching is enabled by default, and the cache is limited by the number of stored files per cache context. The oldest accessed files are deleted when the cache is full. This function can also be used to prevent a cached file from being deleted, since calling it refreshes the file's access timestamp.
Parameters
remote_url (str ) – remote url link (string)
cache_context (str) – Optional caching context identifier (string). Default context: 'global'
extract_archive (bool) – If True, the returned path will be a cached folder containing the archive's content. Currently only zip files are supported.
name (str ) – name of the target file
force_download (bool) – Download the file from the remote storage even if it exists in the local cache
Return type
Optional[str]
Returns
Full path to a local copy of the requested URL, or None on error
StorageManager.get_metadata
classmethod get_metadata(remote_url, return_full_path=False)
Get the metadata of the remote object. The metadata is a dict containing the following keys: name, size.
Parameters
remote_url (str) – Remote location of the object. Supports S3/GS/Azure, shared filesystem, and http(s). Example: 's3://bucket/data/file.txt'
return_full_path (bool) – If True, return the full path (including the base URL)
Return type
Optional[dict]
Returns
A dict containing the metadata of the remote object, or None in case of an error
StorageManager.list
classmethod list(remote_url, return_full_path=False, with_metadata=False)
Return a list of object names inside the base path, or a list of dictionaries containing the corresponding objects' metadata (if with_metadata is True)
Parameters
remote_url (str) – The base path. For Google Storage, Azure, and S3 it is the bucket plus path; for local files it is the root directory. For example, for AWS S3, s3://bucket/folder will list all the files under s3://bucket/folder/. The same behavior applies to Google Storage (gs://bucket/folder), Azure Blob Storage (azure://bucket/folder), and file-system listing (/mnt/share/folder)
return_full_path (bool ) – If True, return a list of full object paths, otherwise return a list of relative object paths (default False).
with_metadata (bool) – Instead of returning only the names of the objects, return a list of dictionaries containing the name and metadata of each remote file. Each dictionary will contain the keys: name, size. return_full_path will modify the name entry of each dictionary to the full path.
Return type
Optional[List[Union[str, dict]]]
Returns
The paths of all the objects in the storage base path (under the prefix), or the dictionaries containing the objects' metadata, relative to the base path. None if the list operation is not supported (for example, for the http and https protocols)
StorageManager.set_cache_file_limit
classmethod set_cache_file_limit(cache_file_limit, cache_context=None)
Set the cache context file limit. The file limit is the maximum number of files the specific cache context holds. Note that there is no limit on the size of these files, only on the total number of cached files.
Parameters
cache_file_limit (int ) – New maximum number of cached files
cache_context (str ) – Optional cache context identifier, default global context
Return type
int
Returns
The new cache context file limit.
StorageManager.set_report_download_chunk_size
classmethod set_report_download_chunk_size(chunk_size_mb)
Set the download progress report chunk size (in MB). The chunk size determines how often progress reports are logged: every time a chunk of data larger than chunk_size_mb is downloaded, a report is logged. This function overrides the sdk.storage.log.report_download_chunk_size_mb config entry.
Parameters
chunk_size_mb (int ) – The chunk size, in megabytes
Return type
None
StorageManager.set_report_upload_chunk_size
classmethod set_report_upload_chunk_size(chunk_size_mb)
Set the upload progress report chunk size (in MB). The chunk size determines how often progress reports are logged: every time a chunk of data larger than chunk_size_mb is uploaded, a report is logged. This function overrides the sdk.storage.log.report_upload_chunk_size_mb config entry.
Parameters
chunk_size_mb (int ) – The chunk size, in megabytes
Return type
None
StorageManager.upload_file
classmethod upload_file(local_file, remote_url, wait_for_upload=True, retries=None)
Upload a local file to a remote location. remote_url is the final destination of the uploaded file.
Examples:
upload_file('/tmp/artifact.yaml', 'http://localhost:8081/manual_artifacts/my_artifact.yaml')
upload_file('/tmp/artifact.yaml', 's3://a_bucket/artifacts/my_artifact.yaml')
upload_file('/tmp/artifact.yaml', '/mnt/share/folder/artifacts/my_artifact.yaml')
Parameters
local_file (str ) – Full path of a local file to be uploaded
remote_url (str ) – Full path or remote url to upload to (including file name)
wait_for_upload (bool ) – If False, return immediately and upload in the background. Default True.
retries (int) – Number of retries before failing to upload the file.
Return type
str
Returns
Newly uploaded remote URL.
StorageManager.upload_folder
classmethod upload_folder(local_folder, remote_url, match_wildcard=None)
Upload local folder recursively to a remote storage, maintaining the sub folder structure in the remote storage.
If we have a local file ~/folder/sub/file.ext, then StorageManager.upload_folder('~/folder/', 's3://bucket/') will create s3://bucket/sub/file.ext
Parameters
local_folder (str ) – Local folder to recursively upload
remote_url (str ) – Target remote storage location, tree structure of local_folder will be created under the target remote_url. Supports Http/S3/GS/Azure and shared filesystem. Example: ‘s3://bucket/data/’
match_wildcard (str) – If specified, only upload files matching match_wildcard. Example: *.json. Note: the target file size/date are not checked before uploading, so matching files are always uploaded. Note: when uploading to http, the target is always overwritten.
Return type
Optional[str]
Returns
Newly uploaded remote URL or None on error.