Data Management from CLI
In this example we'll create a simple dataset and demonstrate basic actions on it, using the clearml-data
CLI.
Prerequisites
First, make sure that you have cloned the clearml repository. It contains all the needed files.
Open terminal and change directory to the cloned repository's examples folder
cd clearml/examples/reporting
Creating Initial Dataset
To create the dataset, run this code:
clearml-data create --project datasets --name HelloDataset
Expected response:
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=24d05040f3e14fbfbed8edb1bf08a88cAdd a folder. File addition is recursive, so it's enough to point at the folder to captures all files and sub-folders:
clearml-data add --files data_samples
Expected response:
clearml-data - Dataset Management & Versioning CLI
Adding files/folder to dataset id 24d05040f3e14fbfbed8edb1bf08a88c
Generating SHA2 hash for 2 files
Hash generation completed
5 files added
After creating a dataset, its ID doesn't need to be specified when running commands, such as add
, remove
, or list
Close the dataset - this command uploads the files. By default, the files are uploaded to the file server, but this can be configured with the
--storage
flag to any of ClearML's supported storage mediums (see storage). The command also finalizes the dataset, making it immutable and ready to be consumed.clearml-data close
Expected response:
clearml-data - Dataset Management & Versioning CLI
Finalizing dataset id 24d05040f3e14fbfbed8edb1bf08a88c
Pending uploads, starting dataset upload to https://files.community.clear.ml
Uploading compressed dataset changes (4 files, total 221.56 KB) to https://files.community.clear.ml
Upload completed (221.56 KB)
2021-05-04 09:32:03,388 - clearml.Task - INFO - Waiting to finish uploads
2021-05-04 09:32:04,067 - clearml.Task - INFO - Finished uploading
Dataset closed and finalized
Listing Dataset Content
To see that all the files were added to the created dataset, use clearml-data list
and enter the ID of the dataset
that was just closed.
clearml-data list --id 24d05040f3e14fbfbed8edb1bf08a88c
Expected response:
clearml-data - Dataset Management & Versioning CLI
List dataset content: 24d05040f3e14fbfbed8edb1bf08a88c
Listing dataset content
file name | size | hash
-----------------------------------------------------------------------------------------------------------------
dancing.jpg | 40,484 | 78e804c0c1d54da8d67e9d072c1eec514b91f4d1f296cdf9bf16d6e54d63116a
data.csv | 21,440 | b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5
picasso.jpg | 114,573 | 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c
sample.json | 132 | 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7
sample.mp3 | 72,142 | fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b
Total 5 files, 248771 bytes
Creating a Child Dataset
Using ClearML Data, you can create child datasets that inherit the content of other datasets.
Create a new dataset, specifying the previously created one as its parent:
clearml-data create --project datasets --name HelloDataset-improved --parents 24d05040f3e14fbfbed8edb1bf08a88c
noteYou'll need to input the Dataset ID you received when created the dataset above
Add a new file.
- Create a new file:
echo "data data data" > new_data.txt
- Now add the file to the dataset:
clearml-data add --files new_data.txt
The console should display this output:
clearml-data - Dataset Management & Versioning CLI
Adding files/folder to dataset id 8b68686a4af040d081027ba3cf6bbca6
1 file added- Create a new file:
Remove a file. You need to specify the file's full path (within the dataset, not locally) to remove it.
clearml-data remove --files data_samples/dancing.jpg
Expected response:
clearml-data - Dataset Management & Versioning CLI
Removing files/folder from dataset id 8b68686a4af040d081027ba3cf6bbca6
1 files removedClose and finalize the dataset
clearml-data close
Look again at the files in the dataset:
clearml-data list --id 8b68686a4af040d081027ba3cf6bbca6
And see that the changes have been made!
new_data.txt
has been added, anddancing.jpg
has been removed.file name | size | hash
------------------------------------------------------------------------------------------------------------------------------------------------
data.csv | 21,440 | b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5
new_data.txt | 15 | 6df986a2154902260a836febc5a32543f5337eac60560c57db99257a7e012051
picasso.jpg | 114,573 | 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c
sample.json | 132 | 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7
sample.mp3 | 72,142 | fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b
Total 5 files, 208302 bytes
By using clearml-data
, a clear lineage is created for the data. As seen in this example, when a dataset is closed, the
only way to add or remove data is to create a new dataset, and to use the previous dataset as a parent. This way, the data
is not reliant on the code and is reproducible.