Once Tasks are defined and in the ClearML system, they can be chained together to create Pipelines.
Pipelines provide users with a greater level of abstraction and automation, with Tasks running one after the other.
Tasks can interface with other Tasks in the pipeline and leverage other Tasks' work products.
We'll go through a scenario where we create a Dataset, process the data, and then consume it with another Task, all running as a pipeline.
Let's assume we have some code that extracts data from a production database into a local folder. Our goal is to create an immutable copy of the data to be used by the following steps:
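A minimal sketch of this step with the ClearML SDK (the project name, Dataset name, and local folder below are illustrative, not prescribed):

```python
from clearml import Dataset

# Create a new Dataset version (names here are placeholders)
dataset = Dataset.create(
    dataset_project="data",
    dataset_name="raw_dataset",
)

# Assume ./local_data holds the files exported from the production database
dataset.add_files(path="./local_data")

# Upload the files and close the version, making it immutable
dataset.upload()
dataset.finalize()
```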
We could also add the tag `latest` to the Dataset, marking it as the latest version.
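Assuming the `dataset` object from the creation sketch above, tagging could look like:

```python
# Mark this version as the latest one; to keep the tag meaningful,
# remove it from older versions so only one Dataset carries it at a time
dataset.tags = ["latest"]
```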
The second step is to preprocess the data. First we need to access it, then modify it, and lastly create a new version of the data.
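A sketch of the preprocessing step, reusing the illustrative names from above (the actual preprocessing logic is elided):

```python
from clearml import Dataset

# Get the raw data version created in the previous step
dataset = Dataset.get(dataset_project="data", dataset_name="raw_dataset")

# Request an editable local copy that we are allowed to modify
dataset_folder = dataset.get_mutable_local_copy(target_folder="./processed_data")

# ... preprocess the files inside dataset_folder in place ...

# Create a new version that inherits the parent version's content
new_dataset = Dataset.create(
    dataset_project="data",
    dataset_name="processed_dataset",
    parent_datasets=[dataset],
)
# Only files that differ from the parent are stored
new_dataset.sync_folder(local_path=dataset_folder)
new_dataset.upload()
new_dataset.finalize()
```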
We passed the `parents` argument when we created v2 of the Dataset, so the new version inherits all of the parent version's content.
This will not only help us trace dataset changes with full genealogy, but will also make our storage more efficient, as it only stores the files that were changed or added relative to the parent versions.
When we later need access to the Dataset, it will merge the files from all parent versions in a fully automatic and transparent process, as if they were always part of the requested Dataset.
We can now train our model with the latest Dataset we have in the system.
We will do that by getting the Dataset instance with the `latest` tag (if by any chance two Datasets share the same tag, we will get the newest one).
Once we have the Dataset we can request a local copy of the data. All local copy requests are cached, so accessing the same Dataset multiple times will not trigger any unnecessary downloads.
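The training step could then start like this (the project and task names are illustrative; the training code itself is elided):

```python
from clearml import Dataset, Task

task = Task.init(project_name="training", task_name="train_model")

# Fetch the newest Dataset carrying the 'latest' tag
dataset = Dataset.get(dataset_project="data", dataset_tags=["latest"])

# Cached: repeated requests for the same version won't re-download the files
local_folder = dataset.get_local_copy()

# ... load the data from local_folder and train the model ...
```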
Now that we have the data creation step and the model training step, let's create a pipeline that, when executed, will run the first step and then the second. It is important to remember that pipelines are Tasks by themselves and can also be automated by other pipelines (i.e. pipelines of pipelines).
We could also pass parameters from one step to the other (for example, the ID of a previous step's Task).
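A sketch of such a pipeline using `PipelineController`; the step names reference the (illustrative) Tasks from the earlier steps, and the `parameter_override` line shows one way to pass a value between steps:

```python
from clearml import PipelineController

pipe = PipelineController(
    name="dataset_pipeline",
    project="pipelines",
    version="1.0",
)

# Each step clones an existing Task; project/name values are placeholders
pipe.add_step(
    name="preprocess",
    base_task_project="data",
    base_task_name="preprocess_dataset",
)
pipe.add_step(
    name="train",
    parents=["preprocess"],
    base_task_project="training",
    base_task_name="train_model",
    # Inject the preprocessing step's Task ID as a parameter of the
    # training Task (assumes the training Task reads such a parameter)
    parameter_override={"General/dataset_task_id": "${preprocess.id}"},
)

pipe.start()
```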
See the full pipelines documentation for more details.