Originally published by Nicolas Jomeau – republished with the author's approval
Most data science projects don't pass the PoC phase and hence never generate any business value. In 2019, Gartner estimated that "through 2022, only 20% of analytic insights will deliver business outcomes". One of the main reasons for this is undoubtedly that data scientists often lack a clear vision of how to deploy their solutions into production, how to integrate them with existing systems and workflows, and how to operate and maintain them. This is where MLOps comes into play, as the supposed answer to all problems regarding the industrialization of machine learning solutions.
By applying DevOps principles to everyday data science work, MLOps aims to increase automation across the machine learning lifecycle: improving reproducibility and transparency during the experiment phase, enabling reliable and efficient deployment of ML models to production, and making those models observable. While a myriad of integrated ML platforms exist, such as Dataiku, DataRobot, or Azure ML, my internship at ELCA focused on exploring various open-source solutions to tackle the problem.
According to ThoughtWorks’ research, a good MLOps infrastructure should possess multiple features:
- A fast experiment cycle: a data scientist can quickly try new data processing techniques, model architectures, or parameters and have the results (model artifacts and metrics) logged in one place for easy benchmarking across experiments (see the tracking sketch after this list).
- Continuous integration: an automated pipeline ensures that ML code passes unit tests and that trained models both exceed a minimum performance threshold and respect the product interface, to avoid breaking changes (a threshold test is sketched below).
- Continuous deployment: one click (if not zero) is enough to package, test, and immediately deploy a model into production. If it updates a previously deployed model, the rollout is transparent to users, with no downtime.
- Monitoring: once deployed, model performance and data-drift metrics are continuously monitored to detect anomalies early and automatically raise alerts if the model or infrastructure must be updated (a drift check is sketched below).
- Scalability: the infrastructure scales with the load the model is experiencing, to avoid overspending and wasting costly computing resources on the one hand, or under-provisioning and risking degraded performance on the other.
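As a concrete example of the experiment-tracking point, here is a minimal sketch using MLflow, one popular open-source option (not prescribed by this article); the tracking URI and experiment name are placeholder values:

```python
# Minimal experiment-tracking sketch with MLflow. The tracking URI and
# experiment name are hypothetical placeholders.
import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://mlflow.example.internal:5000")  # placeholder server
mlflow.set_experiment("iris-baseline")

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), random_state=42
)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 4}
    model = RandomForestClassifier(**params).fit(X_train, y_train)

    # Parameters, metrics, and the model artifact all land in one place,
    # so runs can be compared side by side in the MLflow UI.
    mlflow.log_params(params)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")
```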
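The CI gate on model quality can be as simple as a test that fails the pipeline when a candidate model drops below the agreed threshold. Below is a hedged pytest sketch; the dataset, model, and threshold are purely illustrative:

```python
# CI gate sketch: fail the pipeline if the candidate model falls below a
# minimum performance threshold or breaks the expected interface.
# Dataset, model, and threshold are illustrative, not the article's actual setup.
import pytest
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MIN_ACCURACY = 0.90  # assumed product requirement


@pytest.fixture
def candidate_model_and_data():
    X_train, X_test, y_train, y_test = train_test_split(
        *load_iris(return_X_y=True), random_state=0
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model, X_test, y_test


def test_model_meets_minimum_accuracy(candidate_model_and_data):
    model, X_test, y_test = candidate_model_and_data
    assert model.score(X_test, y_test) >= MIN_ACCURACY


def test_model_respects_product_interface(candidate_model_and_data):
    # The serving layer expects predict() to return one label per input row.
    model, X_test, _ = candidate_model_and_data
    assert len(model.predict(X_test)) == len(X_test)
```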
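For the monitoring point, one simple building block is a statistical comparison of live feature distributions against the training reference. The sketch below uses a two-sample Kolmogorov–Smirnov test from SciPy; the alert threshold and the synthetic data are illustrative, not production values:

```python
# Drift-detection sketch: compare a live feature sample against the training
# reference with a two-sample Kolmogorov-Smirnov test. The p-value threshold
# and the synthetic data are illustrative only.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)
reference = rng.normal(loc=0.0, scale=1.0, size=10_000)  # training-time feature
live = rng.normal(loc=0.5, scale=1.0, size=1_000)        # shifted production feature

statistic, p_value = ks_2samp(reference, live)
if p_value < 0.01:
    # In a real deployment this would raise an alert (e.g. push a metric to
    # the monitoring stack) rather than print.
    print(f"Possible data drift detected (KS={statistic:.3f}, p={p_value:.3g})")
```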
The Infrastructure
Before deploying the tools needed to satisfy the features above, a platform to support them must be chosen. We aim to deploy the whole project on ELCA's infrastructure while supporting resource sharing and scaling, so ELCA's OpenShift (Red Hat's packaged distribution of Kubernetes) was the natural choice for the job.
Using OpenShift (or another Kubernetes-based containerization solution) already covers Scalability, by controlling the number of deployed replicas (via Pod autoscaling), and part of Continuous Deployment, by ensuring zero downtime when new models are deployed with rolling updates.
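As an illustration of both mechanisms, the following sketch uses the official Kubernetes Python client to create a Horizontal Pod Autoscaler and to trigger a rolling update by patching a Deployment's image. All names, the namespace, and the replica limits are placeholders, and on OpenShift the same operations could equally be performed with `oc`:

```python
# Sketch using the official Kubernetes Python client: autoscale a model server
# and roll out a new model image with zero downtime. The namespace, names,
# image, and replica limits are all placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

NAMESPACE = "mlops-demo"     # hypothetical
DEPLOYMENT = "model-server"  # hypothetical

# Scalability: a HorizontalPodAutoscaler keeps the replica count between
# 2 and 10 based on average CPU utilization.
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name=DEPLOYMENT),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name=DEPLOYMENT
        ),
        min_replicas=2,
        max_replicas=10,
        target_cpu_utilization_percentage=75,
    ),
)
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(NAMESPACE, hpa)

# Continuous deployment: patching the container image triggers a rolling
# update, so the old model keeps serving until the new Pods are ready.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": DEPLOYMENT, "image": "registry.example.com/model-server:v2"}
                ]
            }
        }
    }
}
client.AppsV1Api().patch_namespaced_deployment(DEPLOYMENT, NAMESPACE, patch)
```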
To cover our storage needs for saving datasets and trained models, we use a MinIO S3 object storage server. MinIO is fully compatible with the Amazon S3 API, so it can easily be substituted with a managed solution (Google Cloud Storage, Amazon S3, Azure Blob Storage, …) or itself be deployed on other cloud providers.
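Because MinIO speaks the S3 protocol, the standard AWS SDK works against it unchanged. Here is a minimal boto3 sketch; the endpoint, credentials, bucket, and object keys are placeholders:

```python
# Minimal sketch of talking to MinIO through the standard S3 API with boto3.
# Endpoint, credentials, bucket, and keys are placeholders; swapping providers
# is just a matter of changing endpoint_url and credentials.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.example.internal:9000",  # hypothetical MinIO server
    aws_access_key_id="minio-access-key",
    aws_secret_access_key="minio-secret-key",
)

# Upload a trained model artifact and fetch it back, exactly as with AWS S3.
s3.upload_file("model.pkl", "models", "churn/v1/model.pkl")
s3.download_file("models", "churn/v1/model.pkl", "/tmp/model.pkl")
```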
With the MLOps infrastructure configured, we can start adding the building blocks of our solution one by one: processing data, training models, deploying and updating them, and finally monitoring the deployment.