Building an MLOps infrastructure on OpenShift – guest blogpost

January 6, 2022

Originally published by Nicolas Jomeau – republished with the author's approval

Most data science projects don’t pass the PoC phase and hence never generate any business value. In 2019, Gartner estimated that “through 2022, only 20% of analytic insights will deliver business outcomes”. One of the main reasons for this is undoubtedly that data scientists often lack a clear vision of how to deploy their solutions into production, how to integrate them with existing systems and workflows, and how to operate and maintain them. This is where MLOps comes into play, as the supposed answer to all problems regarding the industrialization of machine learning solutions.

By applying DevOps principles to common data science work, MLOps aims to increase automation in the machine learning lifecycle to improve reproducibility and transparency in the experimentation phase, enable reliable and efficient deployment of ML models in production, and render them observable. While there exists a myriad of integrated ML platforms such as Dataiku, DataRobot, or Azure MLOps, my internship at ELCA focused on exploring various open-source solutions to tackle the problem.

Photo by Stephen Dawson on Unsplash

According to ThoughtWorks’ research, a good MLOps infrastructure should possess multiple features:

  • A fast experiment cycle: a data scientist can quickly try new data processing techniques, model architectures, or parameters and have the results (model artifacts and metrics) logged in one place for easy benchmarking between experiments.
  • Continuous integration: an automated pipeline ensures that ML code passes unit tests and that trained models meet a minimum performance threshold and respect the product interface, to avoid breaking changes.
  • Continuous deployment: one click (if not zero) is enough to package, test, and immediately deploy a model into production. If it is an update to a previously deployed model, the update is done transparently to the user with no downtime.
  • Monitoring: once deployed, model performance and metrics on data drift are continuously monitored to detect anomalies early and automatically raise alerts if the model or infrastructure must be updated.
  • Scalability: the infrastructure scales according to the load the model is experiencing, to avoid over-provisioning and wasting costly computing resources, or under-provisioning and risking degraded performance.
The common MLOps workflow — schema by author
The Infrastructure

Before deploying the tools needed to satisfy the previously mentioned features, a platform to support them must be chosen. We aim to deploy the whole project on ELCA’s infrastructure while supporting resource sharing and scaling, so ELCA’s OpenShift (Red Hat’s packaged distribution of Kubernetes) emerged as the natural choice.

Using OpenShift (or another Kubernetes-based containerization solution) already covers Scalability, by controlling the number of deployed servers via autoscaling Pods, and part of Continuous Deployment, by ensuring no downtime when deploying new models with rolling updates.

To cover our storage needs for datasets and trained models, we use a MinIO S3 object storage server. MinIO is fully compatible with the Amazon S3 API, so it can easily be substituted with a managed solution (GCP, Amazon S3, Azure, …) or itself be deployed on another cloud provider.
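
As a sketch of how such S3-compatible storage is consumed from Python, the snippet below points boto3 at a MinIO endpoint; the endpoint URL, credentials, and bucket/object names are illustrative placeholders, not the project's actual configuration.

```python
import boto3

# Point the standard AWS SDK at the in-cluster MinIO service instead of AWS.
# Endpoint, credentials, and bucket names below are illustrative placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id="minio-access-key",
    aws_secret_access_key="minio-secret-key",
)

# Upload a trained model artifact and list what is stored in the bucket.
s3.upload_file("model.pkl", "models", "demo/model-v1.pkl")
for obj in s3.list_objects_v2(Bucket="models").get("Contents", []):
    print(obj["Key"], obj["Size"])
```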

With the MLOps infrastructure configured, we can start adding one by one the building blocks of our solution to process data, train models, deploy and update them, and finally monitor the deployment.

The Tools

Feature Engineering, Machine Learning and Experiment Tracking
ClearML Platform — screenshot by the author

For the actual ML work, we decided to use the ClearML platform (v1.0), an all-in-one orchestration, data versioning, experiment tracking, and model repository solution. With as few as two additional lines in your Python code, you can enable ClearML to store versioned data, track metrics (and their associated plots), and save produced model binaries (pickle, h5, …) on the S3 server under a unique experiment name/ID. This data can then be reused by any other Python script simply by referencing this ID.
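
As an illustration, those two lines look roughly like the snippet below; the project name, task name, and output URI are placeholders rather than the project's actual configuration.

```python
from clearml import Task

# The two extra lines: ClearML then captures code, installed packages, metrics
# and produced model files automatically. Names and the S3 URI are placeholders.
task = Task.init(
    project_name="demo-project",
    task_name="train-baseline-model",
    output_uri="s3://minio:9000/models",  # store artifacts on the MinIO S3 server
)

# Optional explicit metric logging, shown here for completeness.
task.get_logger().report_scalar(
    title="validation", series="accuracy", value=0.93, iteration=0
)
```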

In ClearML jargon, a script is called a Task and has access to all data and metadata on the ClearML server. This means a form of automation can be added by having tasks observe other tasks, for example to test and validate their outputs (is the produced model good enough?) or to monitor newly added data (and trigger another data processing task).
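
As a sketch of such automation (the task ID, metric names, threshold, and tag below are hypothetical), a validation task could fetch a training task by its ID and inspect the metrics it reported:

```python
from clearml import Task

# Hypothetical validation task: look up a training task by ID and check that
# the metric it reported clears a minimum threshold before tagging it.
train_task = Task.get_task(task_id="<training-task-id>")
metrics = train_task.get_last_scalar_metrics()  # {title: {series: {"last": ...}}}

accuracy = metrics["validation"]["accuracy"]["last"]
if accuracy < 0.90:
    raise ValueError(f"Model accuracy {accuracy:.3f} is below the deployment threshold")

train_task.add_tags(["validated"])  # downstream automation can filter on this tag
```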

Finally, you can choose where your tasks are executed: either locally or on a remote server. This allows fine-grained resource management as well as benefiting from specialized hardware for specific tasks (e.g. GPUs for deep learning). Coupled with a monitoring task, cloud providers’ specialized virtual instances can be deployed on demand to run a specific task, speeding up work while keeping hardware spending to a strict minimum.
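
A minimal sketch of pushing a task to a remote worker, assuming a ClearML agent is listening on a queue named "gpu" (the queue and names are assumptions):

```python
from clearml import Task

task = Task.init(project_name="demo-project", task_name="train-on-gpu")

# When launched locally, stop here and enqueue the task on the (assumed) "gpu"
# queue; a ClearML agent attached to that queue re-runs the same script on the
# remote machine with identical code, packages and parameters.
task.execute_remotely(queue_name="gpu", exit_process=True)

# Everything below only runs on the remote worker.
```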

Continuous Integration and Deployment

After producing a satisfying model, we must package it in an application ready for deployment. This step uses two tools well known to any DevOps engineer: Jenkins and Docker.

Jenkins is an automation server for orchestrating tasks (obviously!). In this project, we use it to run a sequence of actions: unit-test the MLOps code, verify that the model we want to deploy passes performance checks, create a server that uses the model to run inference, package the server in a Docker image, run the server in a dedicated environment for automated integration tests, and finally trigger OpenShift’s rolling update. If any of these steps fails, the whole process is aborted. This ensures no faulty server can be deployed and minimizes the risk of errors.
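
As an illustration of the integration-test step (the URL, payload, and response schema below are placeholders), the pipeline could run a pytest suite like this against the dedicated test environment:

```python
import requests

# Hypothetical smoke tests executed by the pipeline against the test deployment.
BASE_URL = "http://ml-server-test.apps.example.com"

def test_health_endpoint():
    assert requests.get(f"{BASE_URL}/health", timeout=5).status_code == 200

def test_predict_respects_interface():
    payload = {"features": [0.1, 0.2, 0.3]}
    response = requests.post(f"{BASE_URL}/predict", json=payload, timeout=5)
    assert response.status_code == 200
    assert "prediction" in response.json()  # the contract clients rely on
```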

The server is a simple FastAPI web server that loads the model and exposes web APIs to run inference. We chose FastAPI as it allows us to fully control the server behavior and to easily add new features. For example, a design choice was to use a modular approach for the server: instead of having only an ML model, we can add modules for additional preprocessing (web inputs aren’t in the same format as the one used for training), outlier detection, or monitoring of specific features. As there is no free lunch, this high customizability comes at the cost of lower computational efficiency compared to serving-focused tools like Seldon or BentoML (batch inference, gRPC instead of HTTP, automatic GPU acceleration).
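
A minimal sketch of such a server follows; the model path, the scikit-learn-style predict() call, the module list, and the response schema are illustrative assumptions, not the project's actual code.

```python
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="ML inference server")

# Illustrative: the real image bundles the model fetched from ClearML/MinIO at
# build time; the path and model interface here are assumptions.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# "Modules" are plain callables chained before the model, e.g. preprocessing of
# raw web inputs or outlier detection; a single no-op placeholder is shown here.
modules = [lambda features: features]

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(request: PredictRequest):
    features = request.features
    for module in modules:
        features = module(features)
    prediction = model.predict([features])[0]
    return {"prediction": float(prediction)}

@app.get("/health")
def health():
    return {"status": "ok"}
```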

Monitoring
Monitoring of data and model drift on Grafana — screenshot by the author

But simply deploying a model behind a web API is not enough! To ensure it is working as expected, we need observability. Monitoring comes in two forms: logs (text) and metrics (numbers) describing events happening on the server. As we are mostly interested in performance metrics (time to inference, resource usage, data/prediction distributions), the former type of monitoring can be set aside.

One of the best-known metric-monitoring combinations, and the one we chose for this project, is the Prometheus/Grafana stack. It handles both the collection/aggregation and the display of metrics. Each deployed ML server records metrics as requests are received, and Prometheus collects them from a web endpoint. Using the PromQL query language, the metrics can then be filtered and aggregated to extract more detailed information (rate of requests, profiling of server inference, …). Grafana sits on top of the Prometheus database to create real-time, interactive visualizations from these metrics.
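
A sketch of how the ML server could expose such metrics with the prometheus_client library; the metric names and labels are assumptions for illustration.

```python
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI()

# Metric names and labels are illustrative; Prometheus scrapes them from /metrics.
REQUEST_COUNT = Counter("ml_requests_total", "Total inference requests", ["endpoint"])
REQUEST_TIME = Histogram("ml_request_seconds", "Time spent handling a request")

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    REQUEST_COUNT.labels(endpoint=request.url.path).inc()
    REQUEST_TIME.observe(time.perf_counter() - start)
    return response

# Endpoint polled by Prometheus.
app.mount("/metrics", make_asgi_app())
```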

Both of these tools provide automatic alerting when certain conditions are met. We used Grafana’s alerting system due to its ease of configuration through the web UI and the many available alerting channels: any alert can be received instantly by the people operating the ML project via a Slack bot, while other services (e.g. a status page or the auto-scaler) can rely on a Kafka message.

Wrapping it up
MLOps architecture — schema by author

Thanks to Kubernetes, the whole infrastructure can be deployed via a few automated scripts and is ready to be used in a matter of minutes.

The Data Scientist’s common workflow is to load data into ClearML, process it with Python code or more advanced means (SQL, Spark, …), and train a model (with some metrics for automated model validation) as well as some modules for the ML server. Each of these models/modules is stored on the ClearML server under a unique ID.

When a good model is found, the Project Manager can decide to deploy or update the current ML server by starting a Jenkins job. This job first runs unit tests on the code, downloads the model and modules, verifies their correctness through the tags set by the model validator, and finally packages them with FastAPI in a Docker image pushed to ELCA’s registry (JFrog Artifactory). The image is then deployed to a test environment, tested against some queries to verify API correctness, and then rolled out to the production environment with a rolling update to ensure no downtime.

Conclusion

Using open-source technologies, we have been able to create, and automate through DevOps, a system that manages the full life cycle of an ML model: from initial training to serving, including monitoring and retraining. The biggest challenge we encountered during this internship was the ever-moving state of the MLOps landscape: many tools have not yet reached maturity and may need to be re-evaluated in a few months.
