Dead-Simple ML-Ops on NVIDIA DGX & Kubernetes

May 19, 2020

Your organization has invested in the latest state-of-the-art NVIDIA DGX machines, and you have set up a Kubernetes cluster. However, the data science team keeps coming back to you, time and again, for support with each new version of a container they need. They also struggle to allocate resources optimally across their different experiment and training needs, despite the K8S setup, and utilization of the DGX machines is less than satisfactory.

Sound familiar? To us, it certainly does. This is a situation we see time and again when we engage with new customers. In fact, in one such case, we witnessed over 50,000 (!) Docker containers hogging the RAM on a single DGX-2.

Well, look no further. Here’s a solution that’s dead simple to set up and will have your data science team productive and using your DGX investment to its fullest in no time, all while those emails and ticket requests directed to your team disappear.

Meet Allegro Trains

Meet Allegro Trains, an open source “automagical” experiment management, version control and ML-Ops solution. The solution is composed of a Python SDK, a Web UI, server and execution agents. It was designed to enable data scientists to easily track, manage and collaborate on their experiments while preserving their existing methods and practices.

Read more about it in the getting started section on the Trains GitHub page or refer to the full documentation.

Trains Agent, the Allegro Trains flexible ML-Ops orchestration module, certified by NVIDIA on DGX machines, was specifically built to address ML and DL R&D DevOps needs. Leveraging Trains Agent, you can set up a dynamic AI experiment cluster.

Some of the functionalities it comes with include:

- Easily add & remove machines from the cluster
- Reuse machines without the need for any dedicated containers or images
  - Combine GPU resources across any cloud and on-prem
  - No need for yaml/json/template configuration of any kind
  - User friendly UI
- Manageable resource allocation that can be used by researchers and engineers
- Flexible and controllable scheduler with priority support

The great thing about Trains Agent is that it runs on the standard NVIDIA Docker containers, so execution stays optimized for the hardware. At the same time, data scientists can execute jobs through Trains Agent with customized containers that reflect their specific execution environments, all from a simple Web UI and without needing any support from DevOps. Finally, queues and priorities are also managed through the Web UI, once again by the data science team itself, without any need for DevOps support.

Sounds interesting? Setting up Allegro Trains to try it out is easy. Below, we will go through a short tutorial on how to set up an Allegro Trains environment for an NVIDIA DGX plus K8S environment so that your data science team can get their ML-Ops needs answered and give your team peace of mind.

Setting Up Allegro Trains

The heart of Allegro Trains is the Trains Server, the backend service infrastructure that enables multiple data scientists to collaborate and manage their experiments in one location.

In order to host your own server, you will need to install the Trains Server and then make sure all the Allegro Trains instances point to it.

Trains Server contains the following components:

- The Trains Web-App, a single-page UI for experiment management and browsing
- RESTful API for:
  - Documenting and logging experiment information, statistics and results
  - Querying experiment history, logs and results
- Locally-hosted file server for storing images and models, making them easily accessible using the Trains Web-App

Follow the instructions below to add and deploy Trains Server (and Trains Agents) to your Kubernetes clusters using Helm:

First, make sure the following prerequisites are met:

- A Kubernetes cluster
- ‘kubectl’ is installed and configured (see Install and Set Up kubectl in the Kubernetes documentation)
- ‘helm’ is installed (see Installing Helm in the Helm documentation)
- One node labeled ‘app: trains’

Important: Trains Server deployment uses node storage. If more than one node is labeled as ‘app: trains’ and you redeploy or update later, then Trains Server may not locate all of your data.
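
Given the warning above, it can be worth confirming that exactly one node carries the label before you deploy. Here is a minimal sketch, assuming you captured the output of ‘kubectl get nodes -l app=trains -o json’ as a string; the function names are ours and are not part of Trains or kubectl:

```python
import json


def labeled_node_names(kubectl_json: str) -> list:
    """Return node names from the JSON printed by
    `kubectl get nodes -l app=trains -o json`."""
    node_list = json.loads(kubectl_json)
    return [item["metadata"]["name"] for item in node_list.get("items", [])]


def check_single_trains_node(kubectl_json: str) -> str:
    """Return the single labeled node's name, or raise if there are 0 or 2+."""
    names = labeled_node_names(kubectl_json)
    if len(names) != 1:
        raise ValueError(
            f"expected exactly one node labeled app=trains, "
            f"found {len(names)}: {names}"
        )
    return names[0]
```

You could feed it the output of ‘subprocess.run(["kubectl", "get", "nodes", "-l", "app=trains", "-o", "json"], capture_output=True, text=True).stdout’ on a machine where ‘kubectl’ is configured.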

Set the required Elastic configuration for Docker

1. Connect to the node you labeled as ‘app=trains’

If your system contains a ‘/etc/sysconfig/docker’ Docker configuration file, add the options in quotes to the available arguments in the ‘OPTIONS’ section:

'OPTIONS="--default-ulimit nofile=1024:65536 --default-ulimit memlock=-1:-1"'

Otherwise, edit ‘/etc/docker/daemon.json’ (if it exists) or create it (if it does not exist).
Add or modify the ‘default-ulimits’ section as shown below. Be sure the ‘default-ulimits’ section contains the ‘nofile’ and ‘memlock’ sub-sections and values shown.
Note: Your configuration file may contain other sections. If so, confirm that the sections are separated by commas (valid JSON format). For more information about Docker configuration files, see Daemon configuration file, in the Docker documentation.

The default values required by Trains Server are (JSON):

    {
        "default-ulimits": {
            "nofile": {
                "name": "nofile",
                "hard": 65536,
                "soft": 1024
            },
            "memlock":
            {
                "name": "memlock",
                "soft": -1,
                "hard": -1
            }
        }
    }
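
If ‘daemon.json’ already contains other sections, editing it by hand is where stray commas tend to creep in. As a hedged alternative, the sketch below merges the values above into whatever is already in the file and leaves the other sections alone. The helper names are ours, and it assumes the file is plain JSON with no comments:

```python
import json
from pathlib import Path

# Values copied from the Trains Server requirements listed above.
REQUIRED_ULIMITS = {
    "nofile": {"name": "nofile", "hard": 65536, "soft": 1024},
    "memlock": {"name": "memlock", "soft": -1, "hard": -1},
}


def merge_ulimits(config: dict) -> dict:
    """Return a copy of a daemon.json dict with the required ulimits set,
    keeping every other section (and any other ulimit) untouched."""
    merged = dict(config)
    ulimits = dict(merged.get("default-ulimits", {}))
    ulimits.update(REQUIRED_ULIMITS)
    merged["default-ulimits"] = ulimits
    return merged


def update_daemon_json(path: str) -> None:
    """Rewrite the file at `path` in place, creating it if it is missing."""
    target = Path(path)
    text = target.read_text() if target.exists() else ""
    current = json.loads(text) if text.strip() else {}
    target.write_text(json.dumps(merge_ulimits(current), indent=4) + "\n")
```

Try it on a copy of the file first; writing to ‘/etc/docker/daemon.json’ itself needs root, and Docker only picks up the change after the restart in step 3.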

2. Set the Maximum Number of Memory Map Areas

Elastic requires that the vm.max_map_count kernel setting, which is the maximum number of memory map areas a process can use, is set to at least 262144.

For CentOS 7, Ubuntu 16.04, Mint 18.3, Ubuntu 18.04 and Mint 19.x, we tested the following commands to set vm.max_map_count:

echo "vm.max_map_count=262144" > /tmp/99-trains.conf
sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf
sudo sysctl -w vm.max_map_count=262144

For information about setting this parameter on other systems, see the Elastic documentation.
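
To confirm the setting took effect, you can read the value back. On Linux the kernel exposes it at ‘/proc/sys/vm/max_map_count’; the sketch below reads that file and compares it with the minimum quoted above. The function names are illustrative only:

```python
from pathlib import Path

REQUIRED_MAX_MAP_COUNT = 262144  # minimum stated above for Elastic


def read_max_map_count(proc_path: str = "/proc/sys/vm/max_map_count") -> int:
    """Read the current kernel value (Linux exposes it under /proc/sys)."""
    return int(Path(proc_path).read_text().strip())


def max_map_count_ok(current: int, required: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """True when the current value meets the minimum Elastic asks for."""
    return current >= required


if __name__ == "__main__":
    value = read_max_map_count()
    status = "OK" if max_map_count_ok(value) else "too low"
    print(f"vm.max_map_count={value} ({status})")
```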

3. Restart docker:

sudo service docker restart

Add Trains Server to the Kubernetes Cluster Using Helm

1. Fetch the ‘trains-server’ helm chart to your local directory:

helm fetch https://helm.ngc.nvidia.com/partners/charts/trains-chart-0.14.1+1.tgz

By default, the ‘trains-server’ deployment uses storage on a single node (labeled ‘app=trains’).
To change the type of storage used (for example, to NFS), see the Configuring trains-server storage for NFS section below.
By default, one (1) instance of Trains Agent is created in the ‘trains’ Kubernetes cluster. This should be sufficient for most workloads.
To change this setting, create a local ‘values.yaml’ file as specified in the Configuring Trains Agents in your cluster section.

2. Install ‘trains-server-chart’ on your cluster:

helm install trains-chart-0.14.1+1.tgz --namespace=trains --name trains-server

Alternatively, if you’ve created a local ‘values.yaml’ file, use:

helm install trains-chart-0.14.1+1.tgz --namespace=trains --name trains-server --values values.yaml

A ‘trains’ namespace is created in your cluster and trains-server is deployed in it.
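
A note on Helm versions: the ‘--name’ flag used above is Helm 2 syntax. To the best of our knowledge, Helm 3 takes the release name as the first positional argument instead, and only creates the namespace when ‘--create-namespace’ (available from Helm 3.2) is passed. The sketch below simply assembles the argument list for either version so the difference is easy to see. The function and its ‘helm_major’ parameter are illustrative and not part of Trains or Helm:

```python
def helm_install_args(chart, release="trains-server", namespace="trains",
                      values_file=None, helm_major=2):
    """Build the `helm install` argument list for Helm 2 or Helm 3."""
    if helm_major == 2:
        # Helm 2: release name is given with --name.
        args = ["helm", "install", chart,
                "--namespace", namespace, "--name", release]
    else:
        # Helm 3: release name comes first; the namespace is not created
        # automatically unless --create-namespace is passed.
        args = ["helm", "install", release, chart,
                "--namespace", namespace, "--create-namespace"]
    if values_file:
        args += ["--values", values_file]
    return args
```

The result can be handed to ‘subprocess.run(args, check=True)’, or joined with spaces and pasted into a shell.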

Network Configuration – Port Mapping

After trains-server is deployed, the services expose the following node ports:
* API server on ‘30008’
* Web server on ‘30080’
* File server on ‘30081’
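
Before wiring up a load balancer, a quick way to see whether these ports answer is a plain TCP connection test. This is a minimal sketch using only the Python standard library; it checks that something is listening and says nothing about whether the service behind the port is healthy. The names are ours:

```python
import socket

# Node ports listed above.
TRAINS_NODE_PORTS = {"api": 30008, "web": 30080, "files": 30081}


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_trains_ports(node_host: str, ports: dict = TRAINS_NODE_PORTS) -> dict:
    """Map each service name to whether its node port accepts connections."""
    return {name: port_is_open(node_host, port) for name, port in ports.items()}
```

Calling ‘check_trains_ports("<node address>")’ with the address of one of your cluster nodes returns a dictionary with one True/False entry per service.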

For more information on setting the Elastic configuration on Docker, see Notes for production use and defaults.

Accessing the Trains Server

Access trains-server by creating a load balancer and domain name with records pointing to the load balancer.
Once you have a load balancer and domain name set up, follow these steps to configure access to trains-server on your k8s cluster:

Create domain records

Create 3 records to be used for Web-App, File server and API access using the following rules:

‘app.<your domain name>’
‘files.<your domain name>’
‘api.<your domain name>’

For example: ‘app.trains.mydomainname.com’, ‘files.trains.mydomainname.com’ and ‘api.trains.mydomainname.com’.

Point the records you created to the load balancer
Configure the load balancer to redirect traffic coming from the records you created:

‘app.<your domain name>’ should be redirected to k8s cluster nodes on port ‘30080’
‘files.<your domain name>’ should be redirected to k8s cluster nodes on port ‘30081’
‘api.<your domain name>’ should be redirected to k8s cluster nodes on port ‘30008’
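
The record-to-port rules above are easy to mix up, so here they are as a small lookup. This is only a sketch of the mapping described in this section; how you enter the rules depends on your load balancer, and the names below are ours:

```python
# Subdomain prefix -> node port, as listed above.
SUBDOMAIN_NODE_PORTS = {"app": 30080, "files": 30081, "api": 30008}


def load_balancer_rules(domain: str) -> dict:
    """Return {hostname: node_port} for the three Trains records."""
    domain = domain.strip(".")
    return {f"{sub}.{domain}": port for sub, port in SUBDOMAIN_NODE_PORTS.items()}


if __name__ == "__main__":
    for host, port in load_balancer_rules("trains.mydomainname.com").items():
        print(f"{host} -> k8s nodes, port {port}")
```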

Configuring Trains Agents in your cluster

In order to create ‘trains-agent’ instances as part of your deployment, create or update your local ‘values.yaml’ file.
This ‘values.yaml’ file should be used in your ‘helm install’ command or ‘helm upgrade’ command.
The file must contain the following values in the ‘agent’ section:

- ‘numberOfTrainsAgents’: controls the number of trains-agent pods to be deployed. Each agent pod will listen for and execute experiments from the trains-server
- ‘nvidiaGpusPerAgent’: defines the number of GPUs required by each agent pod
- ‘trainsApiHost’: the URL used to access the trains API server, as defined in your load-balancer (usually ‘https://api.<your domain name>’)
- ‘trainsWebHost’: the URL used to access the trains webservice, as defined in your load-balancer (usually ‘https://app.<your domain name>’)
- ‘trainsFilesHost’: the URL used to access the trains fileserver, as defined in your load-balancer (usually ‘https://files.<your domain name>’)

Additional optional values in the ‘agent’ section include:

- ‘defaultBaseDocker’: the default docker image used by the agent running in the agent pod in order to execute an experiment. Default is ‘nvidia/cuda’.
- ‘agentVersion’: determines the specific agent version to be used in the deployment, for example ‘"==0.13.3"’. Default is ‘null’ (use latest version)
- ‘trainsGitUser’ / ‘trainsGitPassword’: Git credentials used by ‘trains-agent’ running the experiment when cloning the Git repository defined in the experiment, if defined. Default is ‘null’ (not used)
- ‘awsAccessKeyId’ / ‘awsSecretAccessKey’ / ‘awsDefaultRegion’: AWS account info used by ‘trains’ when uploading files to AWS S3 buckets (not required if only using the default ‘trains-fileserver’). Default is ‘null’ (not used)
- ‘azureStorageAccount’ / ‘azureStorageKey’: Azure account info used by ‘trains’ when uploading files to MS Azure Blob Service (not required if only using the default ‘trains-fileserver’). Default is ‘null’ (not used)

For example, the following ‘values.yaml’ file requests 4 agent instances in your deployment (see chart-example-values.yaml):

agent:
  numberOfTrainsAgents: 4
  nvidiaGpusPerAgent: 1
  defaultBaseDocker: "nvidia/cuda"
  trainsApiHost: "https://api.trains.mydomain.com"
  trainsWebHost: "https://app.trains.mydomain.com"
  trainsFilesHost: "https://files.trains.mydomain.com"
  trainsGitUser: null
  trainsGitPassword: null
  awsAccessKeyId: null
  awsSecretAccessKey: null
  awsDefaultRegion: null
  azureStorageAccount: null
  azureStorageKey: null
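
If you generate this file from a script, a small check for the required keys can catch a missing URL before ‘helm install’ does. The sketch below builds the ‘agent’ section for a given domain, reports any required key that is unset, and renders a flat section as text. It relies on the fact that JSON scalars are also valid YAML scalars, and it does not handle nested values. All function names are hypothetical:

```python
import json

# The keys this article lists as required in the 'agent' section.
REQUIRED_AGENT_KEYS = (
    "numberOfTrainsAgents",
    "nvidiaGpusPerAgent",
    "trainsApiHost",
    "trainsWebHost",
    "trainsFilesHost",
)


def agent_values_for(domain: str, agents: int = 1, gpus_per_agent: int = 1) -> dict:
    """Build the required 'agent' values from the usual subdomain scheme."""
    return {
        "numberOfTrainsAgents": agents,
        "nvidiaGpusPerAgent": gpus_per_agent,
        "trainsApiHost": f"https://api.{domain}",
        "trainsWebHost": f"https://app.{domain}",
        "trainsFilesHost": f"https://files.{domain}",
    }


def missing_agent_keys(agent: dict) -> list:
    """Return the required keys that are absent or set to None."""
    return [key for key in REQUIRED_AGENT_KEYS if agent.get(key) is None]


def render_values_yaml(agent: dict) -> str:
    """Render a flat 'agent' section as YAML text (scalars only)."""
    lines = ["agent:"]
    for key, value in agent.items():
        lines.append(f"  {key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"
```

For example, ‘render_values_yaml(agent_values_for("trains.mydomain.com", agents=4))’ produces the required part of the file shown above.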

Configuring trains-server storage for NFS

The trains-server deployment uses a ‘PersistentVolume’ of type ‘HostPath’, which uses a fixed path on the node labeled ‘app: trains’.
The existing chart supports changing the volume type to ‘NFS’ by setting the ‘use_nfs’ value and configuring the NFS persistent volume using additional values in your local ‘values.yaml’ file:

storage:
  use_nfs: true
  nfs:
    server: "<nfs-server-ip-address>"
    base_path: "/nfs/path/for/trains/data"

Additional Configurations for trains-server

You can also configure the trains-server for:
* fixed users (users with credentials)
* non-responsive experiment watchdog settings

Summary

Kubernetes is an elegant, productive solution for standard software orchestration, but for ML-Ops it is wanting. With Allegro Trains on top of Kubernetes, you get a long list of added benefits … for free:

- A Web UI for job scheduling that is disconnected from the K8S scheduler
  - Simplified resource scheduling leveraging a fixed set of DevOps-decided resource combinations: one queue per resource (1xGPU / 2xGPU, low memory, high memory)
  - Flexibility and transparency: move jobs between queues, change job order and cancel jobs, all from the Web UI
  - Web monitoring: per-machine GPU/CPU monitoring dashboard with historical data
- Security: the scheduler is detached from K8S security, so users of the scheduler do not need K8S credentials to use the cluster!
- Zero container maintenance (it’s all taken care of by trains-agent)
- Production-ready system: easily build and export an experiment with its entire container and dependencies

Wrapping Up

There you have it! Point your colleagues from the data science team to Allegro Trains and the Allegro Trains Server you configured. You can also point them to other resources for support, such as our Slack channel and full documentation.

Then, sit back, relax and watch the utilization of your AI cluster go through the roof and your data science team’s productivity shoot to the sky. And enjoy that cup of coffee without worrying about that newfound quiet: your former data science team-related tickets aren’t coming your way anymore.

Hey Stranger.
Sorry to tell you that this post refers to an older version of ClearML (which used to be called Trains).

We haven’t updated this yet, so some commands may be different.