Step onto the Trains

May 27, 2020

An overview of the mechanics behind Allegro AI’s next-generation experiment management platform



Allegro Trains is designed to make ML/DL experiment management smooth, painless, and free of heavy DevOps overhead, and to make collaboration and automatic documentation built-in components of the development process. Here is our earlier post outlining the Six Key Benefits of working with Trains.

Trains Agent is a zero-configuration, fire-and-forget execution agent which, combined with trains-server, provides a full AI experiment cluster solution. Trains Agent enables data scientists (or DevOps) to control the execution of machine/deep learning experiments with minimal effort, eliminating the inevitable environment conflicts and ensuring a seamless development-to-training process. This post highlights the benefits you gain by using Trains Agent and describes how the Allegro Trains package and Trains Agent combine to optimize your workflow.

Stage One – Code Development & Debugging

This initial code-development stage is where you write, run, and debug your code on your development machine, as you normally would. This stage is all about correctness; the goal is to make sure that the code executes properly so the model can start training as expected.

Code already integrated with Trains will create an experiment in the Web-UI as it runs. This experiment instance stores the execution environment configuration together with a real-time collection of the run itself (stdout, Matplotlib, TensorBoard graphs, etc.).
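To make the idea of an "experiment entry" concrete, here is a toy sketch of the kind of environment snapshot such an entry holds. The field names are purely illustrative, not the actual Trains schema; in Trains the git and package fields are filled in automatically.

```python
# Toy sketch of the environment "snapshot" an experiment entry holds.
# Field names are illustrative, not the actual Trains schema.
import sys

def snapshot_environment():
    return {
        "script": sys.argv[0],             # entry point
        "argv": sys.argv[1:],              # command-line arguments
        "python": sys.version.split()[0],  # interpreter version
        "git_commit": None,                # Trains fills this from git
        "packages": {},                    # Trains fills this from pip
    }

snap = snapshot_environment()
```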


Stage Two – Remote execution

You could, of course, execute and monitor an entire model training session manually, but you'd probably prefer to launch it on another machine as quickly as possible, so you can continue developing your code on your own machine.

Stage Two is where we launch that remote training session, which can be done easily, and in its entirety, from the web UI. Cloning an experiment in the UI copies the experiment's environment configuration, after which we can change its settings (arguments, packages, etc.). Once we are done, we send it for execution by lining up the experiment in a job queue. A Trains Agent running patiently on a remote machine will detect the experiment, automatically retrieve it, set up the environment on the remote machine, and start the execution.
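The queue mechanic can be pictured with a toy sketch: the UI enqueues a cloned experiment, and an agent polls the queue, restores the recorded environment, and runs it. Everything below is an illustrative simplification, not the trains / trains-agent API.

```python
# Toy sketch of the enqueue/agent loop. Not the real Trains API.
from collections import deque

job_queue = deque()

def enqueue(experiment):
    job_queue.append(experiment)

def agent_step():
    """One polling iteration of a (very) simplified agent."""
    if not job_queue:
        return None
    experiment = job_queue.popleft()
    # A real agent would now recreate the recorded environment
    # (git checkout, pip install) before launching the script.
    experiment["status"] = "running"
    return experiment

enqueue({"name": "mnist baseline (clone)", "status": "queued"})
job = agent_step()
```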

The obvious next step is to add automation, so that we can test various hyper-parameters, data augmentations, and more, zeroing in on and implementing the best combination for training our model. (Note: We refer to these automation scripts as meta-learning strategies, and we will have a follow-up post explaining how you can use our strategy examples to create state-of-the-art AutoML, custom-tailored for a specific use case.)
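A minimal sketch of such an automation loop, under the assumption that each variant is a clone of a base configuration with some hyper-parameters overridden (in Trains, each variant would be a cloned, enqueued experiment; the helper names here are hypothetical):

```python
# Toy sketch of a hyper-parameter sweep: clone a base configuration
# with different overrides and collect the variants to be queued.
from itertools import product

base = {"lr": 0.01, "batch_size": 32}

def clone_with(overrides):
    """Hypothetical helper: copy the base config and apply overrides."""
    cfg = dict(base)
    cfg.update(overrides)
    return cfg

variants = [
    clone_with({"lr": lr, "batch_size": bs})
    for lr, bs in product([0.1, 0.01], [32, 64])
]
```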


Experiment configuration explained

In order to move seamlessly from Stage One to Stage Two, Trains needs to “record” the development environment transparently, so it does not interfere with your development workflow. In Stage Two, this “snapshot” of the environment is replicated on a different machine to execute the identical code.

Let’s dive a bit deeper and review the list of parameters and components that Trains records during Stage One, the development stage:

  • Code Base
    • Git Repository / Git Commit
    • Uncommitted git changes
    • Or Jupyter Notebook as python script
  • Python environment
    • Python packages & versions
  • Command line arguments (as passed by ArgParse)
  • Configuration
    • Specified Configuration File
    • Specified Internal configuration arguments
  • Performance Graphs
    • Automatically log tensorboard & matplotlib
    • Stdout
  • Artifacts
    • Automatically log loaded & stored models
    • Specified input data files
    • Store specified output artifacts on central storage

Next, let’s review how the Trains package collects and records the information on our environment. Then we’ll dive into the Trains Agent “remote execution” phase to understand how the environment is restored and how definitions and settings are overridden.


The Code

Trains supports three different setups for recording the code:

  1. GIT repository (code repository)
    In this case, the Trains package will store the repository URL, commit ID, and full uncommitted changes (in git-diff format) in the experiment entry (these elements are stored as plain text). The git reference is collected in real time while the code executes on the development machine.
  2. Jupyter Notebook
    If you are using Jupyter Notebook, Trains will continuously convert the latest notebook checkpoint into a Python script and store it as part of the experiment entry. This Python script can later be executed on any machine by *Trains-agent*, eliminating the need to manually set up another Jupyter session.
  3. Standalone Script
    If you are executing a single Python script, Trains will store its content as-is, as part of the experiment entry. This option allows you to quickly scale automation and simple training scripts without tying them to specific code repositories, and is probably the easiest way to write and scale such scripts.
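The git case above can be pictured with a small sketch of what a code snapshot could capture. Trains does this automatically; the helper below is illustrative only, and degrades gracefully when run outside a git repository.

```python
# Toy sketch of capturing a git-based code snapshot
# (Trains does this automatically; names here are illustrative).
import subprocess

def capture_git_state():
    def run(*args):
        try:
            return subprocess.check_output(
                ["git", *args], stderr=subprocess.DEVNULL
            ).decode().strip()
        except (OSError, subprocess.CalledProcessError):
            return None  # not a git repo, or git not installed

    return {
        "remote": run("config", "--get", "remote.origin.url"),
        "commit": run("rev-parse", "HEAD"),
        "diff": run("diff", "HEAD"),  # uncommitted changes, plain text
    }

state = capture_git_state()
```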


Python Environment

When replicating a Python execution environment, the most important aspect is probably the Python packages and their specific versions. This is why Trains automatically detects the specific Python packages your code uses and stores them as an embedded part of the experiment entry.
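One way to picture this detection: inspect which top-level modules the process has actually imported and record a version where one is exposed. This is a toy sketch of the idea, not Trains’ real implementation (which resolves packages against pip).

```python
# Toy sketch of used-package detection. Not Trains' real implementation.
import sys

def detect_used_packages():
    packages = {}
    for name, module in list(sys.modules.items()):
        if "." in name or module is None:
            continue  # skip submodules and lazy import placeholders
        version = getattr(module, "__version__", None)
        if version:
            packages[name] = version
    return packages

used = detect_used_packages()
```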


Hyper-Parameters and Configuration File

Trains supports three different ways for storing configuration data:

  1. ArgParser
    As you probably know, argparse is a commonly used command-line options parser. If you use argparse, Trains will collect all the ArgParser arguments and their default/runtime values and store them in the experiment entry, in the hyper-parameter section. When Trains Agent runs an experiment, these values are overridden with the values stored in the experiment.
  2. Parameter Dictionary
    A dictionary of key/values can be connected with your experiment. Once connected, *trains* will store this dictionary as part of the hyper-parameter section of the experiment entry.
    Note that you can have multiple dictionaries connected with the same experiment, and from any part of your code. When Trains-Agent executes the experiment, the connected dictionary values will be overridden with the values stored in the experiment section in the UI; this allows you to change parameters within the code base without the need to create a proper, separate external interface for them.
  3. Configuration File
    In many cases, you’ll have an external configuration file controlling various arguments of the model setup and training process. Trains supports connecting such a configuration file and storing its content in the experiment entry. As with other connected configurations / hyper-params, this allows you to edit the configuration file in the UI. When Trains Agent remotely executes the experiment, the configuration file will be filled with the content stored in the UI instead of its original content. This feature allows for efficient model exploration without the need to create an external interface for the model configuration.
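The override mechanic shared by all three options can be sketched in a few lines: capture the local defaults (here via argparse), then let “server-side” values, as an agent would supply them, take precedence. The `connect` helper below is a hypothetical stand-in for the real behavior, not the Trains API.

```python
# Toy sketch of the hyper-parameter override mechanic.
# `connect` is a hypothetical stand-in, not the Trains API.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--epochs", type=int, default=10)
args = parser.parse_args([])  # use defaults, as on the dev machine

def connect(params, server_overrides=None):
    """Merge stored (UI-edited) values over the local defaults."""
    merged = dict(params)
    merged.update(server_overrides or {})
    return merged

local = vars(args)                      # values on the dev machine
remote = connect(local, {"lr": 0.001})  # agent-side override from the UI
```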


Input Data – Artifacts

Finally, let’s talk about a way to make the entire process more efficient.

  1. Model Weights
    Logging the initial weights used for the training process is always important, and it should, of course, happen automatically. Logging the output model weights/checkpoints is nice, but the real value comes from the ability to copy the locally created weight files to central storage (shared folder, S3, GS, etc.) and store a link to the shared model weights for later use, either in another training session or in production.
  2. Artifacts
    Artifacts are a general-purpose interface for storing and loading data from experiments, in order to log and document processes and connect them. An artifact can be a TFRecord file used as data input, a pre-processing feature-extraction output stored as a pickle file for use by other experiments, or a list of all the data files used in a specific training session, stored as a Pandas DataFrame.
    In short, an artifact can be whatever you think will help you debug your experiment, build a data processing pipeline, or streamline any other scaling process you might imagine.
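The “store centrally, keep a link” pattern described for model weights and artifacts can be sketched as follows. A local directory stands in for S3/GS/shared storage, and the paths and field names are illustrative, not the Trains API.

```python
# Toy sketch of artifact handling: serialize an object, copy it to
# "central storage", and keep only a link in the experiment entry.
import pickle
import shutil
import tempfile
from pathlib import Path

def upload_artifact(name, obj, storage_dir):
    local = Path(tempfile.mkdtemp()) / f"{name}.pkl"
    local.write_bytes(pickle.dumps(obj))
    dest = Path(storage_dir) / local.name
    shutil.copy(local, dest)  # stand-in for S3/GS/shared folder upload
    return {"name": name, "uri": str(dest)}  # link kept in the entry

storage = tempfile.mkdtemp()  # stand-in central storage location
entry = upload_artifact("features", {"mean": 0.5}, storage)
```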


As you can see, Allegro Trains integrates logically into your workflow and records every data component you need at each step … without complicating your development process. Quite the opposite, in fact; once implemented, you can focus on running, comparing and optimizing experiment outcomes, confident that your data is being stored for easy access, analysis and painless duplication the minute you need it.

Hey Stranger.
Sorry to tell you that this post refers to an older version of ClearML (which used to be called Trains).

We haven’t updated this yet, so some commands may be different.
As always, if you need any help, feel free to join us on Slack.