There is an inherent conflict in machine learning and deep learning workflows when moving from developing code to running it. Researchers usually need to turn to DevOps to execute their experiments, since managing machines, data and setups requires a different skill set and often lies within the DevOps division's responsibilities. However, even with the right skills or purview, DevOps teams are often overwhelmed by the ever-changing setup requirements that are at the heart of the AI experimentation process. This transactional aspect of the process often leads to delays, frustration, inter-departmental friction and wasted resources.
Trains Agent enables ML/DL developers (or DevOps) to control execution of machine/deep learning experiments with minimal effort, eliminating the conflict and ensuring a seamless development-to-training process. This post highlights the benefits you gain by using Trains Agent and describes how the Allegro Trains package and Allegro Trains Agent combine to optimize your workflow.
Two-Stage Workflow Solution
The Trains DevOps workflow is a two-stage solution that starts with recording the experiment environment and ends with automatically running the experiment, locally or remotely.
Stage One: Development
Write the code on your ‘development machine’ — this can be your laptop, desktop, a remote machine, etc. At this stage, you execute the code for the first time, debug it and set initial parameters for execution. This creates an initial experiment entry in the Trains system, which stores everything related to the execution environment of the experiment.
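In code, stage one usually amounts to a couple of extra lines in an existing training script. Here is a minimal sketch; the trains calls themselves are shown as comments because they require the trains package to be installed and a Trains server (or the public demo server) to be reachable, and the project, task and parameter names are purely illustrative:

```python
# Stage one on the development machine. With trains installed, creating the
# experiment entry looks like:
#
#   from trains import Task
#   task = Task.init(project_name="examples", task_name="first experiment")
#
# Plain dictionaries connected to the task are recorded as hyper-parameters
# and become editable later from the Web UI:
params = {"learning_rate": 0.01, "batch_size": 64, "epochs": 10}
#   task.connect(params)

# From here the script trains as usual; Trains records the execution
# environment (git commit, uncommitted diff, installed packages, console
# output) alongside these parameters.
print(len(params))
```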
Stage Two: Execution
You can now launch an experiment on one or more remote machines, either automatically or through the Trains Web UI.
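On each machine that should execute queued experiments, stage two boils down to a short setup fragment along these lines (the queue name is illustrative; `trains-agent init` walks you through server credentials interactively):

```shell
# One-time setup on a worker machine:
pip install trains-agent
trains-agent init                     # interactive server/credentials setup

# Turn the machine into a worker that pulls and executes queued experiments:
trains-agent daemon --queue default
```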
If you think of the second stage as a kind of Jenkins automation server for ML/DL, then the first stage is akin to an automagic way of creating the Jenkins script/yaml without actually writing it. We like to think of the Trains DevOps workflow as ‘Jenkins-on-steroids’ since it allows you to remotely execute your code, change all the arguments and create complex data processing pipes without actually having to change your code or add complex software infrastructure to an existing codebase.
Six Key Benefits for Your ML/DL DevOps Workflow
One-Click Multiple Remote Execution for In-Development Experiments
By combining Trains and Trains Agent, you can quickly move from running code on a development machine to automatically launching the same code, in its original execution environment, on multiple machines — all with a click of a button.
Trains records the execution environment in a way that allows Trains Agent to restore it on another (possibly remote) machine. Importantly, there’s no need to manually maintain pesky configuration files (YAML, json, etc.).
Your DevOps Workflow Optimized & Automated
The true power of Trains and Trains Agent integration comes from the ability to both automate the process and seamlessly allow you to change the experiment execution parameters. You can create a full data processing flow where the same processing
pipe can be repeated with a different data input, resulting in a new model.
The ML/DL ‘Jenkins-on-steroids’ described above is what allows you to quickly transition from coding to automation. This process enables you to launch experiments with different sets of parameters directly from the Web UI, or programmatically
automate the process by writing AutoML strategies and data piping scripts.
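As a sketch of such programmatic automation, the loop below builds a small hyper-parameter grid in plain Python and shows, in comments, how each combination could be launched on a clone of a previously recorded base experiment. The clone/enqueue calls need the trains package and a running server, and all names and values here are illustrative:

```python
from itertools import product

# Illustrative hyper-parameter grid for a sweep.
learning_rates = [0.1, 0.01, 0.001]
batch_sizes = [32, 128]

# One parameter set per grid point.
param_sets = [
    {"learning_rate": lr, "batch_size": bs}
    for lr, bs in product(learning_rates, batch_sizes)
]

# With trains installed and a base experiment already recorded, each set
# could be launched on its own clone, e.g.:
#
#   from trains import Task
#   base = Task.get_task(project_name="examples", task_name="base experiment")
#   for i, params in enumerate(param_sets):
#       cloned = Task.clone(source_task=base, name="grid point %d" % i)
#       cloned.set_parameters(params)
#       Task.enqueue(cloned, queue_name="default")

print(len(param_sets))  # 3 learning rates x 2 batch sizes = 6 clones
```

Because the clones simply land in a queue, the same worker pool started with `trains-agent daemon` drains the sweep with no extra infrastructure.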
Easily Tweak Experiments and Compare Results
Once you have an initial experiment created in stage one, you can replicate it and change part or all of the execution environment. The replicated experiment can then be executed on any machine by Trains Agent. All you have to do is clone the
base experiment, then change its parameters and enqueue it in the job queue for remote execution.
Using this flow, you can track all the changes made to the original experiment, compare the model performances based on the different execution configurations, and select the best performing setting for the next training round. This flow can also
be fully automated with the help of the Allegro Trains python package.
Work on Multiple Experiments At The Same Time
By providing this seamless transition from code development to execution at scale, Trains allows you to simultaneously develop your code and conduct a training process. Specifically, you can start model training, and whilst that is running you
can already be working on the next iteration of the codebase.
Experience Genuine Teamwork & Collaboration
The fifth, and for some perhaps the most valuable, benefit of using Trains Agent with Trains is the potential for teams to collaborate on deep learning and machine learning projects. Nowadays, traditional software development using feature
branches, code debugging and merging is easy thanks to version control systems. However, the machine and deep learning process is not a linear progression, which makes it challenging to collaborate during ongoing development.
Trains & Trains Agent make it easy to check the performance of other models used by your team, test someone else’s codebase with your data, or quickly replicate an entire environment to your machine.
While Trains displays your team’s progress, Trains Agent allows you to quickly grab a working model as an initial starting point for your training session, or clone a team member’s successful experiment and change its parameters to match your
training process. With Trains Agent, you can replicate an entire experiment environment to your machine with a single command, allowing you to quickly get on board with any project codebase.
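That single command is the agent's execute mode; a sketch of what it looks like, where the task ID is a placeholder you copy from the experiment page in the Web UI:

```shell
# Replicate a recorded experiment's environment and run it locally
# (replace <task-id> with the ID shown on the experiment page):
trains-agent execute --id <task-id>
```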
Optimize Your Resources
The Trains package provides resource performance metrics throughout the experiment’s execution, making them available as part of the experiment results: CPU/GPU load, memory usage, network and disk I/O are all logged and easily accessible from
the Web UI. This experiment-level resource monitoring helps you detect memory leaks, identify performance degradation or perform GPU optimization – especially useful for long training/inference processes.
Combining Trains Agent with Trains takes this a step further by not only providing the same information (and more) at the execution node level, but also by providing visibility into resource usage at the cluster level.
You will quickly spot idle workers or overcrowded queues, and can reorder execution or move experiments between the different queues/resources. Any slow-converging or ill-converging experiments can be located and aborted directly from the Trains UI, freeing machine resources for better-fated training sessions.
There you have it — six key benefits to streamline and manage your machine learning and deep learning DevOps workflow. Be sure to check out both Trains and Trains Agent, and leave us your feedback! We want to hear what you think.
Stay tuned for our next post which will present a detailed overview of the whole open-source solution suite — the full Trains ecosystem.