Huzzah! We officially welcome Allegro Trains Agent to the Allegro Trains ecosystem: a zero-configuration, fire-and-forget execution agent that bridges the gap between data scientists and DevOps. This complementary solution joins our open-source suite alongside Allegro Trains, the automagical open-source AI experiment & version control manager, and Allegro Trains Server, the backend service infrastructure. Allegro AI strives to empower data scientists, researchers and algorithm engineers to seamlessly run, track, reproduce and collaborate on successful machine learning (ML) and deep learning (DL) experiments. With the release of Trains Agent, our open-source suite provides a full AI cluster solution, with all the packaging and resource management handled.
Why Trains Agent?
At Allegro AI, we are proud to be taking part in the ongoing development of this community and love every question, request for a new feature and piece of feedback. Trains Agent is the outcome of a need we felt ourselves and saw echoing within the community — it is the solution researchers seem to be searching for when handling machine learning and deep learning workflow processing.
Trains Agent is built to provide resource control and autoML capabilities through a simple, flexible interface, eliminating the heavy lifting of machine learning and deep learning DevOps requirements.
True to the purpose of its creation, Trains Agent is simple to install and requires little maintenance.
The Background Behind Trains Agent
In modern software development, creating the software doesn’t require any special infrastructure. In general, post-development considerations such as running multiple production environments, scale, automation and CI/CD flows are handled by DevOps personnel. AI experimentation is different: not only is the process highly nuanced and complex, but there is also no single stage at which the researcher hands off the model to the DevOps team. It is a continuous, non-linear process that requires the researcher to stay involved.
With machine learning and deep learning workflows, the process gets more complicated. The hardware you write your code on is often not sufficient to execute it. Deep learning mostly needs GPUs, while machine learning mostly needs memory and CPU. In both cases, if you are working on your laptop, for example, effective development might require additional dedicated hardware.
The model training process compounds these individualized hardware needs. Traditional DevOps usually takes one service and manages its operation (e.g. scale, availability, etc.). In ML and DL, however, every experiment has different needs. Manually addressing each experiment’s needs, continuously creating individualized containers and Docker images, is often laborious and overwhelming.
Research vs. Manufacturing
As mentioned briefly above, the machine learning experiment process as a whole is not linear. In most software development scenarios, there is a linear pipeline where one develops, then packages the code and moves it to another team. In the machine learning workflow, there is a constant, almost endless back-and-forth between the data scientist and DevOps as each model is tweaked and complex code repositories are run. In the ideal scenario, the researcher should be able to freely perform experiments without external constraints, with no barrier between writing the machine learning or deep learning code and training the model with various sets of parameters — just as easily as you would hit your F9/Apple-R or any other compile/run combination on your favorite IDE.
Let’s not forget about the never-ending, tedious task of training with different parameters. You run the same experiment over and over, tweaking one parameter or another, testing out different hypotheses and hoping for the desired results. This process can, and should, be automated.
Some call this part autoML and others call it automation; the official technical term has yet to emerge (please do let us know what you call it).
(If you’re interested, here are Trains Agent autoML & orchestration examples.)
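To make the repeated parameter tweaking described above concrete, here is a minimal sketch in plain Python of what "the same experiment with different parameters" amounts to: expanding a small search space into one set of overrides per run. The function name `expand_grid` and the hyperparameter names are hypothetical, chosen for illustration only; this is not part of Trains or Trains Agent.

```python
from itertools import product


def expand_grid(search_space):
    """Return one override dict per combination of the listed values.

    `search_space` maps a hyperparameter name to the list of values to try.
    Names are sorted so the order of the results is deterministic.
    """
    names = sorted(search_space)
    value_lists = [search_space[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Hypothetical hyperparameters, for illustration only.
search_space = {
    "learning_rate": [0.1, 0.01],
    "batch_size": [32, 64, 128],
}

variants = expand_grid(search_space)
for overrides in variants:
    print(overrides)
```

Each dictionary in `variants` describes one run of the experiment. A full grid grows quickly with the number of parameters, which is one reason people often sample it randomly instead of running every combination.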
How Trains and Trains Agent Seamlessly Run Your Code
When you run your experiments with Allegro Trains, all your code and experiment artifacts are recorded, including Python packages, program arguments, hyperparameters, source control origin, and uncommitted changes. Trains Agent can easily replicate the recorded environment, letting you tweak parameters, modify command-line arguments and even change Python packages or versions.
You can then send your machine learning or deep learning experiment for execution on multiple machines with different parameters, with one click straight from the UI. There’s no container packaging and no requirements.txt, and you don’t even need to commit your code. All of these experiments are logged by Trains, and you can see their logs, plots and scalars in the Trains UI.
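The same flow can also be driven from code rather than the UI. The sketch below is one possible way to do it, under several assumptions: the `trains` Python package is installed and configured against a Trains Server, an experiment has already been recorded (its ID is a placeholder here), and a Trains Agent is listening on a queue named `default`. The helpers `variant_name` and `enqueue_variants` are hypothetical names for this illustration; the `Task.get_task`, `Task.clone`, `get_parameters`, `set_parameters` and `Task.enqueue` calls come from the `trains` package, and their exact behavior may differ between versions.

```python
def variant_name(base_name, overrides):
    """Build a readable experiment name from the overridden parameters."""
    suffix = ", ".join(f"{key}={overrides[key]}" for key in sorted(overrides))
    return f"{base_name} [{suffix}]" if suffix else base_name


def enqueue_variants(template_task_id, variants, queue_name="default"):
    """Clone a recorded experiment once per override dict and queue each clone.

    Assumes a reachable Trains Server and an agent serving `queue_name`.
    """
    # Imported lazily so that variant_name() works without the trains package.
    from trains import Task

    template = Task.get_task(task_id=template_task_id)
    queued_ids = []
    for overrides in variants:
        cloned = Task.clone(
            source_task=template,
            name=variant_name(template.name, overrides),
        )
        # Start from the recorded parameters and change only the overrides.
        # Keys must match the parameter names as recorded for the experiment.
        parameters = cloned.get_parameters()
        parameters.update(overrides)
        cloned.set_parameters(parameters)
        Task.enqueue(cloned, queue_name=queue_name)
        queued_ids.append(cloned.id)
    return queued_ids


# Example call (placeholder ID; requires a running server and agent):
# enqueue_variants("<recorded-experiment-id>", [{"batch_size": 64}])
```

Cloning rather than re-running the script is what lets the agent rebuild the recorded environment on another machine; the clone carries the recorded code reference, packages and arguments with it, and only the overridden parameters change.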
In our next post, we list the key benefits of using Trains Agent and explain the two-stage workflow.