Introducing Services, Controllers and Applications
To date, Allegro Trains has provided the ability to easily create, manage, track and compare experiments, including the added benefit of providing remote execution, orchestration and management of GPU resources / pods. This enables users of Trains to easily scale development efforts from their local laptops/machines to distributed clusters running in different locations. The guiding principle: write Python code on your machine, run it anywhere with the click of a button. No need to worry about creating YAML files, dockerizing code, configuring another environment or messing around with Kubernetes.
However, developing AI solutions requires much more than just being able to track experiments. Once development scales, many additional tasks become important: managing archives of experiments, monitoring tasks, pipelining and more. To date you could implement your own such services on top of Trains – and indeed many of our users have done just that. We thought – wouldn’t it be cool if we could just add this capability out of the box?
Welcome: Services / Controllers / Applications!
Version 0.15 extends the notion of ‘experiments’ into generic ‘Tasks’. As with experiments, write your controller or monitoring code on your machine, let Trains record the environment details, and have the Trains Agent execute it on any remote machine. This lets you easily manage the distributed Trains environment and automate ML/DL processes.
You will now find yourself building monitoring tasks or periodic jobs (e.g. periodic cleanup) which you’ll need to remain online while your development process continues.
With automation, you are now creating and running programs that control the execution of other tasks.
These services and controllers are typically not CPU/GPU intensive, but still need to be deployed. To this end, Trains Server now boasts an instance of trains-agent running in the newly released Trains-Agent Services Mode, letting you easily launch any such service on your Trains Server as an additional Docker container (see Trains-Agent Services Mode). This gives you the ability to create services like the aforementioned periodic cleanup service and run them as part of the Trains Server setup, eliminating the need to spin up a dedicated CPU machine just for long-lasting services, monitoring applications, etc. You can always set up multiple instances of Trains Agent on multiple machines, each one connected to its own job execution queue.
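To make the idea of a lightweight, long-running service more concrete, here is a minimal sketch of what one pass of a periodic cleanup job might look like. It uses only the Python standard library: the `Experiment` record, `select_for_cleanup`, `run_cleanup_pass` and the `delete_fn` callback are hypothetical names made up for illustration and are not part of the Trains API. The 30-day default is likewise an assumption chosen to mirror a "older than one month" policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List


@dataclass
class Experiment:
    # Hypothetical stand-in for an experiment record; not a Trains class.
    exp_id: str
    archived: bool
    last_update: datetime


def select_for_cleanup(experiments: List[Experiment], now: datetime,
                       max_age_days: int = 30) -> List[Experiment]:
    """Return archived experiments last updated more than max_age_days ago."""
    cutoff = now - timedelta(days=max_age_days)
    return [e for e in experiments if e.archived and e.last_update < cutoff]


def run_cleanup_pass(experiments: List[Experiment], now: datetime,
                     delete_fn: Callable[[Experiment], None],
                     max_age_days: int = 30) -> List[str]:
    """Run one pass of the service and return the ids that were deleted."""
    stale = select_for_cleanup(experiments, now, max_age_days)
    for exp in stale:
        # In a real service this callback would remove the experiment
        # together with its artifacts and models (assumption).
        delete_fn(exp)
    return [exp.exp_id for exp in stale]


if __name__ == "__main__":
    now = datetime(2020, 6, 1)
    experiments = [
        Experiment("old-archived", True, datetime(2020, 4, 1)),
        Experiment("new-archived", True, datetime(2020, 5, 25)),
        Experiment("old-active", False, datetime(2020, 1, 1)),
    ]
    removed = []
    print(run_cleanup_pass(experiments, now, removed.append))
```

A long-running service would simply call `run_cleanup_pass` in a loop and sleep between passes, which is why such a job needs very little CPU yet has to stay online.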
Trains now provides two new example programs as a quick – and very useful in our humble opinion – introduction to Application Controllers and Trains-Agent Services. These have come from frequent requests / inquiries from you – the Trains user community:
- A cleanup service to archive old experiments and free up space on the files-server. This sample code will take any archived experiment that is older than one month and delete it and all of its accompanying artifacts/models from the Trains Server. The code also introduces a cool new feature (another contribution from the community): the ability to stop the execution of a Task on one (say, a local) machine, and restart it on another (say, a remote) machine. This gives you the ability to quickly create an experiment in Trains, and immediately launch it on a remote machine. In other words, this is the quickest way to take sample code and use Trains to run it without the need to import/export/edit YAML files 🙂
- Trains now provides a Hyper-Parameter optimizer module which we’ve used to create a Hyper-Parameter Optimizer service. Use this service to run Grid / Random / Hyperband Bayesian optimization on any experiment already in Trains: define a configuration space and compute / time budgets, then launch the entire thing from your machine, with all the orchestration taken care of for you. The Optimizer itself can be launched as a service, doing all the parameter sampling, launching and budgeting from a remote machine.
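For readers new to hyper-parameter search, the sketch below illustrates the two simplest strategies mentioned above, grid and random sampling over a configuration space, capped by a job budget. It is a plain standard-library illustration and not the Trains optimizer module: `grid_search`, `random_search`, `optimize`, `max_jobs` and the toy objective are all hypothetical names. In an actual service, evaluating a configuration would presumably mean launching an experiment and reading back a metric rather than calling a local function.

```python
import itertools
import random
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Space = Dict[str, List]   # parameter name -> list of candidate values
Params = Dict[str, object]


def grid_search(space: Space) -> Iterator[Params]:
    """Yield every combination of values in the configuration space."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def random_search(space: Space, seed: int = 0) -> Iterator[Params]:
    """Yield an endless stream of randomly sampled configurations."""
    rng = random.Random(seed)
    keys = sorted(space)
    while True:
        yield {k: rng.choice(space[k]) for k in keys}


def optimize(sampler: Iterator[Params],
             objective: Callable[[Params], float],
             max_jobs: int) -> Tuple[Optional[Params], float]:
    """Evaluate at most max_jobs configurations and keep the lowest score."""
    best_params, best_score = None, float("inf")
    for params in itertools.islice(sampler, max_jobs):
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


if __name__ == "__main__":
    space = {"lr": [0.1, 0.01, 0.001], "batch_size": [16, 32]}

    def toy_objective(p):
        # Made-up loss with its minimum at lr=0.01, batch_size=32.
        return abs(p["lr"] - 0.01) + abs(p["batch_size"] - 32) / 100

    print(optimize(grid_search(space), toy_objective, max_jobs=6))
    print(optimize(random_search(space, seed=1), toy_objective, max_jobs=4))
```

The budget is enforced in one place (`optimize`), so the same loop works with any sampling strategy that yields configurations.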
In follow-up posts we will expand on the usability of this awesome feature, and explore its explosive power on your workflows. If you’re interested in getting these posts directly to your inbox, sign up.
Additional Features & Bug Fixes
On top of the major updates to Trains, we’ve taken your feedback on highly sought features and annoying bugs and crossed them off our to-do list. Here are some highlights:
- We fixed a few standing issues with the experiment manager web UI.
- Added support for audio and video debug samples, giving you better visibility into the training process and making it easier to debug.
- Flexible experiment table UI control: Move and resize columns to create leaderboards sorted by any metric that comes to mind, and share them with anyone in your organization.
- Add tags to experiments for efficient organization. Example use cases could be to tag experiments as “pre-release”, “top performing”, etc., but tags could also be used to mark the resources used (“4xgpu”) or the data used (“dataset_v1”).
- We also extended tags to Models, so you can easily tag a model with the release version “v1.0” or model type “regression” / “classifier” or anything else that comes to mind.
- Continuing our efforts to create more structure in the chaos, we also added a few more Task types: inference, data_processing, monitor, service, controller, optimizer, qc and custom.
Thank you for your continued support and for your feedback and suggestions on our Slack channel and GitHub issues page. We continue to improve Trains and your feedback will only help us do a better job. To that end, we encourage you to fill out our super short anonymous survey.
Until next time chu chu
The Allegro AI team