Experiment management doesn’t have to be chaotic
Republished here with the author’s permission – original post on Medium here.
Authored by Ivan Ralašić.
Every researcher or machine learning enthusiast faces that well-known experiment management nightmare; it’s usually a rude awakening discovered at the beginning of one’s career. Here’s how it goes:
After hours of waiting for your model training to finish, you finally get the results, and… Eureka! The model works brilliantly. To verify the model’s performance, you compare it to previous experiments, probably by reviewing a giant, messy, manually created spreadsheet with the previous experiments’ details. This is where you typically discover a weakness, a flaw, an unexpected result. But what triggered it? You don’t know, of course, because there are so many variables to consider. You’re not sure which hyperparameters you used in which experiment, or even where the model weights have been stored.
You end up spending (well, wasting) more time on experiment management than you do on model development.
Now, some people are blessed with extraordinary organizational skills and successfully manage their experiments… while working on their own. But when another person joins you in your efforts to optimize an AI model, it’s not only your own organizational skills that matter. Again, you end up with a very complex, shared, experiment management scheme that just doesn’t scale.
Fortunately, at Forsight, we discovered a great open-source solution called Allegro Trains. (In music terminology, allegro means to play at a fast, bright tempo. Quite apt, as you’ll see.)
At Forsight, an early-stage startup, our mission is to turn construction sites into safe environments for workers. Forsight uses computer vision and machine learning to process real-time CCTV footage and help safety engineers monitor the proper use of personal protective equipment (PPE) to keep sites safe and secure. As you can imagine, we need to develop, compare, and refine a lot of models.
Allegro Trains helps Forsight by giving us comprehensive tools and an integrated solution to manage our production. With the Allegro Trains experiment management system, we got a long list of features with virtually no implementation work. After you install Trains with:
pip install trains
…and with only two additional lines of code in your project:
from trains import Task
task = Task.init(project_name="my project", task_name="my task")
…here’s the list of what you can expect to see, right out of the box:
- Git repository, branch, commit id, entry point, and local git diff
- Python environment settings (including specific packages & versions)
- stdout and stderr logs
- Resource Monitoring (CPU/GPU utilization, temperature, IO, network, etc.)
- Hyper-parameter logs:
  - ArgParser for command line parameters with currently used values
  - Explicit parameters dictionary
  - TensorFlow Defines (absl-py)
- Initial model weights file
- Model snapshots (with optional automatic upload to central storage like a shared folder or some cloud storage like S3, GS, Azure)
- Artifacts log & store (shared folder, S3, GS, Azure, Http)
- Tensorboard/TensorboardX scalars, metrics, histograms, images, audio and video
- Matplotlib & Seaborn
- Supported frameworks: PyTorch, TensorFlow, Keras, AutoKeras, XGBoost and Scikit-Learn (according to the product owners, MXNet is coming soon)
- Seamless integration (including version control) with Jupyter Notebook and PyCharm remote debugging
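To make a few of the items above concrete, here is a minimal sketch of a toy training script that touches three of them: command-line parameters, an explicit parameters dictionary, and a reported scalar. It is an illustration, not Forsight's actual code. The helper names (`build_parser`, `toy_loss`, `run_experiment`), the parameter values, and the made-up loss formula are all assumptions for this example; `Task.init`, `task.connect`, and `Logger.report_scalar` are the Trains calls used, and running `run_experiment` assumes you have a Trains server configured.

```python
import argparse


def build_parser():
    """Command-line hyperparameters (hypothetical names and defaults)."""
    parser = argparse.ArgumentParser(description="toy training run")
    parser.add_argument("--lr", type=float, default=0.001)
    parser.add_argument("--epochs", type=int, default=3)
    return parser


def toy_loss(epoch, lr):
    """Stand-in for a real training step: a made-up loss that shrinks each epoch."""
    return 1.0 / (1.0 + lr * 100 * (epoch + 1))


def run_experiment(argv=None):
    # Imported here so the helpers above can be used without Trains installed.
    from trains import Task

    # The same two lines shown earlier in the post.
    task = Task.init(project_name="my project", task_name="my task")

    # Parsed after Task.init so the argparse values can be picked up
    # as part of the task's hyperparameters.
    args = build_parser().parse_args(argv)

    # Explicit parameters dictionary, registered with the task.
    params = {"batch_size": 32, "optimizer": "sgd"}  # hypothetical values
    params = task.connect(params)

    # Report one scalar per epoch; it shows up as a plot for the task.
    logger = task.get_logger()
    for epoch in range(args.epochs):
        loss = toy_loss(epoch, args.lr)
        logger.report_scalar(title="loss", series="train", value=loss, iteration=epoch)
        print(f"epoch {epoch}: loss={loss:.4f}")
```

Calling `run_experiment(["--lr", "0.01", "--epochs", "5"])` would then run the toy loop and report its values under "my project" / "my task". Everything else in the list (git details, environment, stdout and stderr) is captured by `Task.init` without extra code.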
Instead of building an experiment management solution in-house, we opted to use this comprehensive Trains solution to boost the productivity and agility of our R&D team. The time that we usually spent on experiment management is now used to get the real work done: creating robust object detection models that can identify small objects like helmets in a complex construction site environment, as we optimize the PPE monitoring pipeline.
We think of Allegro Trains as the “git for data science”, and we actively use it that way. It enables collaboration, scalability, and reproducibility in ML experiments, all of which are crucial to developing a great ML-powered product.
If you want to try out the Allegro Trains experiment management toolkit yourself, try their free demo server here.