By Alberto Garcia, Senior Machine Learning Engineer, and Dr. Ksenia Yashina, Head of AI, Orbem
Orbem, a deep-tech startup at the forefront of industrial MRI, processes vast amounts of MRI data every day to classify objects like eggs or seeds on automated production lines. These specialized MRI scanners produce terabytes of data, and our AI models must handle everything in real time on edge devices – where every millisecond counts.

In such an environment, debugging training failures, optimizing inference speed, and making sure everything runs as expected in production can become overwhelming. We’ve spent late nights wrestling with SSH sessions, mismatched libraries, and GPU hiccups just to reconstruct a crashed training run. Tracing performance bottlenecks in production usually involved combing through hardware timelines and memory logs. It was messy.
ClearML Session has transformed this entire process. It is a ClearML feature that launches JupyterLab, VS Code, or a plain SSH session on a remote machine that better fits the resource needs of the job, and it provides local links to the remote JupyterLab and VS Code instances over a secure, encrypted SSH connection. Instead of juggling manual setups, we now step into any past experiment to diagnose issues interactively and optimize models in production-like environments, all without the usual operational headaches.
Debugging AI Training Runs with ClearML Session
Training deep learning models for industrial MRI isn’t just about hitting high accuracy; we also need stable data versioning, stable training pipelines, and a strategy for dealing with unexpected errors. We’ve seen our share of training failures: unstable loss curves, mislabeled data, shape mismatches, and elusive data loader bugs that only appear in large-scale runs.

Before ClearML Session, debugging these problems was slow and frustrating. Logs were often missing details, environments might have changed, and GPUs were typically tied up with other tasks. We would waste time manually trying to reconstruct old experiments, hoping the same data and dependency versions were still available. Hardware-related issues like CUDA driver mismatches or memory overflows also behaved differently from machine to machine, so local debugging wasn’t always reliable.
Now, when a model crashes, we launch an interactive debugging session in the exact environment where it failed. One command brings back datasets, dependencies, and even uncommitted code changes, independent of local environment or hardware:
clearml-session --debugging-session <task-id>
Inside that session, we can step through code, inspect variables, and re-run partial training steps. We can even visualize mini-batches to catch mislabeled data early and profile GPU bottlenecks. Instead of spending hours reconstructing a failed run, we can fix the issue right away, often within minutes.
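As an illustration of the kind of mini-batch check one might run inside such a session, here is a minimal sketch in plain Python. It is not Orbem's actual tooling: the helper names `check_batch` and `label_histogram`, the expected shape, and the class count are all hypothetical, chosen only to show how shape mismatches and out-of-range labels can be surfaced quickly.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def check_batch(
    shapes: Sequence[Tuple[int, ...]],
    labels: Sequence[int],
    expected_shape: Tuple[int, ...],
    num_classes: int,
) -> List[str]:
    """Return a list of human-readable problems found in one mini-batch.

    `shapes` holds the shape of each sample (e.g. tensor.shape converted to
    a tuple); `labels` holds the integer class label of each sample.
    All names here are hypothetical and only serve the illustration.
    """
    problems: List[str] = []
    if len(shapes) != len(labels):
        problems.append(f"{len(shapes)} samples but {len(labels)} labels")
    for i, shape in enumerate(shapes):
        if tuple(shape) != tuple(expected_shape):
            problems.append(
                f"sample {i}: shape {tuple(shape)} != {tuple(expected_shape)}"
            )
    for i, label in enumerate(labels):
        if not 0 <= label < num_classes:
            problems.append(
                f"sample {i}: label {label} outside [0, {num_classes})"
            )
    return problems


def label_histogram(labels: Sequence[int]) -> Dict[int, int]:
    """Count how often each label occurs, to spot skewed or empty classes."""
    return dict(Counter(labels))


if __name__ == "__main__":
    # Assumed toy batch: two single-channel 64x64 scans, two classes.
    batch_shapes = [(1, 64, 64), (1, 64, 32)]
    batch_labels = [0, 5]
    for problem in check_batch(batch_shapes, batch_labels, (1, 64, 64), 2):
        print(problem)
    print(label_histogram(batch_labels))
```

Running a check like this on the first few batches of a restored run is usually cheaper than waiting for the training loop to fail again deep into an epoch.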
Because every dataset version is tracked by ClearML’s dataset versioning, these debugging runs stay fully reproducible. If a particular dataset version was the culprit, we can relaunch a new session with that exact dataset. Model artifacts also remain linked to the experiments they came from, so we can retrieve a trained model and validate its performance in precisely the environment it was originally developed in.
ClearML Session also shines in how it orchestrates workflows across multiple compute environments. Whether we run on AWS, Google Cloud, Azure, or our local on-prem hardware, ClearML Session can launch an interactive debugging or development session on any connected instance. This flexibility helps us troubleshoot our training pipelines on different types of hardware.
Profiling, Optimizing, and Deploying Models at Scale
Once training is stable, our next challenge is to ensure the trained model can handle real-world performance requirements. Our MRI scanners process thousands of scans per hour, so even small inefficiencies can be expensive. Debugging sessions help us fix failures, but they also let us test and optimize the model on target hardware before rolling it out.
To profile inference performance on production-like hardware, we run:
clearml-session --queue gpu_queue_A100 --docker registry/nvidia_tensort:latest --session-name "Production GPU profiling"
This command automatically launches an interactive session matching our production environment. Using tools such as Nsight Systems or PyTorch Profiler, we can inspect kernel execution timelines and identify data-transfer or processing bottlenecks. Based on these insights, we experiment with optimizations like pinned memory (for faster data movement) and mixed-precision inference (for a speed/accuracy trade-off).
Once bottlenecks are resolved, we verify throughput and memory usage in the session to ensure the GPU is efficiently utilized. By feeding real data into ClearML Session, we confirm the model’s latency and resource requirements before it ever goes into production.
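A latency and throughput check of this kind can be sketched with nothing but the Python standard library. The `benchmark` helper below is a hypothetical illustration, not our production harness; it times any zero-argument callable, so the model call, batch size, and iteration counts are assumptions the reader would replace with their own.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(
    fn: Callable[[], object], warmup: int = 5, iters: int = 50
) -> Dict[str, float]:
    """Time `fn` repeatedly and report latency and throughput statistics.

    Note: GPU work is launched asynchronously, so when timing a GPU model
    `fn` should block until the result is ready (for example by calling
    torch.cuda.synchronize() inside it); otherwise only the launch overhead
    is measured.
    """
    if iters < 1:
        raise ValueError("iters must be at least 1")

    # Warm-up calls absorb one-time costs (lazy initialization, caching).
    for _ in range(warmup):
        fn()

    latencies_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    p95_index = min(len(latencies_ms) - 1, int(0.95 * len(latencies_ms)))
    total_s = sum(latencies_ms) / 1000.0
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[p95_index],
        "calls_per_s": iters / total_s if total_s > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a call that runs the model on one batch.
    stats = benchmark(lambda: sum(range(100_000)), warmup=3, iters=30)
    for name, value in stats.items():
        print(f"{name}: {value:.3f}")
```

Reporting a tail percentile alongside the mean matters on a production line, because a single slow scan can stall the conveyor even when the average latency looks healthy.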
Why Orbem Relies on ClearML Session
ClearML Session has become essential to how we develop AI models. We can debug failed training runs in minutes, catch performance bottlenecks early, and validate models on hardware that mirrors production. This saves us days of trial and error and spares us from wrestling with infrastructure. As a result, we can invest our time in actual innovation, training models to shed light on the world’s toughest challenges.
If you’ve ever struggled to reproduce a failed run or worried about whether your model meets production constraints, ClearML Session is worth exploring. Request a ClearML demo to learn more about ClearML’s AI infrastructure platform, which encompasses an AI Development Center to streamline AI/ML workflows, an Infrastructure Control Plane to manage GPU clusters and optimize compute utilization, and a GenAI App Engine to deploy GenAI models effortlessly.
No more “works on my machine” moments – only fast, scalable, and production-ready AI.
More about Orbem
At Orbem, we are shedding light on the world’s toughest challenges by unleashing AI-powered imaging for everything and everyone. For us, this means developing fast, accurate, and accessible imaging solutions that reveal hidden sources of knowledge.
Based on years of scientific research at the interface of AI and imaging technology, Orbem was founded in 2019 as a spin-off from the Technical University of Munich. Headquartered in Munich, our top-tier, international, diverse, and multidisciplinary team is imagining new frontiers every single day to build a sustainable and healthy future. Learn more at https://orbem.ai/ and https://orbem.ai/blog/.