By John Lambert, Developer; Matthew Shannon, Developer; and JoEllen Magnus, Business Analyst, SIL International
Client Overview
At SIL International, we serve every language community in the world, helping people flourish using the languages they value most. For over 90 years, we have been dedicated to identifying and categorizing 7,000 languages around the world. Our teams at SIL tirelessly document these languages, teach literacy, and engage communities with the Christian scriptures in their own language. Our linguistic, literacy, and translation tools are used by wide swaths of academia and Bible translation organizations.
Over the past two years, our team at SIL has been using ClearML to organize research for numerous AI tools including building customized, fine-tuned translation models to serve individual language groups. ClearML was key in helping to run and organize our experiments, and after a successful pilot, we launched a production self-serve AI translation service in early 2024.
The Challenge
While ClearML handled our research experiments exceptionally well, we were concerned about how well it would handle production requirements. Uptime, API stability, and rapid user support were immediately top priorities.
For uptime, our on-premises GPU server has been cost-effective, but its lack of redundancy left our AI tools vulnerable to potential disruptions. A server failure or a Texas hurricane could take our service down for a week or more. We needed a solution that didn't require purchasing another server and housing it in another city, and one that didn't mean paying for cloud services we weren't going to use. We also needed something that didn't require an engineer to intervene manually if something went down in the middle of the night. Finally, we needed a solution that could scale up if our single server became overwhelmed by user requests.
Adding to these concerns, our GPU resources were growing more diverse: our central server, users' GPU-enhanced Windows laptops, and a loaned SLURM cluster. While only the central server could meet the specific requirements of production workloads, we wanted all of them available for research and development.
The Solution
Through ClearML, we combined our cost-effective, lower-reliability GPU server with ClearML Autoscalers on Google Cloud, paying for the high-reliability GPUs only when we need them.
One key feature was the ability to prioritize our data center GPUs first: only when queue depth exceeds our specified threshold do the queues send jobs to the autoscaler. ClearML's initial release of the autoscaling technology did not have that capability. When we contacted ClearML to ask about this functionality, we received a rapid and welcome answer: ClearML had just added its Policy Manager feature, which could accommodate this very functionality.
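The spillover rule described above can be sketched in plain Python. This is an illustrative model of the behavior, not ClearML's actual Policy Manager implementation; the queue names and the depth threshold are hypothetical stand-ins.

```python
# Illustrative sketch of queue-depth-based spillover routing.
# This models the behavior described in the text; it is NOT
# ClearML's Policy Manager implementation. Queue names and the
# threshold below are hypothetical.

ON_PREM_QUEUE = "onprem_gpu"      # cost-effective data center GPUs (hypothetical name)
CLOUD_QUEUE = "gcp_autoscaler"    # Google Cloud autoscaler queue (hypothetical name)
MAX_ONPREM_DEPTH = 5              # spill over once this many jobs are waiting

def route_job(queue_depths: dict) -> str:
    """Return the queue a new job should be sent to.

    Prefer the on-premises queue; only when its backlog reaches the
    configured threshold does the job go to the cloud autoscaler
    queue, so cloud GPUs are paid for only under heavy load.
    """
    if queue_depths.get(ON_PREM_QUEUE, 0) < MAX_ONPREM_DEPTH:
        return ON_PREM_QUEUE
    return CLOUD_QUEUE
```

With this policy, a job arriving while the on-prem backlog is short stays on the data center hardware, and only sustained load spills over to paid cloud GPUs.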
A meeting was set up to discuss our use case further. Within one week, ClearML had upgraded our server to the latest enterprise version, enabling our desired autoscaler functionality. The ClearML team's quick response time and thorough support shone throughout the whole process.
As for our diverse research hardware, ClearML interfaced easily with each system, providing common S3 bucket access and running our research pipeline. We were already using Docker images for consistent production deployment; our research team added a Conda virtual environment configuration alongside the Docker image to enforce consistency, which had the added benefit of running more simply on users' laptops. ClearML's flexibility, configurability, and reliability across this mix of hardware and virtualization technologies continue to be a core asset for our team.
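As a concrete illustration, the Conda configuration paired with a Docker image typically takes the form of an `environment.yml` file that resolves identically inside the container, on a SLURM cluster, or on a laptop. The environment name and package pins below are hypothetical examples, not SIL's actual configuration:

```yaml
# environment.yml -- hypothetical example of pinning a research
# environment so it resolves the same way inside the Docker image,
# on the SLURM cluster, and on users' laptops.
name: translation-research
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip
  - pip:
      - clearml   # experiment tracking / orchestration client
      - torch     # example ML dependency (hypothetical pin)
```

Keeping this file in version control alongside the Dockerfile means both deployment paths are driven by the same declared dependencies.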

The Results
As we start 2025, our team actively supports over 300 language groups with custom, fine-tuned translation models. Our main server is dedicated primarily to these production workloads with reliable backup, and our research continues full speed ahead on the other, now well-integrated GPU resources. We are well positioned to continue scaling, confident that with ClearML our current and future production and research needs are well supported.