Unbabel, a technology company that combines AI and human translators to provide translation services for businesses, has developed a new AI model that it claims outperforms OpenAI's GPT-4o and other commercially available AI systems at translating between English and six commonly spoken European and Asian languages.
Translation is one of the most compelling business use cases for large language models (LLMs), the AI systems that underpin chatbots like OpenAI's ChatGPT, Google's Gemini, and Anthropic's Claude. And to date, the latest version of OpenAI's most powerful AI model, GPT-4o, has outperformed all rivals at translating languages for which large amounts of digital text exist (its performance in “low-resource languages,” which have far fewer digital documents to train on, is considerably weaker).
Unbabel tested its AI model, which it calls TowerLLM, against GPT-4o, the original GPT-4, OpenAI's GPT-3.5, and competing models from Google and language translation company DeepL. On translations from English into Spanish, French, German, Portuguese, Italian, and Korean, TowerLLM beat GPT-4o and GPT-4 by a slim margin in almost every case. Its biggest edge was in English to Korean, where it bested OpenAI's best model by about 1.5%. English to German was the exception: there, GPT-4 and GPT-4o came out ahead, though only by a few percentage points.
Unbabel also tested the models on translating documents from specific domains, including finance, medicine, law, and technical material. Here again, TowerLLM performed 1% to 2% better than OpenAI's best model.
Unbabel's results have not been independently verified. But if confirmed, the fact that GPT-4 has been bested at translation could indicate that the model, which remains the best-performing LLM on most language benchmarks despite debuting 15 months ago (an eternity in the fast-paced world of AI development), may be vulnerable to newer AI systems that are trained differently. OpenAI is reportedly training a more powerful LLM but has set no release date.
San Francisco- and Lisbon-based Unbabel said TowerLLM was trained to be multilingual on large public datasets of multilingual text, making it better at tasks in multiple languages than competing open-source AI models of similar size from companies such as Meta and French AI startup Mistral.
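For a rough sense of what training “to be multilingual” involves, continued pretraining usually interleaves corpora from each language at fixed sampling ratios so no single language dominates. Below is a minimal sketch using the Hugging Face datasets library; the Wikipedia stand-in corpus and the sampling weights are illustrative assumptions, not Unbabel's actual data mix.

```python
# Illustrative sketch: mixing multilingual corpora for continued pretraining.
# Wikipedia dumps stand in for the (much larger) public datasets described
# above; the sampling weights are assumptions, not TowerLLM's actual mix.
from datasets import load_dataset, interleave_datasets

langs = ["en", "es", "fr", "de", "pt", "it", "ko"]
weights = [0.25, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125]  # assumed ratios

streams = [
    load_dataset("wikimedia/wikipedia", f"20231101.{lang}",
                 split="train", streaming=True)
    for lang in langs
]

# Each draw picks a language according to the weights, so training batches
# mix all seven languages instead of being dominated by English.
mixed = interleave_datasets(streams, probabilities=weights, seed=42)

for i, example in enumerate(mixed):
    print(example["text"][:60].replace("\n", " "))
    if i == 4:
        break
```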
TowerLLM was then fine-tuned on a carefully curated dataset of high-quality translations between language pairs. Unbabel curated this fine-tuning dataset with the help of another AI model, called COMETKiwi, that it trained to assess translation quality.
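Because Unbabel released COMETKiwi publicly, the quality-filtering step can be sketched with its open-source unbabel-comet library. The checkpoint name and predict call below are real; the sample pairs and the 0.8 score cutoff are assumptions for illustration, not details of Unbabel's actual curation pipeline.

```python
# Sketch of reference-free quality filtering with COMETKiwi.
# Requires `pip install unbabel-comet` and Hugging Face access to the
# gated Unbabel/wmt22-cometkiwi-da checkpoint.
from comet import download_model, load_from_checkpoint

qe_model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

# Hypothetical candidate pairs; a real curation run would score millions.
candidates = [
    {"src": "The agreement is governed by Portuguese law.",
     "mt": "O acordo rege-se pela lei portuguesa."},
    {"src": "The agreement is governed by Portuguese law.",
     "mt": "O acordo é a lei."},  # mistranslation the filter should drop
]

# COMETKiwi scores source/translation pairs without a reference translation.
scores = qe_model.predict(candidates, batch_size=8, gpus=0).scores

THRESHOLD = 0.8  # assumed cutoff; the real pipeline's threshold isn't public
curated = [c for c, s in zip(candidates, scores) if s >= THRESHOLD]
print(f"Kept {len(curated)} of {len(candidates)} pairs")
```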
João Graça, Unbabel's chief technology officer, points out that most other LLMs have a high percentage of English text in their initial training data and pick up translation abilities only by chance. TowerLLM, by contrast, was trained on a dataset specifically designed to include large amounts of multilingual text, and he said that fine-tuning on a smaller dataset of select, high-quality translations was key to the resulting model's superior performance.
This is one of several recent examples of smaller AI models performing as well as or better than much larger ones when trained on higher-quality datasets. For example, Microsoft created a small language model called Phi-3 with just 3.8 billion parameters (the adjustable variables in a model), and by building what Microsoft calls a “textbook quality” dataset, it produced a model that outperforms models more than twice its size. “The insight from Phi is that people need to focus on the quality of the data,” Graça said. He noted that all AI companies use essentially the same basic algorithm designs today, with only subtle differences; it is the data that differentiates the models. “It's all about the data and the training curriculum — how you feed the data to the model,” he said.
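Graça's point about the training curriculum can be made concrete with a toy sketch: score the data for quality, then feed it to the model in tiers so the final optimization steps see only the best examples. Everything below, from the quality scores to the three-tier staging, is an assumed illustration rather than any company's published recipe.

```python
# Toy sketch of a quality-ordered training curriculum. The per-example
# "quality" scores and three-tier staging are assumptions for illustration.

def curriculum(examples, stages=3):
    """Yield examples in tiers of ascending quality, lowest tier first,
    so the later training steps see progressively better data."""
    ranked = sorted(examples, key=lambda ex: ex["quality"])
    n = len(ranked)
    for s in range(stages):
        for ex in ranked[s * n // stages:(s + 1) * n // stages]:
            yield ex  # a trainer would consume these in order

toy_data = [
    {"text": "noisy web scrape", "quality": 0.31},
    {"text": "clean news article", "quality": 0.62},
    {"text": "textbook-quality passage", "quality": 0.94},
    {"text": "boilerplate spam", "quality": 0.08},
    {"text": "well-edited forum answer", "quality": 0.55},
    {"text": "expert-written tutorial", "quality": 0.88},
]

for ex in curriculum(toy_data):
    print(f"{ex['quality']:.2f}  {ex['text']}")
```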
TowerLLM is currently offered in two sizes: one with 7 billion parameters and one with 13 billion. An earlier version of the model, which debuted in January, came close to but didn't surpass GPT-4's performance, and it handled only 10 language pairs. The new model slightly outperforms GPT-4 and supports 18 language pairs.
Because TowerLLM has been tested against GPT-4o only on translation, OpenAI's model may still hold an advantage on other tasks, such as reasoning, coding, writing, and summarization.
Graça said that Unbabel plans to expand the number of languages TowerLLM supports, adding 10 more in the near future. The model is also being fine-tuned to handle the highly specialized translation tasks that companies often care about most, such as translating complex legal documents, patent filings, and copyright material. Graça said the model is being trained to improve its “transcreation” skills: translating content not verbatim but in a way that captures subtle cultural nuances, such as colloquialisms and slang used by native speakers of a particular generation.