Amid the rapid development of AI translation technology worldwide, TransBench, described as the first application-oriented evaluation leaderboard for AI translation, has been officially released. The leaderboard was jointly developed by the Alibaba International AI Business Team, Shanghai Artificial Intelligence Laboratory, and Beijing Language and Culture University. Its aim is to give the industry a standardized way to assess translation quality.
Unlike traditional translation evaluations, TransBench introduces new indicators such as hallucination rate, use of culturally taboo words, and adherence to honorific norms, which target key problems in large-model translation. These indicators are derived from feedback in real usage scenarios and are meant to reflect how practical and culturally appropriate a translation is. For instance, a translation that reads smoothly but contains fabricated information is marked as a "hallucination"; likewise, a translation that does not fit the local culture or omits necessary polite language also counts against the model in the evaluation.
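As a rough illustration of how indicators of this kind could be tallied, the sketch below computes per-issue rates from reviewer flags. It is not TransBench's actual scoring method: the `Judgment` class, its field names, and the assumption that each rate is simply the share of flagged translations are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One evaluated translation with hypothetical reviewer flags."""
    hallucinated: bool = False        # fluent output that contains fabricated information
    taboo_term: bool = False          # uses a culturally taboo word
    missing_honorific: bool = False   # lacks required polite or honorific language


def issue_rates(judgments):
    """Return the fraction of translations flagged for each issue type.

    Assumption: each rate is flagged_count / total_count. The article does
    not state how TransBench defines or weights these indicators.
    """
    total = len(judgments)
    if total == 0:
        raise ValueError("need at least one judgment")
    return {
        "hallucination_rate": sum(j.hallucinated for j in judgments) / total,
        "taboo_rate": sum(j.taboo_term for j in judgments) / total,
        "honorific_violation_rate": sum(j.missing_honorific for j in judgments) / total,
    }


# Four made-up judgments, for illustration only.
sample = [
    Judgment(),
    Judgment(hallucinated=True),
    Judgment(missing_honorific=True),
    Judgment(hallucinated=True, taboo_term=True),
]
rates = issue_rates(sample)
print(rates)
```

Reporting each issue as a separate rate, rather than folding them into one score, keeps a fluent but fabricated translation visible even when its overall quality looks high.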
According to the latest results, GPT-4o remains the "ceiling" for AI translation: it performs excellently across multiple languages and has the highest overall score. DeepL Translate and GPT-4-Turbo follow closely behind. DeepL Translate is a model designed specifically for machine translation; its latest version was released just last month and significantly improved translation quality. In the e-commerce domain, DeepSeek-R1 also performs strongly, demonstrating its competitiveness in specialized fields.
On cultural characteristics, the Qwen series stands out: Qwen2.5-0.5B-Instruct and Qwen2.5-1.5B-Instruct rank first and second respectively, showing their strength in cross-cultural translation. This series of models is jointly developed by multiple research institutions, supports multiple languages, and aims to improve the cultural adaptability of translations.
In Chinese translation, GPT-4o again tops the list, followed by DeepSeek-V3 and Claude-3.5-Sonnet. DeepSeek-V3 has drawn particular attention for its excellent scores in the e-commerce domain.
TransBench's evaluation methods and datasets have now been open-sourced, and AI translation providers are encouraged to take part in side-by-side comparisons and performance assessments. The move lays a foundation for industry standardization and promotes the further development of AI translation technology.
The Alibaba International AI Business Team said that as translation technology continues to advance, the industry's requirements for translation models are becoming increasingly stringent, and that TransBench is an evaluation standard created in response to this demand. Alibaba International will continue to focus on applying AI technology to help more enterprises expand globally.
As competition in the AI translation market intensifies, the release of TransBench gives the industry a clear benchmark and offers users a reliable reference when choosing translation services.