Staff View: The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks :: Library Catalog

Bibliographic Details
Main Authors:	Wu, Minghao; Wang, Weixuan; Liu, Sinuo; Yin, Huifeng; Wang, Xintong; Zhao, Yu; Lyu, Chenyang; Wang, Longyue; Luo, Weihua; Zhang, Kaifu
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2504.15521

_version_	1866908331532091392
author	Wu, Minghao; Wang, Weixuan; Liu, Sinuo; Yin, Huifeng; Wang, Xintong; Zhao, Yu; Lyu, Chenyang; Wang, Longyue; Luo, Weihua; Zhang, Kaifu
author_facet	Wu, Minghao; Wang, Weixuan; Liu, Sinuo; Yin, Huifeng; Wang, Xintong; Zhao, Yu; Lyu, Chenyang; Wang, Longyue; Luo, Weihua; Zhang, Kaifu
contents	As large language models (LLMs) continue to advance in linguistic capabilities, robust multilingual evaluation has become essential for promoting equitable technological progress. This position paper examines over 2,000 multilingual (non-English) benchmarks from 148 countries, published between 2021 and 2024, to evaluate past, present, and future practices in multilingual benchmarking. Our findings reveal that, despite significant investments amounting to tens of millions of dollars, English remains significantly overrepresented in these benchmarks. Additionally, most benchmarks rely on original language content rather than translations, with the majority sourced from high-resource countries such as China, India, Germany, the UK, and the USA. Furthermore, a comparison of benchmark performance with human judgments highlights notable disparities. STEM-related tasks exhibit strong correlations with human evaluations (0.70 to 0.85), while traditional NLP tasks like question answering (e.g., XQuAD) show much weaker correlations (0.11 to 0.30). Moreover, translating English benchmarks into other languages proves insufficient, as localized benchmarks demonstrate significantly higher alignment with local human judgments (0.68) than their translated counterparts (0.47). This underscores the importance of creating culturally and linguistically tailored benchmarks rather than relying solely on translations. Through this comprehensive analysis, we highlight six key limitations in current multilingual evaluation practices, propose the guiding principles accordingly for effective multilingual benchmarking, and outline five critical research directions to drive progress in the field. Finally, we call for a global collaborative effort to develop human-aligned benchmarks that prioritize real-world applications.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_15521
institution	arXiv
publishDate	2025
record_format	arxiv
title	The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
topic	Computation and Language
url	https://arxiv.org/abs/2504.15521

Similar Items