Staff View: IberBench: LLM Evaluation on Iberian Languages :: Library Catalog

Bibliographic Details
Main Authors:	González, José Ángel; Obrador, Ian Borrego; Herrero, Álvaro Romo; Sarvazyan, Areg Mikael; Chinea-Ríos, Mara; Basile, Angelo; Franco-Salvador, Marc
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2504.16921

_version_	1866916704065421312
author	González, José Ángel; Obrador, Ian Borrego; Herrero, Álvaro Romo; Sarvazyan, Areg Mikael; Chinea-Ríos, Mara; Basile, Angelo; Franco-Salvador, Marc
author_facet	González, José Ángel; Obrador, Ian Borrego; Herrero, Álvaro Romo; Sarvazyan, Areg Mikael; Chinea-Ríos, Mara; Basile, Angelo; Franco-Salvador, Marc
contents	Large Language Models (LLMs) remain difficult to evaluate comprehensively, particularly for languages other than English, where high-quality data is often limited. Existing benchmarks and leaderboards are predominantly English-centric, with only a few addressing other languages. These benchmarks fall short in several key areas: they overlook the diversity of language varieties, prioritize fundamental Natural Language Processing (NLP) capabilities over tasks of industrial relevance, and are static. With these aspects in mind, we present IberBench, a comprehensive and extensible benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks, in languages spoken across the Iberian Peninsula and Ibero-America. IberBench integrates 101 datasets from evaluation campaigns and recent benchmarks, covering 22 task categories such as sentiment and emotion analysis, toxicity detection, and summarization. The benchmark addresses key limitations in current evaluation practices, such as the lack of linguistic diversity and static evaluation setups, by enabling continual updates and community-driven model and dataset submissions moderated by a committee of experts. We evaluate 23 LLMs ranging from 100 million to 14 billion parameters and provide empirical insights into their strengths and limitations. Our findings indicate that (i) LLMs perform worse on industry-relevant tasks than on fundamental ones, (ii) performance is on average lower for Galician and Basque, (iii) some tasks show results close to random, and (iv) in other tasks LLMs perform above random but below shared task systems. IberBench offers open-source implementations for the entire evaluation pipeline, including dataset normalization and hosting, incremental evaluation of LLMs, and a publicly accessible leaderboard.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_16921
institution	arXiv
publishDate	2025
record_format	arxiv
title	IberBench: LLM Evaluation on Iberian Languages
topic	Computation and Language
url	https://arxiv.org/abs/2504.16921

Similar Items