Staff View: Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests :: Library Catalog


Bibliographic Details
Main Authors:	Momentè, Filippo; Suglia, Alessandro; Giulianelli, Mario; Ferrari, Ambra; Koller, Alexander; Lemon, Oliver; Schlangen, David; Fernández, Raquel; Bernardi, Raffaella
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2502.14359

_version_	1866909803582849024
author	Momentè, Filippo; Suglia, Alessandro; Giulianelli, Mario; Ferrari, Ambra; Koller, Alexander; Lemon, Oliver; Schlangen, David; Fernández, Raquel; Bernardi, Raffaella
author_facet	Momentè, Filippo; Suglia, Alessandro; Giulianelli, Mario; Ferrari, Ambra; Koller, Alexander; Lemon, Oliver; Schlangen, David; Fernández, Raquel; Bernardi, Raffaella
contents	We examine three evaluation paradigms: standard benchmarks (e.g., MMLU and BBH), interactive games (e.g., Signalling Games or Taboo), and cognitive tests (e.g., for working memory or theory of mind). First, we investigate which of the former two (benchmarks or games) is most effective at discriminating LLMs of varying quality. Then, inspired by human cognitive assessments, we compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use, and we investigate their correlation with model performance in benchmarks and games. Our analyses reveal that interactive games are superior to standard benchmarks in discriminating models. Causal and logical reasoning correlate with both static and interactive tests, while differences emerge regarding core executive functions and social/emotional skills, which correlate more with games. We advocate for the development of new interactive benchmarks and targeted cognitive tasks inspired by assessing human abilities but designed specifically for LLMs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_14359
institution	arXiv
publishDate	2025
record_format	arxiv
title	Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests
topic	Computation and Language
url	https://arxiv.org/abs/2502.14359
