Staff View: BIG-Bench Extra Hard :: Library Catalog

Bibliographic Details
Main Authors:	Kazemi, Mehran; Fatemi, Bahare; Bansal, Hritik; Palowitch, John; Anastasiou, Chrysovalantis; Mehta, Sanket Vaibhav; Jain, Lalit K.; Aglietti, Virginia; Jindal, Disha; Chen, Peter; Dikkala, Nishanth; Tyen, Gladys; Liu, Xin; Shalit, Uri; Chiappa, Silvia; Olszewska, Kate; Tay, Yi; Tran, Vinh Q.; Le, Quoc V.; Firat, Orhan
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2502.19187

_version_	1866910928831774720
author	Kazemi, Mehran; Fatemi, Bahare; Bansal, Hritik; Palowitch, John; Anastasiou, Chrysovalantis; Mehta, Sanket Vaibhav; Jain, Lalit K.; Aglietti, Virginia; Jindal, Disha; Chen, Peter; Dikkala, Nishanth; Tyen, Gladys; Liu, Xin; Shalit, Uri; Chiappa, Silvia; Olszewska, Kate; Tay, Yi; Tran, Vinh Q.; Le, Quoc V.; Firat, Orhan
author_facet	Kazemi, Mehran; Fatemi, Bahare; Bansal, Hritik; Palowitch, John; Anastasiou, Chrysovalantis; Mehta, Sanket Vaibhav; Jain, Lalit K.; Aglietti, Virginia; Jindal, Disha; Chen, Peter; Dikkala, Nishanth; Tyen, Gladys; Liu, Xin; Shalit, Uri; Chiappa, Silvia; Olszewska, Kate; Tay, Yi; Tran, Vinh Q.; Le, Quoc V.; Firat, Orhan
contents	Large language models (LLMs) are increasingly deployed in everyday applications, demanding robust general reasoning capabilities and a diverse reasoning skill set. However, current LLM reasoning benchmarks predominantly focus on mathematical and coding abilities, leaving a gap in evaluating broader reasoning proficiencies. One particular exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to its diverse set of challenging tasks that allowed for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench and its harder version, BIG-Bench Hard (BBH). State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus diminishing its utility. To address this limitation, we introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty. We evaluate various models on BBEH and observe a (harmonic) average accuracy of 9.8% for the best general-purpose model and 44.8% for the best reasoning-specialized model, indicating substantial room for improvement and highlighting the ongoing challenge of achieving robust general reasoning in LLMs. We release BBEH publicly at: https://github.com/google-deepmind/bbeh.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_19187
institution	arXiv
publishDate	2025
record_format	arxiv
title	BIG-Bench Extra Hard
topic	Computation and Language
url	https://arxiv.org/abs/2502.19187

Similar Items