Bibliographic Details
Main Authors: Kazemi, Mehran, Fatemi, Bahare, Bansal, Hritik, Palowitch, John, Anastasiou, Chrysovalantis, Mehta, Sanket Vaibhav, Jain, Lalit K., Aglietti, Virginia, Jindal, Disha, Chen, Peter, Dikkala, Nishanth, Tyen, Gladys, Liu, Xin, Shalit, Uri, Chiappa, Silvia, Olszewska, Kate, Tay, Yi, Tran, Vinh Q., Le, Quoc V., Firat, Orhan
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2502.19187
_version_ 1866910928831774720
author Kazemi, Mehran
Fatemi, Bahare
Bansal, Hritik
Palowitch, John
Anastasiou, Chrysovalantis
Mehta, Sanket Vaibhav
Jain, Lalit K.
Aglietti, Virginia
Jindal, Disha
Chen, Peter
Dikkala, Nishanth
Tyen, Gladys
Liu, Xin
Shalit, Uri
Chiappa, Silvia
Olszewska, Kate
Tay, Yi
Tran, Vinh Q.
Le, Quoc V.
Firat, Orhan
contents Large language models (LLMs) are increasingly deployed in everyday applications, demanding robust general reasoning capabilities and a diverse reasoning skillset. However, current LLM reasoning benchmarks predominantly focus on mathematical and coding abilities, leaving a gap in evaluating broader reasoning proficiencies. One particular exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to its diverse set of challenging tasks that allow for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench and its harder version, BIG-Bench Hard (BBH). State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus diminishing its utility. To address this limitation, we introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty. We evaluate various models on BBEH and observe a (harmonic) average accuracy of 9.8% for the best general-purpose model and 44.8% for the best reasoning-specialized model, indicating substantial room for improvement and highlighting the ongoing challenge of achieving robust general reasoning in LLMs. We release BBEH publicly at: https://github.com/google-deepmind/bbeh.
format Preprint
id arxiv_https___arxiv_org_abs_2502_19187
institution arXiv
publishDate 2025
record_format arxiv
title BIG-Bench Extra Hard
topic Computation and Language
url https://arxiv.org/abs/2502.19187