Bibliographic Details
Main Authors:	Wang, Hexuan; Ren, Yaxuan; Bommireddypalli, Srikar; Chen, Shuxian; Prabhudesai, Adarsh; Zhou, Rongkun; Baral, Elina; Koehn, Philipp
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2603.08910

Table of Contents:

We introduce SciTaRC, an expert-authored benchmark of questions about tabular data in scientific papers requiring both deep language reasoning and complex computation. We show that current state-of-the-art AI models fail on at least 23% of these questions, a gap that remains significant even for highly capable open-weight models like Llama-3.3-70B-Instruct, which fails on 65.5% of the tasks. Our analysis reveals a universal "execution bottleneck": both code and language models struggle to faithfully execute plans, even when provided with correct strategies. Specifically, code-based methods prove brittle on raw scientific tables, while natural language reasoning primarily fails due to initial comprehension issues and calculation errors.

Similar Items