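Staff View: SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation :: Library Catalog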

Bibliographic Details
Main Authors:	Wang, Hexuan; Ren, Yaxuan; Bommireddypalli, Srikar; Chen, Shuxian; Prabhudesai, Adarsh; Zhou, Rongkun; Baral, Elina; Koehn, Philipp
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2603.08910

_version_	1866908875311022080
author	Wang, Hexuan; Ren, Yaxuan; Bommireddypalli, Srikar; Chen, Shuxian; Prabhudesai, Adarsh; Zhou, Rongkun; Baral, Elina; Koehn, Philipp
author_facet	Wang, Hexuan; Ren, Yaxuan; Bommireddypalli, Srikar; Chen, Shuxian; Prabhudesai, Adarsh; Zhou, Rongkun; Baral, Elina; Koehn, Philipp
contents	We introduce SciTaRC, an expert-authored benchmark of questions about tabular data in scientific papers requiring both deep language reasoning and complex computation. We show that current state-of-the-art AI models fail on at least 23% of these questions, a gap that remains significant even for highly capable open-weight models like Llama-3.3-70B-Instruct, which fails on 65.5% of the tasks. Our analysis reveals a universal "execution bottleneck": both code and language models struggle to faithfully execute plans, even when provided with correct strategies. Specifically, code-based methods prove brittle on raw scientific tables, while natural language reasoning primarily fails due to initial comprehension issues and calculation errors.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_08910
institution	arXiv
publishDate	2026
record_format	arxiv
title	SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation
topic	Computation and Language
url	https://arxiv.org/abs/2603.08910

Similar Items