Bibliographic Details
Main Authors: Wang, Hexuan; Ren, Yaxuan; Bommireddypalli, Srikar; Chen, Shuxian; Prabhudesai, Adarsh; Zhou, Rongkun; Baral, Elina; Koehn, Philipp
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2603.08910
author Wang, Hexuan
Ren, Yaxuan
Bommireddypalli, Srikar
Chen, Shuxian
Prabhudesai, Adarsh
Zhou, Rongkun
Baral, Elina
Koehn, Philipp
contents We introduce SciTaRC, an expert-authored benchmark of questions about tabular data in scientific papers requiring both deep language reasoning and complex computation. We show that current state-of-the-art AI models fail on at least 23% of these questions, a gap that remains significant even for highly capable open-weight models like Llama-3.3-70B-Instruct, which fails on 65.5% of the tasks. Our analysis reveals a universal "execution bottleneck": both code and language models struggle to faithfully execute plans, even when provided with correct strategies. Specifically, code-based methods prove brittle on raw scientific tables, while natural language reasoning primarily fails due to initial comprehension issues and calculation errors.
format Preprint
id arxiv_https___arxiv_org_abs_2603_08910
institution arXiv
publishDate 2026
record_format arxiv
title SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation
topic Computation and Language
url https://arxiv.org/abs/2603.08910