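| Main Authors: | Wang, Hexuan; Ren, Yaxuan; Bommireddypalli, Srikar; Chen, Shuxian; Prabhudesai, Adarsh; Zhou, Rongkun; Baral, Elina; Koehn, Philipp |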
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computation and Language |
| Online Access: | https://arxiv.org/abs/2603.08910 |
| _version_ | 1866908875311022080 |
|---|---|
| author | Wang, Hexuan; Ren, Yaxuan; Bommireddypalli, Srikar; Chen, Shuxian; Prabhudesai, Adarsh; Zhou, Rongkun; Baral, Elina; Koehn, Philipp |
| author_facet | Wang, Hexuan; Ren, Yaxuan; Bommireddypalli, Srikar; Chen, Shuxian; Prabhudesai, Adarsh; Zhou, Rongkun; Baral, Elina; Koehn, Philipp |
| contents | We introduce SciTaRC, an expert-authored benchmark of questions about tabular data in scientific papers requiring both deep language reasoning and complex computation. We show that current state-of-the-art AI models fail on at least 23% of these questions, a gap that remains significant even for highly capable open-weight models like Llama-3.3-70B-Instruct, which fails on 65.5% of the tasks. Our analysis reveals a universal "execution bottleneck": both code and language models struggle to faithfully execute plans, even when provided with correct strategies. Specifically, code-based methods prove brittle on raw scientific tables, while natural language reasoning primarily fails due to initial comprehension issues and calculation errors. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2603_08910 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation |
| topic | Computation and Language |
| url | https://arxiv.org/abs/2603.08910 |