Staff View: Frame-Guided Synthetic Claim Generation for Automatic Fact-Checking Using High-Volume Tabular Data :: Library Catalog

Bibliographic Details
Main Authors:	Devasier, Jacob; Putta, Akshith; Wang, Qing; Moses, Alankrit; Li, Chengkai
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2601.17232

_version_	1866908785479516160
author	Devasier, Jacob; Putta, Akshith; Wang, Qing; Moses, Alankrit; Li, Chengkai
author_facet	Devasier, Jacob; Putta, Akshith; Wang, Qing; Moses, Alankrit; Li, Chengkai
contents	Automated fact-checking benchmarks have largely ignored the challenge of verifying claims against real-world, high-volume structured data, instead focusing on small, curated tables. We introduce a new large-scale, multilingual dataset to address this critical gap. It contains 78,503 synthetic claims grounded in 434 complex OECD tables, which average over 500K rows each. We propose a novel, frame-guided methodology where algorithms programmatically select significant data points based on six semantic frames to generate realistic claims in English, Chinese, Spanish, and Hindi. Crucially, we demonstrate through knowledge-probing experiments that LLMs have not memorized these facts, forcing systems to perform genuine retrieval and reasoning rather than relying on parameterized knowledge. We provide a baseline SQL-generation system and show that our benchmark is highly challenging. Our analysis identifies evidence retrieval as the primary bottleneck, with models struggling to find the correct data in massive tables. This dataset provides a critical new resource for advancing research on this unsolved, real-world problem.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_17232
institution	arXiv
publishDate	2026
record_format	arxiv
title	Frame-Guided Synthetic Claim Generation for Automatic Fact-Checking Using High-Volume Tabular Data
topic	Computation and Language
url	https://arxiv.org/abs/2601.17232

Similar Items