Staff View: SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Shicheng; Sun, Kai; Fu, Lisheng; Chen, Xilun; Zhang, Xinyuan; Lin, Zhaojiang; Shao, Rulin; Liu, Yue; Kumar, Anuj; Yih, Wen-tau; Dong, Xin Luna
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2510.01832

_version_	1866911189003403264
author	Liu, Shicheng; Sun, Kai; Fu, Lisheng; Chen, Xilun; Zhang, Xinyuan; Lin, Zhaojiang; Shao, Rulin; Liu, Yue; Kumar, Anuj; Yih, Wen-tau; Dong, Xin Luna
author_facet	Liu, Shicheng; Sun, Kai; Fu, Lisheng; Chen, Xilun; Zhang, Xinyuan; Lin, Zhaojiang; Shao, Rulin; Liu, Yue; Kumar, Anuj; Yih, Wen-tau; Dong, Xin Luna
contents	Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet the formatting complicates usage, and reliably extracting structured information from them remains challenging. Existing methods either lack generalization or are resource-intensive due to per-page LLM inference. In this paper, we introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework that leverages layout similarity across webpages within the same site as a reward signal. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Our approach further improves by iteratively training on synthetic annotations from in-the-wild CommonCrawl data. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o, enabling scalable and resource-efficient web information extraction.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_01832
institution	arXiv
publishDate	2025
record_format	arxiv
title	SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning
topic	Computation and Language
url	https://arxiv.org/abs/2510.01832

Similar Items