Bibliographic Details
Main Authors: Liu, Shicheng, Sun, Kai, Fu, Lisheng, Chen, Xilun, Zhang, Xinyuan, Lin, Zhaojiang, Shao, Rulin, Liu, Yue, Kumar, Anuj, Yih, Wen-tau, Dong, Xin Luna
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2510.01832
author Liu, Shicheng
Sun, Kai
Fu, Lisheng
Chen, Xilun
Zhang, Xinyuan
Lin, Zhaojiang
Shao, Rulin
Liu, Yue
Kumar, Anuj
Yih, Wen-tau
Dong, Xin Luna
contents Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet the formatting complicates usage, and reliably extracting structured information from them remains challenging. Existing methods either lack generalization or are resource-intensive due to per-page LLM inference. In this paper, we introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework that leverages layout similarity across webpages within the same site as a reward signal. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Our approach further improves by iteratively training on synthetic annotations from in-the-wild CommonCrawl data. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o, enabling scalable and resource-efficient web information extraction.
format Preprint
id arxiv_https___arxiv_org_abs_2510_01832
institution arXiv
publishDate 2025
record_format arxiv
title SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning
topic Computation and Language
url https://arxiv.org/abs/2510.01832