Bibliographic Details
Main Authors: Huang, Ying-Hsiang, Gong, Claire, Shaji, Shreya, Yan, Alison, Harka, Leslie, Du, Albert, Gopal, Anjali, Klein, Samuel J, Shen, Shannon Zejiang, Phillips, Mark, Owens, Trevor, Deeds, Kyle, Lee, Benjamin Charles Germain
Format: Preprint
Published: 2025
Subjects: Information Retrieval; Digital Libraries
Online Access: https://arxiv.org/abs/2511.11010
_version_ 1866916023601463296
author Huang, Ying-Hsiang
Gong, Claire
Shaji, Shreya
Yan, Alison
Harka, Leslie
Du, Albert
Gopal, Anjali
Klein, Samuel J
Shen, Shannon Zejiang
Phillips, Mark
Owens, Trevor
Deeds, Kyle
Lee, Benjamin Charles Germain
contents Efforts over the past three decades have produced web archives containing billions of webpage snapshots and petabytes of data. The End of Term Web Archive alone contains, among other file types, millions of PDFs produced by the federal government. While preservation with web archives has been successful, significant challenges for access and discoverability remain. For example, current affordances for browsing the End of Term PDFs are limited to downloading and browsing individual PDFs, as well as performing basic keyword search across them. In this paper, we introduce GovScape, a public search system that supports multimodal searches across 10,015,993 federal government PDFs from the 2020 End of Term crawl (70,958,487 total PDF pages) - to our knowledge, all renderable PDFs in the 2020 crawl that are 50 pages or under. GovScape supports four primary forms of search over these 10 million PDFs: in addition to providing (1) filter conditions over metadata facets including domain and crawl date and (2) exact text search against the PDF text, we provide (3) semantic text search and (4) visual search against the PDFs across individual pages, enabling users to structure queries such as "redacted documents" or "pie charts." We detail the constituent components of GovScape, including the search affordances, embedding pipeline, system architecture, and open source codebase. Significantly, the total estimated compute cost for GovScape's pre-processing pipeline for 10 million PDFs was approximately $1,500, equivalent to 47,000 PDF pages per dollar spent on compute, demonstrating the potential for immediate scalability. Accordingly, we outline steps that we have already begun pursuing toward multimodal search at the 100+ million PDF scale. GovScape can be found at https://www.govscape.net.
format Preprint
id arxiv_https___arxiv_org_abs_2511_11010
institution arXiv
publishDate 2025
record_format arxiv
title GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs
topic Information Retrieval
Digital Libraries
url https://arxiv.org/abs/2511.11010
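The cost figures quoted in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in this record (10,015,993 PDFs, 70,958,487 pages, approximately $1,500 of compute). The function names are hypothetical, and the projection to 100 million PDFs assumes cost scales linearly with PDF count, which the abstract does not state.

```python
# Figures quoted in the abstract of this record.
TOTAL_PDFS = 10_015_993
TOTAL_PAGES = 70_958_487
COMPUTE_COST_USD = 1_500  # "approximately $1,500", so results are approximate


def pages_per_dollar(pages: int, cost_usd: float) -> float:
    """Pages processed per dollar of compute."""
    return pages / cost_usd


def average_pages_per_pdf(pages: int, pdfs: int) -> float:
    """Mean page count per PDF in the collection."""
    return pages / pdfs


def projected_cost_usd(target_pdfs: int,
                       pdfs: int = TOTAL_PDFS,
                       cost_usd: float = COMPUTE_COST_USD) -> float:
    """Projected compute cost for target_pdfs.

    Assumption (not from the source): cost grows linearly with the
    number of PDFs and the page-length distribution stays the same.
    """
    return target_pdfs * cost_usd / pdfs


if __name__ == "__main__":
    print(f"pages per dollar: {pages_per_dollar(TOTAL_PAGES, COMPUTE_COST_USD):,.0f}")
    print(f"average pages per PDF: {average_pages_per_pdf(TOTAL_PAGES, TOTAL_PDFS):.2f}")
    print(f"projected cost at 100M PDFs: ${projected_cost_usd(100_000_000):,.0f}")
```

The pages-per-dollar result rounds to the abstract's figure of 47,000. The 100 million projection is only an order-of-magnitude estimate under the linear-scaling assumption.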