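Staff View: Revise: A Framework for Revising OCRed text in Practical Information Systems with Data Contamination Strategy :: Library Catalog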

Bibliographic Details
Main Authors:	Shim, Gyuho; Hong, Seongtae; Lim, Heuiseok
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.08115

_version_	1866910116176986112
author	Shim, Gyuho; Hong, Seongtae; Lim, Heuiseok
author_facet	Shim, Gyuho; Hong, Seongtae; Lim, Heuiseok
contents	Recent advances in Large Language Models (LLMs) have significantly improved the field of Document AI, demonstrating remarkable performance on document understanding tasks such as question answering. However, existing approaches primarily focus on solving specific tasks, lacking the capability to structurally organize and manage document information. To address this limitation, we propose Revise, a framework that systematically corrects errors introduced by OCR at the character, word, and structural levels. Specifically, Revise employs a comprehensive hierarchical taxonomy of common OCR errors and a synthetic data generation strategy that realistically simulates such errors to train an effective correction model. Experimental results demonstrate that Revise effectively corrects OCR outputs, enabling more structured representation and systematic management of document contents. Consequently, our method significantly enhances downstream performance in document retrieval and question answering tasks, highlighting the potential to overcome the structural management limitations of existing Document AI frameworks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_08115
institution	arXiv
publishDate	2026
record_format	arxiv
title	Revise: A Framework for Revising OCRed text in Practical Information Systems with Data Contamination Strategy
topic	Artificial Intelligence
url	https://arxiv.org/abs/2604.08115

Similar Items