Staff View: KAP: MLLM-assisted OCR Text Enhancement for Hybrid Retrieval in Chinese Non-Narrative Documents

Bibliographic Details
Main Authors:	Hsu, Hsin-Ling; Lin, Ping-Sheng; Lin, Jing-Di; Tzeng, Jengnan
Format:	Preprint
Published:	2025
Subjects:	Information Retrieval
Online Access:	https://arxiv.org/abs/2503.08452

_version_	1866910924563021824
author	Hsu, Hsin-Ling; Lin, Ping-Sheng; Lin, Jing-Di; Tzeng, Jengnan
author_facet	Hsu, Hsin-Ling; Lin, Ping-Sheng; Lin, Jing-Di; Tzeng, Jengnan
contents	Hybrid Retrieval systems, combining Sparse and Dense Retrieval methods, struggle with Traditional Chinese non-narrative documents due to their complex formatting, rich vocabulary, and the insufficient understanding of Chinese synonyms by common embedding models. Previous approaches inadequately address the dual needs of these systems, focusing mainly on general text quality improvement rather than optimizing for retrieval. We propose Knowledge-Aware Preprocessing (KAP), a novel framework that transforms noisy OCR outputs into retrieval-optimized text. KAP adopts a two-stage approach: it first extracts text using OCR, then employs Multimodal Large Language Models to refine the output by integrating visual information from the original documents. This design reduces OCR noise, reconstructs structural elements, and formats the text to satisfy the distinct requirements of sparse and dense retrieval. Empirical results demonstrate that KAP consistently and significantly outperforms conventional preprocessing approaches. Our code is available at https://github.com/JustinHsu1019/KAP.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_08452
institution	arXiv
publishDate	2025
record_format	arxiv
title	KAP: MLLM-assisted OCR Text Enhancement for Hybrid Retrieval in Chinese Non-Narrative Documents
topic	Information Retrieval
url	https://arxiv.org/abs/2503.08452

Similar Items