Bibliographic Details
Main Authors: Hsu, Hsin-Ling, Lin, Ping-Sheng, Lin, Jing-Di, Tzeng, Jengnan
Format: Preprint
Published: 2025
Subjects: Information Retrieval
Online Access: https://arxiv.org/abs/2503.08452
author Hsu, Hsin-Ling
Lin, Ping-Sheng
Lin, Jing-Di
Tzeng, Jengnan
contents Hybrid Retrieval systems, combining Sparse and Dense Retrieval methods, struggle with Traditional Chinese non-narrative documents due to their complex formatting, rich vocabulary, and the insufficient understanding of Chinese synonyms by common embedding models. Previous approaches inadequately address the dual needs of these systems, focusing mainly on general text quality improvement rather than optimizing for retrieval. We propose Knowledge-Aware Preprocessing (KAP), a novel framework that transforms noisy OCR outputs into retrieval-optimized text. KAP adopts a two-stage approach: it first extracts text using OCR, then employs Multimodal Large Language Models to refine the output by integrating visual information from the original documents. This design reduces OCR noise, reconstructs structural elements, and formats the text to satisfy the distinct requirements of sparse and dense retrieval. Empirical results demonstrate that KAP consistently and significantly outperforms conventional preprocessing approaches. Our code is available at https://github.com/JustinHsu1019/KAP.
format Preprint
id arxiv_https___arxiv_org_abs_2503_08452
institution arXiv
publishDate 2025
record_format arxiv
title KAP: MLLM-assisted OCR Text Enhancement for Hybrid Retrieval in Chinese Non-Narrative Documents
topic Information Retrieval
url https://arxiv.org/abs/2503.08452