Staff View: Evaluating Robustness of LLMs in Question Answering on Multilingual Noisy OCR Data :: Library Catalog

Bibliographic Details
Main Authors:	Piryani, Bhawna; Mozafari, Jamshid; Abdallah, Abdelrahman; Doucet, Antoine; Jatowt, Adam
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2502.16781

_version_	1866909796645470208
author	Piryani, Bhawna; Mozafari, Jamshid; Abdallah, Abdelrahman; Doucet, Antoine; Jatowt, Adam
author_facet	Piryani, Bhawna; Mozafari, Jamshid; Abdallah, Abdelrahman; Doucet, Antoine; Jatowt, Adam
contents	Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors (imperfect extraction of text, including character insertion, deletion, and substitution) can significantly impact downstream tasks like question answering (QA). In this work, we conduct a comprehensive analysis of how OCR-induced noise affects the performance of multilingual QA systems. To support this analysis, we introduce a multilingual QA dataset, MultiOCR-QA, comprising 50K question-answer pairs across three languages: English, French, and German. The dataset is curated from OCR-ed historical documents, which include different levels and types of OCR noise. We then evaluate how different state-of-the-art Large Language Models (LLMs) perform under different error conditions, focusing on three major OCR error types. Our findings show that QA systems are highly prone to OCR-induced errors and perform poorly on noisy OCR text. By comparing model performance on clean versus noisy texts, we provide insights into the limitations of current approaches and emphasize the need for more noise-resilient QA systems in historical digitization contexts.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_16781
institution	arXiv
publishDate	2025
record_format	arxiv
title	Evaluating Robustness of LLMs in Question Answering on Multilingual Noisy OCR Data
topic	Computation and Language
url	https://arxiv.org/abs/2502.16781