Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Author:	Bourne, Jonathan
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2409.19735
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866916414689902592
author	Bourne, Jonathan
author_facet	Bourne, Jonathan
contents	OCR errors are common in digitised historical archives significantly affecting their usability and value. Generative Language Models (LMs) have shown potential for correcting these errors using the context provided by the corrupted text and the broader socio-cultural context, a process called Context Leveraging OCR Correction (CLOCR-C). However, getting sufficient training data for fine-tuning such models can prove challenging. This paper shows that fine-tuning a language model on synthetic data using an LM and using a character level Markov corruption process can significantly improve the ability to correct OCR errors. Models trained on synthetic data reduce the character error rate by 55% and word error rate by 32% over the base LM and outperform models trained on real data. Key findings include; training on under-corrupted data is better than over-corrupted data; non-uniform character level corruption is better than uniform corruption; More tokens-per-observation outperforms more observations for a fixed token budget. The outputs for this paper are a set of 8 heuristics for training effective CLOCR-C models, a dataset of 11,000 synthetic 19th century newspaper articles and scrambledtext a python library for creating synthetic corrupted data.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_19735
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	Scrambled text: training Language Models to correct OCR errors using synthetic data Bourne, Jonathan Computation and Language OCR errors are common in digitised historical archives significantly affecting their usability and value. Generative Language Models (LMs) have shown potential for correcting these errors using the context provided by the corrupted text and the broader socio-cultural context, a process called Context Leveraging OCR Correction (CLOCR-C). However, getting sufficient training data for fine-tuning such models can prove challenging. This paper shows that fine-tuning a language model on synthetic data using an LM and using a character level Markov corruption process can significantly improve the ability to correct OCR errors. Models trained on synthetic data reduce the character error rate by 55% and word error rate by 32% over the base LM and outperform models trained on real data. Key findings include; training on under-corrupted data is better than over-corrupted data; non-uniform character level corruption is better than uniform corruption; More tokens-per-observation outperforms more observations for a fixed token budget. The outputs for this paper are a set of 8 heuristics for training effective CLOCR-C models, a dataset of 11,000 synthetic 19th century newspaper articles and scrambledtext a python library for creating synthetic corrupted data.
title	Scrambled text: training Language Models to correct OCR errors using synthetic data
topic	Computation and Language
url	https://arxiv.org/abs/2409.19735

Similar Items