Bibliographic Details
Main authors: Tschirschwitz, David; Rodehorst, Volker
Format: Preprint
Published: 2025
Subjects:
Online access: https://arxiv.org/abs/2501.15469
_version_ 1866912205347225600
author Tschirschwitz, David
Rodehorst, Volker
author_facet Tschirschwitz, David
Rodehorst, Volker
contents Reproducibility and replicability are critical pillars of empirical research, particularly in machine learning, where they depend not only on the availability of models, but also on the datasets used to train and evaluate those models. In this paper, we introduce the Construction Industry Steel Ordering List (CISOL) dataset, which was developed with a focus on transparency to ensure reproducibility, replicability, and extensibility. CISOL provides a valuable new research resource and highlights the importance of having diverse datasets, even in niche application domains such as table extraction in civil engineering. CISOL is unique in that it contains real-world civil engineering documents from industry, making it a distinctive contribution to the field. The dataset contains more than 120,000 annotated instances in over 800 document images, positioning it as a medium-sized dataset that provides a robust foundation for Table Structure Recognition (TSR) and Table Detection (TD) tasks. Benchmarking results show that the YOLOv8 model achieves 67.22 mAP@0.5:0.95:0.05 on CISOL, outperforming the TSR-specific TATR model. This highlights the effectiveness of CISOL as a benchmark for advancing TSR, especially in specialized domains.
format Preprint
id arxiv_https___arxiv_org_abs_2501_15469
institution arXiv
publishDate 2025
record_format arxiv
title CISOL: An Open and Extensible Dataset for Table Structure Recognition in the Construction Industry
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2501.15469