| Main author: | |
|---|---|
| Format: | Digital resource |
| Language: | French |
| Published: | Zenodo, 2026 |
| Subjects: | |
| Online access: | https://doi.org/10.5281/zenodo.18745702 |
Table of contents:
- <p>This repository contains the extended version of the ground truth for the codex <a href="https://gallica.bnf.fr/ark:/12148/btv1b84472995"><strong>Paris, BnF, fr. 2813</strong></a>, used in the experiments for the paper <em>“Leveraging Morphology for Metrological Historical Script Analysis”</em>, accepted at the <a href="https://icdar2026.org/"><strong>International Conference on Document Analysis and Recognition</strong></a><strong> (ICDAR 2026, Vienna, Austria)</strong>.</p> <h3><strong>What’s New Compared to v.1</strong></h3> <ul> <li> <p><strong>95 newly annotated folios</strong> have been added (see the new btv1b84472995_metadata.csv for details);</p> </li> <li> <p>The <strong>ALTO XML annotations</strong> now distinguish between MainZone#1 and MainZone#2, which reflect the column order on each page;</p> </li> <li> <p>Two versions of annotation.json are provided, one of which includes hyphenation for word breaks at the end of lines.</p> </li> </ul> <p>As in version 1, the repository is organized into two main data folders:</p> <p>---</p> <h2>`btv1b84472995_GT.zip`</h2> <p>This folder contains the <em>ground truth</em> dataset used for Handwritten Text Recognition (HTR), created from selected folios of the manuscript Paris, BnF, français 2813.<br>The identifier <strong>`btv1b84472995`</strong> is the ARK identifier of this manuscript in <a href="https://gallica.bnf.fr/ark:/12148/btv1b84472995" rel="noopener">Gallica</a>.</p> <p><strong>Folder structure:</strong></p> <p>btv1b84472995_GT</p> <p>├── images</p> <p>└── annotations<br><br></p> <p>- `images/`: Selected high-resolution images downloaded from Gallica.<br> Image names follow the pattern `btv1b84472995_f&lt;number&gt;`, where the number is the Gallica view number.
<br> ➤ Credit: *Source gallica.bnf.fr / Bibliothèque nationale de France*</p> <p>- `annotations/`: XML-ALTO annotation files created with <a href="https://escriptorium.inria.fr/">eScriptorium</a>.</p> <p>Layout: Annotations follow the <a href="https://segmonto.github.io/">Segmonto</a> ontology. Users of the ground truth should note that we use the following additional custom tags:<br> - `'RubricLines'`: Rubricated lines<br> - `'HalfLines'`: Partial or incomplete lines<br> - `'MainZone#1'` and `'MainZone#2'`: column order, used instead of the plain `MainZone` tag</p> <p>Transcription: The dataset is <a href="https://zenodo.org/records/12743230">CATMuS</a>-compliant, using a graphemic transcription approach.</p> <p>---</p> <h2>`dataset.zip`</h2> <p>This folder contains the dataset used in the experiments described in the paper, which apply the DTLR architecture to paleography.</p> <p><strong>Folder structure:</strong></p> <p>dataset</p> <p>├── images</p> <p>└── annotation.json<br><br></p> <p>- `images/`: Each subfolder contains the polygonal line extractions (with alpha transparency) for a manuscript page.<br>- `annotation.json`: Contains the annotation and metadata for each line.</p> <p>`annotation.json` structure example:</p> <p>```json<br>"&lt;image_id&gt;": { // corresponds to the image names in the images folders<br> "split": "train",<br> "label": "A beautiful calico cat.", // Transcription text of the line<br> "line": "DefaultLine", // Type of line<br> "zone": "MainZone#1", // Type of zone where the line is found<br> "script": "RaouletOrleans", // Identifier for the scribal hand<br> "folio": "1r",<br> "gp": "GP1", // Identified Graphic Profile<br> "doc": "HT1"<br>}<br>```</p> <p>Papers associated with the data:</p> <p>v1: https://malamatenia.github.io/bnf-fr-2813/ (Scriptorium 2026)</p> <p>v2: https://malamatenia.github.io/dtlr-for-metrology/ (ICDAR 2026)</p> <p>This study was supported by the CNRS through MITI and the 80|Prime 
program (CrEMe Caractérisation des écritures médiévales), and by the European Research Council (ERC project DISCOVER, number 101076028).</p>
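<p>The `annotation.json` structure described above can be explored with a few lines of Python. The following is a minimal sketch, assuming that the file is a single JSON object keyed by image id, as the structure example suggests. The image ids, the second entry, and the path in the comment are hypothetical and serve only as illustration; `group_ids_by` is a helper name introduced here, not part of the dataset.</p>

```python
import json
from collections import defaultdict

# Hypothetical sample mirroring the documented fields.
# The image ids and the second entry are made up for illustration.
SAMPLE = """
{
  "example_page/line_001": {
    "split": "train", "label": "A beautiful calico cat.",
    "line": "DefaultLine", "zone": "MainZone#1",
    "script": "RaouletOrleans", "folio": "1r", "gp": "GP1", "doc": "HT1"
  },
  "example_page/line_002": {
    "split": "test", "label": "Another line.",
    "line": "RubricLines", "zone": "MainZone#2",
    "script": "RaouletOrleans", "folio": "1r", "gp": "GP1", "doc": "HT1"
  }
}
"""


def group_ids_by(annotations, field):
    """Map each value of `field` to the list of image ids that carry it."""
    groups = defaultdict(list)
    for image_id, entry in annotations.items():
        groups[entry[field]].append(image_id)
    return dict(groups)


# For the real data one would load the file instead, e.g. (path is an assumption):
#   with open("dataset/annotation.json", encoding="utf-8") as fh:
#       annotations = json.load(fh)
annotations = json.loads(SAMPLE)

by_split = group_ids_by(annotations, "split")
by_zone = group_ids_by(annotations, "zone")
print(by_split)
print(by_zone)
```

<p>Grouping by `split`, `zone`, `script`, or `gp` in this way gives a quick overview of how lines are distributed before any training or evaluation.</p>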
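<p>The custom zone and line tags in the ALTO files can be read back with the Python standard library. This sketch assumes the usual ALTO v4 layout, in which tag labels are declared as `OtherTag` elements and referenced from `TextBlock` and `TextLine` through a single `TAGREFS` id; the sample XML below is hypothetical and heavily trimmed, and the actual files in `annotations/` may differ in detail. `lines_by_zone` is a helper name introduced here.</p>

```python
import xml.etree.ElementTree as ET

ALTO_NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hypothetical, trimmed ALTO document: real exports also carry
# coordinates, baselines and polygons, which are omitted here.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="BT1" LABEL="MainZone#1"/>
    <OtherTag ID="BT2" LABEL="MainZone#2"/>
    <OtherTag ID="LT1" LABEL="DefaultLine"/>
    <OtherTag ID="LT2" LABEL="RubricLines"/>
  </Tags>
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" TAGREFS="BT1">
      <TextLine ID="l1" TAGREFS="LT2"><String CONTENT="first column rubric"/></TextLine>
      <TextLine ID="l2" TAGREFS="LT1"><String CONTENT="first column text"/></TextLine>
    </TextBlock>
    <TextBlock ID="b2" TAGREFS="BT2">
      <TextLine ID="l3" TAGREFS="LT1"><String CONTENT="second column text"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>
"""


def lines_by_zone(alto_xml):
    """Return {zone label: [(line label, text), ...]} in document order."""
    root = ET.fromstring(alto_xml)
    # Resolve tag ids (e.g. "BT1") to their human-readable labels.
    labels = {
        tag.get("ID"): tag.get("LABEL")
        for tag in root.iterfind(".//a:Tags/a:OtherTag", ALTO_NS)
    }
    result = {}
    for block in root.iterfind(".//a:TextBlock", ALTO_NS):
        zone = labels.get(block.get("TAGREFS"), "untagged")
        for line in block.iterfind("a:TextLine", ALTO_NS):
            text = " ".join(
                s.get("CONTENT", "") for s in line.iterfind("a:String", ALTO_NS)
            )
            line_type = labels.get(line.get("TAGREFS"), "untagged")
            result.setdefault(zone, []).append((line_type, text))
    return result


zones = lines_by_zone(SAMPLE_ALTO)
for zone, lines in zones.items():
    print(zone, lines)
```

<p>Reading `MainZone#1` before `MainZone#2` then yields the text in column order, and the line labels make it possible to filter out `RubricLines` or `HalfLines` when needed.</p>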