Bibliographic details
Main author: Vlachou Efstathiou, Malamatenia
Format: Digital resource
Language: French
Published: Zenodo 2026
Online access: https://doi.org/10.5281/zenodo.18745702
Table of contents:
  • <p>This repository contains the extended version of the ground truth for the codex <a href="https://gallica.bnf.fr/ark:/12148/btv1b84472995"><strong>Paris, BnF, fr. 2813</strong></a>, used in the experiments for the paper <em>“Leveraging Morphology for Metrological Historical Script Analysis”</em>, accepted to the <a href="https://icdar2026.org/"><strong>International Conference on Document Analysis and Recognition</strong></a><strong> (ICDAR 2026, Vienna, Austria)</strong>.</p> <h3><strong>What’s New Compared to v.1</strong></h3> <ul> <li> <p><strong>95 newly annotated folios</strong> have been added (see the new btv1b84472995_metadata.csv for details);</p> </li> <li> <p>The <strong>ALTO XML annotations</strong> now distinguish between `MainZone#1` and `MainZone#2`, corresponding to the column order on each page;</p> </li> <li> <p>Two versions of annotation.json are provided, one of which includes hyphenation for word breaks at line ends.</p> </li> </ul> <p>As in version 1, the repository is organized into two main data folders:</p> <p>---</p> <h2>`btv1b84472995_GT.zip`</h2> <p>This folder contains the <em>ground truth</em> dataset used for Handwritten Text Recognition (HTR), created from selected folios of the manuscript Paris, BnF, français 2813.<br>The identifier <strong>`btv1b84472995`</strong> is the ARK identifier of this manuscript in <a href="https://gallica.bnf.fr/ark:/12148/btv1b84472995" rel="noopener">Gallica</a>.</p> <p><strong>Folder structure:</strong></p> <p>btv1b84472995_GT</p> <p>├── images</p> <p>└── annotations<br><br></p> <p>- `images/`: Selected high-resolution images downloaded from Gallica.  <br>  Image names follow the pattern `btv1b84472995_f&lt;number&gt;`, where the number is the Gallica view number.  
<br>  ➤ Credit: *Source gallica.bnf.fr / Bibliothèque nationale de France*</p> <p>- `annotations/`: ALTO XML annotation files created with <a href="https://escriptorium.inria.fr/">eScriptorium</a>.</p> <p>Layout: Annotations follow the <a href="https://segmonto.github.io/">Segmonto</a> ontology. Users of the ground truth should note that we use additional custom tags:<br>  - `'RubricLines'`: Rubricated lines<br>  - `'HalfLines'`: Partial or incomplete lines<br>  - `'MainZone#1'` and `'MainZone#2'`: Column order, used instead of a plain `MainZone`</p> <p>Transcription: The dataset is <a href="https://zenodo.org/records/12743230">CATMuS</a>-compliant, using a graphemic transcription approach.</p> <p>---</p> <h2>`dataset.zip`</h2> <p>This folder contains the dataset used in the paper's experiments, which apply the DTLR architecture to paleography.</p> <p><strong>Folder structure:</strong></p> <p>dataset</p> <p>├── images</p> <p>└── annotation.json<br><br></p> <p>- `images/`: Each subfolder contains polygonal line extractions (with alpha transparency) per manuscript page.<br>- `annotation.json`: Contains the annotation and metadata for each line.</p> <p>`annotation.json` structure example:</p> <p>```json<br>"&lt;image_id&gt;": { // Corresponds to the image names in the images folders<br>  "split": "train",<br>  "label": "A beautiful calico cat.", // Transcription text of the line<br>  "line": "DefaultLine", // Type of line<br>  "zone": "MainZone#1", // Type of zone where the line is found<br>  "script": "RaouletOrleans", // Identifier for the scribal hand<br>  "folio": "1r",<br>  "gp": "GP1", // Identified Graphic Profile<br>  "doc": "HT1"<br>}<br>```</p> <p>Papers associated with the data:</p> <p>v1: https://malamatenia.github.io/bnf-fr-2813/ (Scriptorium 2026)</p> <p>v2: 
https://malamatenia.github.io/dtlr-for-metrology/ (ICDAR 2026)</p> <p>This study was supported by the CNRS through MITI and the 80|Prime program (CrEMe, Caractérisation des écritures médiévales), and by the European Research Council (ERC project DISCOVER, number 101076028).</p>
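<p>The `annotation.json` structure described above lends itself to simple filtering by split, zone, and scribal hand. The following is a minimal sketch, not part of the repository: it assumes the real file is plain JSON (without the explanatory comments) mapping image ids to records with the documented fields. The sample records, the image ids, the "test" split value, the hand name `HypotheticalHand`, and the helper `lines_by_script` are all hypothetical and serve only as illustration.</p>

```python
import json
from collections import defaultdict

# Hypothetical sample mirroring the documented record layout. The image ids,
# the "test" split value and the "HypotheticalHand" script are invented for
# illustration; only the field names come from the structure example.
SAMPLE = """
{
  "example_line_0": {"split": "train", "label": "A beautiful calico cat.",
                     "line": "DefaultLine", "zone": "MainZone#1",
                     "script": "RaouletOrleans", "folio": "1r",
                     "gp": "GP1", "doc": "HT1"},
  "example_line_1": {"split": "train", "label": "A beautiful calico cat.",
                     "line": "DefaultLine", "zone": "MainZone#2",
                     "script": "RaouletOrleans", "folio": "1r",
                     "gp": "GP1", "doc": "HT1"},
  "example_line_2": {"split": "test", "label": "A beautiful calico cat.",
                     "line": "DefaultLine", "zone": "MainZone#1",
                     "script": "HypotheticalHand", "folio": "1v",
                     "gp": "GP1", "doc": "HT1"}
}
"""


def lines_by_script(annotations, split="train", zone=None):
    """Group image ids by scribal hand for one split, optionally one zone."""
    groups = defaultdict(list)
    for image_id, record in annotations.items():
        if record["split"] != split:
            continue
        if zone is not None and record["zone"] != zone:
            continue
        groups[record["script"]].append(image_id)
    return dict(groups)


annotations = json.loads(SAMPLE)
train_groups = lines_by_script(annotations)
second_column = lines_by_script(annotations, zone="MainZone#2")
print(train_groups)
print(second_column)
```

<p>To run the same grouping on the published data, one would presumably replace `json.loads(SAMPLE)` with `json.load` on the opened `dataset/annotation.json` file, assuming the file follows the layout shown in the structure example.</p>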