Bibliographic Details
Main Authors: Khelil, Cherifa Ben, Antoine, Jean-Yves, Halftermeyer, Anaïs, Rayar, Frédéric, Thebaud, Mathieu
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2604.05899
_version_ 1866911573754249216
author Khelil, Cherifa Ben
Antoine, Jean-Yves
Halftermeyer, Anaïs
Rayar, Frédéric
Thebaud, Mathieu
contents In this paper, we introduce the French-YMCA corpus, a new linguistic resource specifically tailored for children and adolescents. The motivation for building this corpus is clear: children have unique language requirements, as their language skills are constantly evolving and differ from those of adults. The French-YMCA corpus comprises 39,200 text files totaling 22,471,898 words. It distinguishes itself through its diverse sources, its consistent grammar and spelling, and its commitment to open online access for all. Such a corpus can serve as the foundation for training language models that understand and anticipate the language of young people, thereby enhancing the quality of digital interactions and ensuring that responses and suggestions are age-appropriate and adapted to the comprehension level of users in this age group.
format Preprint
id arxiv_https___arxiv_org_abs_2604_05899
institution arXiv
publishDate 2026
record_format arxiv
title FRENCH-YMCA: A FRENCH Corpus meeting the language needs of Youth, froM Children to Adolescents
topic Computation and Language
url https://arxiv.org/abs/2604.05899