| Main Authors: | Dent, Rasul; Suarez, Pedro Ortiz; Clérice, Thibault; Sagot, Benoît |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2503.06547 |
Similar Items
How Should We Model the Probability of a Language?
by: Dent, Rasul, et al.
Published: (2026)
Molyé: A Corpus-based Approach to Language Contact in Colonial France
by: Dent, Rasul, et al.
Published: (2024)
Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions
by: Karamolegkou, Antonia, et al.
Published: (2026)
Detecting Sexual Content at the Sentence Level in First Millennium Latin Texts
by: Clérice, Thibault
Published: (2023)
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
by: Suarez, Pedro Ortiz, et al.
Published: (2026)
GlotLID: Language Identification for Low-Resource Languages
by: Kargaran, Amir Hossein, et al.
Published: (2023)
From Text to Source: Results in Detecting Large Language Model-Generated Content
by: Antoun, Wissam, et al.
Published: (2023)
Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?
by: Riabi, Arij, et al.
Published: (2021)
MaskLID: Code-Switching Language Identification through Iterative Masking
by: Kargaran, Amir Hossein, et al.
Published: (2024)
BERT-LID: Leveraging BERT to Improve Spoken Language Identification
by: Nie, Yuting, et al.
Published: (2022)
Language-Switching Triggers Take a Latent Detour Through Language Models
by: Kulumba, Francis, et al.
Published: (2026)
You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine
by: Clérice, Thibault
Published: (2022)
On the Scaling Laws of Geographical Representation in Language Models
by: Godey, Nathan, et al.
Published: (2024)
ConLID: Supervised Contrastive Learning for Low-Resource Language Identification
by: Foroutan, Negar, et al.
Published: (2025)
Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck
by: Godey, Nathan, et al.
Published: (2024)
Towards Zero-Shot Multimodal Machine Translation
by: Futeral, Matthieu, et al.
Published: (2024)
Pre-Editorial Normalization for Automatically Transcribed Medieval Manuscripts in Old French and Latin
by: Clérice, Thibault, et al.
Published: (2026)
OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report
by: Fedorova, Mariia, et al.
Published: (2026)
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
by: Futeral, Matthieu, et al.
Published: (2024)
A French Version of the OLDI Seed Corpus
by: Marmonier, Malik, et al.
Published: (2025)
ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance
by: Antoun, Wissam, et al.
Published: (2025)
LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens
by: Zebaze, Armel, et al.
Published: (2025)
Explicit Learning and the LLM in Machine Translation
by: Marmonier, Malik, et al.
Published: (2025)
TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation
by: Zebaze, Armel, et al.
Published: (2025)
Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation
by: Zebaze, Armel, et al.
Published: (2025)
Tree of Problems: Improving structured problem solving with compositionality
by: Zebaze, Armel, et al.
Published: (2024)
Testing the Deliteralization Hypothesis in Human and Machine Translation
by: Marmonier, Malik, et al.
Published: (2026)
In-Context Example Selection via Similarity Search Improves Low-Resource Machine Translation
by: Zebaze, Armel, et al.
Published: (2024)
Hindsight Quality Prediction Experiments in Multi-Candidate Human-Post-Edited Machine Translation
by: Marmonier, Malik, et al.
Published: (2026)
Making Sentence Embeddings Robust to User-Generated Content
by: Nishimwe, Lydia, et al.
Published: (2024)
Diachronic Document Dataset for Semantic Layout Analysis
by: Clérice, Thibault, et al.
Published: (2024)
When your Cousin has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages
by: Bafna, Niyati, et al.
Published: (2023)
CamemBERT 2.0: A Smarter French Language Model Aged to Perfection
by: Antoun, Wissam, et al.
Published: (2024)
Anisotropy Is Inherent to Self-Attention in Transformers
by: Godey, Nathan, et al.
Published: (2024)
LID Models are Actually Accent Classifiers: Implications and Solutions for LID on Accented Speech
by: Bafna, Niyati, et al.
Published: (2025)
BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?
by: Chambon, Pierre, et al.
Published: (2025)
Gaperon: A Peppered English-French Generative Language Model Suite
by: Godey, Nathan, et al.
Published: (2025)
Disentangling meaning from language in LLM-based machine translation
by: Lasnier, Théo, et al.
Published: (2026)
Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation
by: Gurgurov, Daniil, et al.
Published: (2025)
Patent Representation Learning via Self-supervision
by: Zuo, You, et al.
Published: (2025)