| Main Authors: | Elboher, Yair, Pinter, Yuval |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2510.26521 |
Similar Items
Protecting Privacy in Classifiers by Token Manipulation
by: Harel, Re'em, et al.
Published: (2024)
Don't Touch My Diacritics
by: Gorman, Kyle, et al.
Published: (2024)
The Degree of Language Diacriticity and Its Effect on Tasks
by: Cohen, Adi, et al.
Published: (2026)
D-Nikud: Enhancing Hebrew Diacritization with LSTM and Pretrained Models
by: Rosenthal, Adi, et al.
Published: (2024)
A Language Modeling Approach to Diacritic-Free Hebrew TTS
by: Roth, Amit, et al.
Published: (2024)
BiVert: Bidirectional Vocabulary Evaluation using Relations for Machine Translation
by: Cherf, Carinne, et al.
Published: (2024)
Automatic Restoration of Diacritics for Speech Data Sets
by: Shatnawi, Sara, et al.
Published: (2023)
Corpus-Based Approaches to Igbo Diacritic Restoration
by: Ezeani, Ignatius
Published: (2026)
Probing Subphonemes in Morphology Models
by: Astrach, Gal, et al.
Published: (2025)
Information Types in Product Reviews
by: Shapira, Ori, et al.
Published: (2025)
CharBench: Evaluating the Role of Tokenization in Character-Level Tasks
by: Uzan, Omri, et al.
Published: (2025)
Interplay of Machine Translation, Diacritics, and Diacritization
by: Chen, Wei-Rui, et al.
Published: (2024)
Which Pieces Does Unigram Tokenization Really Need?
by: Land, Sander, et al.
Published: (2025)
Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations
by: Ghannam, Ahmad, et al.
Published: (2025)
Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization
by: Elgamal, Salman, et al.
Published: (2024)
Mevaker: Conclusion Extraction and Allocation Resources for the Hebrew Language
by: Shalumov, Vitaly, et al.
Published: (2024)
Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study
by: Nadas, Mihai, et al.
Published: (2025)
Faster Superword Tokenization
by: Schmidt, Craig W., et al.
Published: (2026)
Diacritic Restoration for Low-Resource Indigenous Languages: Case Study with Bribri and Cook Islands Māori
by: Coto-Solano, Rolando, et al.
Published: (2025)
Token-Level Privacy in Large Language Models
by: Harel, Re'em, et al.
Published: (2025)
Splintering Nonconcatenative Languages for Better Tokenization
by: Gazit, Bar, et al.
Published: (2025)
Greed is All You Need: An Evaluation of Tokenizer Inference Methods
by: Uzan, Omri, et al.
Published: (2024)
How Much is Enough? The Diminishing Returns of Tokenization Training Data
by: Reddy, Varshini, et al.
Published: (2025)
An Analysis of BPE Vocabulary Trimming in Neural Machine Translation
by: Cognetta, Marco, et al.
Published: (2024)
More Data, Fewer Diacritics: Scaling Arabic TTS
by: Musleh, Ahmed, et al.
Published: (2026)
Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset
by: Bondok, Rawan, et al.
Published: (2025)
YAD: Leveraging T5 for Improved Automatic Diacritization of Yorùbá Text
by: Olawole, Akindele Michael, et al.
Published: (2024)
A Context-Contrastive Inference Approach To Partial Diacritization
by: ElNokrashy, Muhammad, et al.
Published: (2024)
Are LLMs Good Text Diacritizers? An Arabic and Yoruba Case Study
by: Toyin, Hawau Olamide, et al.
Published: (2025)
Sadeed: Advancing Arabic Diacritization Through Small Language Model
by: Aldallal, Zeina, et al.
Published: (2025)
The Effect of Scripts and Formats on LLM Numeracy
by: Reddy, Varshini, et al.
Published: (2026)
Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier
by: Schmidt, Craig W., et al.
Published: (2025)
Arabic Text Diacritization In The Age Of Transfer Learning: Token Classification Is All You Need
by: Skiredj, Abderrahman, et al.
Published: (2024)
MenakBERT -- Hebrew Diacriticizer
by: Cohen, Ido, et al.
Published: (2024)
MRL Parsing Without Tears: The Case of Hebrew
by: Shmidman, Shaltiel, et al.
Published: (2024)
Tokenization with Split Trees
by: Schmidt, Craig W., et al.
Published: (2026)
VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling
by: Marmor, Yanir, et al.
Published: (2026)
Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge
by: Batsuren, Khuyagbaatar, et al.
Published: (2024)
Reference-Based Post-OCR Processing with LLM for Precise Diacritic Text in Historical Document Recognition
by: Do, Thao, et al.
Published: (2024)
OMPar: Automatic Parallelization with AI-Driven Source-to-Source Compilation
by: Kadosh, Tal, et al.
Published: (2024)