:: Library Catalog

Bibliographic Details
Main Authors:	Elboher, Yair; Pinter, Yuval
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2510.26521

Similar Items

Protecting Privacy in Classifiers by Token Manipulation
by: Harel, Re'em, et al.
Published: (2024)

Don't Touch My Diacritics
by: Gorman, Kyle, et al.
Published: (2024)

The Degree of Language Diacriticity and Its Effect on Tasks
by: Cohen, Adi, et al.
Published: (2026)

D-Nikud: Enhancing Hebrew Diacritization with LSTM and Pretrained Models
by: Rosenthal, Adi, et al.
Published: (2024)

A Language Modeling Approach to Diacritic-Free Hebrew TTS
by: Roth, Amit, et al.
Published: (2024)

BiVert: Bidirectional Vocabulary Evaluation using Relations for Machine Translation
by: Cherf, Carinne, et al.
Published: (2024)

Automatic Restoration of Diacritics for Speech Data Sets
by: Shatnawi, Sara, et al.
Published: (2023)

Corpus-Based Approaches to Igbo Diacritic Restoration
by: Ezeani, Ignatius
Published: (2026)

Probing Subphonemes in Morphology Models
by: Astrach, Gal, et al.
Published: (2025)

Information Types in Product Reviews
by: Shapira, Ori, et al.
Published: (2025)

CharBench: Evaluating the Role of Tokenization in Character-Level Tasks
by: Uzan, Omri, et al.
Published: (2025)

Interplay of Machine Translation, Diacritics, and Diacritization
by: Chen, Wei-Rui, et al.
Published: (2024)

Which Pieces Does Unigram Tokenization Really Need?
by: Land, Sander, et al.
Published: (2025)

Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations
by: Ghannam, Ahmad, et al.
Published: (2025)

Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization
by: Elgamal, Salman, et al.
Published: (2024)

Mevaker: Conclusion Extraction and Allocation Resources for the Hebrew Language
by: Shalumov, Vitaly, et al.
Published: (2024)

Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study
by: Nadas, Mihai, et al.
Published: (2025)

Faster Superword Tokenization
by: Schmidt, Craig W., et al.
Published: (2026)

Diacritic Restoration for Low-Resource Indigenous Languages: Case Study with Bribri and Cook Islands Māori
by: Coto-Solano, Rolando, et al.
Published: (2025)

Token-Level Privacy in Large Language Models
by: Harel, Re'em, et al.
Published: (2025)

Splintering Nonconcatenative Languages for Better Tokenization
by: Gazit, Bar, et al.
Published: (2025)

Greed is All You Need: An Evaluation of Tokenizer Inference Methods
by: Uzan, Omri, et al.
Published: (2024)

How Much is Enough? The Diminishing Returns of Tokenization Training Data
by: Reddy, Varshini, et al.
Published: (2025)

An Analysis of BPE Vocabulary Trimming in Neural Machine Translation
by: Cognetta, Marco, et al.
Published: (2024)

More Data, Fewer Diacritics: Scaling Arabic TTS
by: Musleh, Ahmed, et al.
Published: (2026)

Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset
by: Bondok, Rawan, et al.
Published: (2025)

YAD: Leveraging T5 for Improved Automatic Diacritization of Yorùbá Text
by: Olawole, Akindele Michael, et al.
Published: (2024)

A Context-Contrastive Inference Approach To Partial Diacritization
by: ElNokrashy, Muhammad, et al.
Published: (2024)

Are LLMs Good Text Diacritizers? An Arabic and Yoruba Case Study
by: Toyin, Hawau Olamide, et al.
Published: (2025)

Sadeed: Advancing Arabic Diacritization Through Small Language Model
by: Aldallal, Zeina, et al.
Published: (2025)

The Effect of Scripts and Formats on LLM Numeracy
by: Reddy, Varshini, et al.
Published: (2026)

Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier
by: Schmidt, Craig W., et al.
Published: (2025)

Arabic Text Diacritization In The Age Of Transfer Learning: Token Classification Is All You Need
by: Skiredj, Abderrahman, et al.
Published: (2024)

MenakBERT -- Hebrew Diacriticizer
by: Cohen, Ido, et al.
Published: (2024)

MRL Parsing Without Tears: The Case of Hebrew
by: Shmidman, Shaltiel, et al.
Published: (2024)

Tokenization with Split Trees
by: Schmidt, Craig W., et al.
Published: (2026)

VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling
by: Marmor, Yanir, et al.
Published: (2026)

Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge
by: Batsuren, Khuyagbaatar, et al.
Published: (2024)

Reference-Based Post-OCR Processing with LLM for Precise Diacritic Text in Historical Document Recognition
by: Do, Thao, et al.
Published: (2024)

OMPar: Automatic Parallelization with AI-Driven Source-to-Source Compilation
by: Kadosh, Tal, et al.
Published: (2024)