| Main Authors: | Wicks, Rachel, Ravisankar, Kartik, Yang, Xinchen, Koehn, Philipp, Post, Matt |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.21265 |
Similar Items
Recovering document annotations for sentence-level bitext
by: Wicks, Rachel, et al.
Published: (2024)
Can you map it to English? The Role of Cross-Lingual Alignment in Multilingual Performance of LLMs
by: Ravisankar, Kartik, et al.
Published: (2025)
Escaping the sentence-level paradigm in machine translation
by: Post, Matt, et al.
Published: (2023)
Speech Vecalign: an Embedding-based Method for Aligning Parallel Speech Documents
by: Meng, Chutong, et al.
Published: (2025)
Text Style Transfer with Parameter-efficient LLM Finetuning and Round-trip Translation
by: Liu, Ruoxi, et al.
Published: (2026)
Bridging the Gap between Different Vocabularies for LLM Ensemble
by: Xu, Yangyifan, et al.
Published: (2024)
Learn and Unlearn: Addressing Misinformation in Multilingual LLMs
by: Lu, Taiming, et al.
Published: (2024)
Pointer-Generator Networks for Low-Resource Machine Translation: Don't Copy That!
by: Bafna, Niyati, et al.
Published: (2024)
PEAR: Pairwise Evaluation for Automatic Relative Scoring in Machine Translation
by: Proietti, Lorenzo, et al.
Published: (2026)
SLIDE: Reference-free Evaluation for Machine Translation using a Sliding Document Window
by: Raunak, Vikas, et al.
Published: (2023)
Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models
by: Li, Tianjian, et al.
Published: (2023)
Steering Large Language Models with Register Analysis for Arbitrary Style Transfer
by: Yang, Xinchen, et al.
Published: (2025)
Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies
by: Kocmi, Tom, et al.
Published: (2024)
Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models
by: Alrefaie, Mohamed Taher, et al.
Published: (2024)
DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation
by: Tan, Weiting, et al.
Published: (2024)
TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment
by: Li, Chong, et al.
Published: (2026)
Neuron-Level Emotion Control in Speech-Generative Large Audio-Language Models
by: Zhao, Xiutian, et al.
Published: (2026)
HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation
by: Zhang, Shijie, et al.
Published: (2025)
Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models
by: Balde, Gunjan, et al.
Published: (2024)
When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling
by: Yun, Heecheol, et al.
Published: (2025)
TokAlign: Efficient Vocabulary Adaptation via Token Alignment
by: Li, Chong, et al.
Published: (2025)
Parallel Tokenizers: Rethinking Vocabulary Design for Cross-Lingual Transfer
by: Kautsar, Muhammad Dehan Al, et al.
Published: (2025)
PyMarian: Fast Neural Machine Translation and Evaluation in Python
by: Gowda, Thamme, et al.
Published: (2024)
X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale
by: Xu, Haoran, et al.
Published: (2024)
CTC-GMM: CTC guided modality matching for fast and accurate streaming speech translation
by: Zhao, Rui, et al.
Published: (2024)
When the Same Coefficients Reach Different Places: Asymmetric Realizability in Transplanting Tokenizers across Large Language Models
by: Liu, Xiaoze, et al.
Published: (2025)
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
by: Huang, Hongzhi, et al.
Published: (2025)
Interchangeable Token Embeddings for Extendable Vocabulary and Alpha-Equivalence
by: Işık, İlker, et al.
Published: (2024)
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
by: Chizhov, Pavel, et al.
Published: (2024)
Small Vocabularies, Big Gains: Pretraining and Tokenization in Time Series Models
by: Roger, Alexis, et al.
Published: (2025)
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
by: Zhao, Qinyu, et al.
Published: (2024)
Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation
by: Tan, Weiting, et al.
Published: (2025)
Cut Your Losses in Large-Vocabulary Language Models
by: Wijmans, Erik, et al.
Published: (2024)
Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models
by: Kando, Shunsuke, et al.
Published: (2025)
Iterative Auto-Annotation for Scientific Named Entity Recognition Using BERT-Based Models
by: Gupta, Kartik
Published: (2025)
The Tokenization Bottleneck: How Vocabulary Extension Improves Chemistry Representation Learning in Pretrained Language Models
by: Kalamkar, Prathamesh, et al.
Published: (2025)
Vocabulary-level Memory Efficiency for Language Model Fine-tuning
by: Williams, Miles, et al.
Published: (2023)
Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation
by: Moroni, Luca, et al.
Published: (2025)
Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models
by: Sawada, Tomohiro, et al.
Published: (2025)
DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
by: Song, Wei, et al.
Published: (2025)