| Main Authors: | Yadav, Saumitra, Shrivastava, Manish |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.03383 |
Similar Items
Get away with less: Need of source side data curation to build parallel corpus for low resource Machine Translation
by: Yadav, Saumitra, et al.
Published: (2026)
Theoretical Analysis of Byte-Pair Encoding
by: Kozma, László, et al.
Published: (2024)
A Formal Perspective on Byte-Pair Encoding
by: Zouhar, Vilém, et al.
Published: (2023)
Entropy-Driven Pre-Tokenization for Byte-Pair Encoding
by: Hu, Yifan, et al.
Published: (2025)
Language Models over Canonical Byte-Pair Encodings
by: Vieira, Tim, et al.
Published: (2025)
From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
by: Zhang, Wanpeng, et al.
Published: (2024)
Can Constructions "SCAN" Compositionality ?
by: Katrapati, Ganesh, et al.
Published: (2025)
Analyzing Byte-Pair Encoding on Monophonic and Polyphonic Symbolic Music: A Focus on Musical Phrase Segmentation
by: Le, Dinh-Viet-Toan, et al.
Published: (2024)
Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier
by: Schmidt, Craig W., et al.
Published: (2025)
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization
by: Foroutan, Negar, et al.
Published: (2025)
Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition
by: Samin, Ahnaf Mozib
Published: (2024)
Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods
by: Sapkota, Ganesh, et al.
Published: (2025)
Peek2: Regex-free Byte-level Byte-Pair Encoding Pretokenizer for LLM Inference on Edge Devices
by: Zai, Liu, et al.
Published: (2026)
Beyond Single-Reward: Multi-Pair, Multi-Perspective Preference Optimization for Machine Translation
by: Wang, Hao, et al.
Published: (2025)
Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal
by: Lian, Haoran, et al.
Published: (2024)
MARCUS: An Event-Centric NLP Pipeline that generates Character Arcs from Narratives
by: Bhyravajjula, Sriharsh, et al.
Published: (2025)
Automatic Normalization of Word Variations in Code-Mixed Social Media Text
by: Singh, Rajat, et al.
Published: (2018)
LastResort at SemEval-2024 Task 3: Exploring Multimodal Emotion Cause Pair Extraction as Sequence Labelling Task
by: Mathur, Suyash Vardhan, et al.
Published: (2024)
Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation
by: Huang, Langlin, et al.
Published: (2024)
Breaking It Down: Domain-Aware Semantic Segmentation for Retrieval Augmented Generation
by: Allamraju, Aparajitha, et al.
Published: (2025)
MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation
by: Huang, Langlin, et al.
Published: (2024)
Sentiment Analysis of Code-Mixed Languages leveraging Resource Rich Languages
by: Choudhary, Nurendra, et al.
Published: (2018)
Zero-Shot Multi-task Hallucination Detection
by: Bhamidipati, Patanjali, et al.
Published: (2024)
TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu
by: Kanumolu, Gopichand, et al.
Published: (2024)
Emotions are Universal: Learning Sentiment Based Representations of Resource-Poor Languages using Siamese Networks
by: Choudhary, Nurendra, et al.
Published: (2018)
Neural Network Architecture for Credibility Assessment of Textual Claims
by: Choudhary, Nurendra, et al.
Published: (2018)
Contrastive Learning of Emoji-based Representations for Resource-Poor Languages
by: Choudhary, Nurendra, et al.
Published: (2018)
Segmentation-Free Streaming Machine Translation
by: Iranzo-Sánchez, Javier, et al.
Published: (2023)
DaVinci at SemEval-2024 Task 9: Few-shot prompting GPT-3.5 for Unconventional Reasoning
by: Mathur, Suyash Vardhan, et al.
Published: (2024)
Team ACK at SemEval-2025 Task 2: Beyond Word-for-Word Machine Translation for English-Korean Pairs
by: Lee, Daniel, et al.
Published: (2025)
Asymmetric Conflict and Synergy in Post-training for LLM-based Multilingual Machine Translation
by: Zheng, Tong, et al.
Published: (2025)
TabXEval: Why this is a Bad Table? An eXhaustive Rubric for Table Evaluation
by: Pancholi, Vihang, et al.
Published: (2025)
A Survey of using Large Language Models for Generating Infrastructure as Code
by: Srivatsa, Kalahasti Ganesh, et al.
Published: (2024)
Segment-Based Interactive Machine Translation for Pre-trained Models
by: Navarro, Angel, et al.
Published: (2024)
Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair
by: Borisov, Maksim, et al.
Published: (2025)
Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics
by: Perrella, Stefano, et al.
Published: (2024)
Different Speech Translation Models Encode and Translate Speaker Gender Differently
by: Fucci, Dennis, et al.
Published: (2025)
Beyond Reproduction: A Paired-Task Framework for Assessing LLM Comprehension and Creativity in Literary Translation
by: Zhang, Ran, et al.
Published: (2026)
Evaluating Machine Translation Models for English-Hindi Language Pairs: A Comparative Analysis
by: Shetty, Ahan Prasannakumar
Published: (2025)
A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models
by: Xu, Haoran, et al.
Published: (2023)