:: Library Catalog

Bibliographic Details
Main Authors:	Yadav, Saumitra; Shrivastava, Manish
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2511.03383

Similar Items

Get away with less: Need of source side data curation to build parallel corpus for low resource Machine Translation
by: Yadav, Saumitra, et al.
Published: (2026)

Theoretical Analysis of Byte-Pair Encoding
by: Kozma, László, et al.
Published: (2024)

A Formal Perspective on Byte-Pair Encoding
by: Zouhar, Vilém, et al.
Published: (2023)

Entropy-Driven Pre-Tokenization for Byte-Pair Encoding
by: Hu, Yifan, et al.
Published: (2025)

Language Models over Canonical Byte-Pair Encodings
by: Vieira, Tim, et al.
Published: (2025)

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
by: Zhang, Wanpeng, et al.
Published: (2024)

Can Constructions "SCAN" Compositionality ?
by: Katrapati, Ganesh, et al.
Published: (2025)

Analyzing Byte-Pair Encoding on Monophonic and Polyphonic Symbolic Music: A Focus on Musical Phrase Segmentation
by: Le, Dinh-Viet-Toan, et al.
Published: (2024)

Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier
by: Schmidt, Craig W., et al.
Published: (2025)

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization
by: Foroutan, Negar, et al.
Published: (2025)

Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition
by: Samin, Ahnaf Mozib
Published: (2024)

Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods
by: Sapkota, Ganesh, et al.
Published: (2025)

Peek2: Regex-free Byte-level Byte-Pair Encoding Pretokenizer for LLM Inference on Edge Devices
by: Zai, Liu, et al.
Published: (2026)

Beyond Single-Reward: Multi-Pair, Multi-Perspective Preference Optimization for Machine Translation
by: Wang, Hao, et al.
Published: (2025)

Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal
by: Lian, Haoran, et al.
Published: (2024)

MARCUS: An Event-Centric NLP Pipeline that generates Character Arcs from Narratives
by: Bhyravajjula, Sriharsh, et al.
Published: (2025)

Automatic Normalization of Word Variations in Code-Mixed Social Media Text
by: Singh, Rajat, et al.
Published: (2018)

LastResort at SemEval-2024 Task 3: Exploring Multimodal Emotion Cause Pair Extraction as Sequence Labelling Task
by: Mathur, Suyash Vardhan, et al.
Published: (2024)

Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation
by: Huang, Langlin, et al.
Published: (2024)

Breaking It Down: Domain-Aware Semantic Segmentation for Retrieval Augmented Generation
by: Allamraju, Aparajitha, et al.
Published: (2025)

MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation
by: Huang, Langlin, et al.
Published: (2024)

Sentiment Analysis of Code-Mixed Languages leveraging Resource Rich Languages
by: Choudhary, Nurendra, et al.
Published: (2018)

Zero-Shot Multi-task Hallucination Detection
by: Bhamidipati, Patanjali, et al.
Published: (2024)

TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu
by: Kanumolu, Gopichand, et al.
Published: (2024)

Emotions are Universal: Learning Sentiment Based Representations of Resource-Poor Languages using Siamese Networks
by: Choudhary, Nurendra, et al.
Published: (2018)

Neural Network Architecture for Credibility Assessment of Textual Claims
by: Choudhary, Nurendra, et al.
Published: (2018)

Contrastive Learning of Emoji-based Representations for Resource-Poor Languages
by: Choudhary, Nurendra, et al.
Published: (2018)

Segmentation-Free Streaming Machine Translation
by: Iranzo-Sánchez, Javier, et al.
Published: (2023)

DaVinci at SemEval-2024 Task 9: Few-shot prompting GPT-3.5 for Unconventional Reasoning
by: Mathur, Suyash Vardhan, et al.
Published: (2024)

Team ACK at SemEval-2025 Task 2: Beyond Word-for-Word Machine Translation for English-Korean Pairs
by: Lee, Daniel, et al.
Published: (2025)

Asymmetric Conflict and Synergy in Post-training for LLM-based Multilingual Machine Translation
by: Zheng, Tong, et al.
Published: (2025)

TabXEval: Why this is a Bad Table? An eXhaustive Rubric for Table Evaluation
by: Pancholi, Vihang, et al.
Published: (2025)

A Survey of using Large Language Models for Generating Infrastructure as Code
by: Srivatsa, Kalahasti Ganesh, et al.
Published: (2024)

Segment-Based Interactive Machine Translation for Pre-trained Models
by: Navarro, Angel, et al.
Published: (2024)

Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair
by: Borisov, Maksim, et al.
Published: (2025)

Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics
by: Perrella, Stefano, et al.
Published: (2024)

Different Speech Translation Models Encode and Translate Speaker Gender Differently
by: Fucci, Dennis, et al.
Published: (2025)

Beyond Reproduction: A Paired-Task Framework for Assessing LLM Comprehension and Creativity in Literary Translation
by: Zhang, Ran, et al.
Published: (2026)

Evaluating Machine Translation Models for English-Hindi Language Pairs: A Comparative Analysis
by: Shetty, Ahan Prasannakumar
Published: (2025)

A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models
by: Xu, Haoran, et al.
Published: (2023)