| Main Author: | Parra, Iñigo |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.23656 |
Similar Items
Pretraining Language Models with Subword Regularization: An Empirical Study of BPE Dropout in Low-Resource NLP
by: Visser, Ruan, et al.
Published: (2026)
Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE
by: Patwary, Firoj Ahmmed, et al.
Published: (2025)
Neural Correlates of Language Models Are Specific to Human Language
by: Parra, Iñigo
Published: (2025)
The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages
by: Meyer, Francois, et al.
Published: (2025)
UnMASKed: Quantifying Gender Biases in Masked Language Models through Linguistically Informed Job Market Prompts
by: Parra, Iñigo
Published: (2024)
BlockBPE: Parallel BPE Tokenization
by: You, Amos
Published: (2025)
Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
by: Vemula, Saketh Reddy, et al.
Published: (2025)
SuperBPE: Space Travel for Language Models
by: Liu, Alisa, et al.
Published: (2025)
Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models
by: Sawada, Tomohiro, et al.
Published: (2025)
Tokenization Falling Short: On Subword Robustness in Large Language Models
by: Chai, Yekun, et al.
Published: (2024)
Understanding Subword Compositionality of Large Language Models
by: Peng, Qiwei, et al.
Published: (2025)
IMPACT: Inflectional Morphology Probes Across Complex Typologies
by: Saeed, Mohammed J., et al.
Published: (2025)
Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models
by: Balde, Gunjan, et al.
Published: (2024)
On the Effect of (Near) Duplicate Subwords in Language Modelling
by: Schäfer, Anton, et al.
Published: (2024)
Evaluating Morphological Plausibility of Subword Tokenization via Statistical Alignment with Morpho-Syntactic Features
by: Stephen, Abishek, et al.
Published: (2026)
Batching BPE Tokenization Merges
by: Morgan, Alexander P.
Published: (2024)
Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge
by: Batsuren, Khuyagbaatar, et al.
Published: (2024)
Constructing a BPE Tokenization DFA
by: Berglund, Martin, et al.
Published: (2024)
Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay
by: Altinok, Duygu
Published: (2026)
Distributional Properties of Subword Regularization
by: Cognetta, Marco, et al.
Published: (2024)
Lexically Grounded Subword Segmentation
by: Libovický, Jindřich, et al.
Published: (2024)
MoVoC: Morphology-Aware Subword Construction for Geez Script Languages
by: Teklehaymanot, Hailay Kidu, et al.
Published: (2025)
Byte BPE Tokenization as an Inverse string Homomorphism
by: Geng, Saibo, et al.
Published: (2024)
Bit-level BPE: Below the byte boundary
by: Moon, Sangwhan, et al.
Published: (2025)
Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
by: Gigant, Théo, et al.
Published: (2026)
Can Language Models Learn Typologically Implausible Languages?
by: Xu, Tianyang, et al.
Published: (2025)
MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies
by: Asgari, Ehsaneddin, et al.
Published: (2025)
An Analysis of BPE Vocabulary Trimming in Neural Machine Translation
by: Cognetta, Marco, et al.
Published: (2024)
From Characters to Tokens: Dynamic Grouping with Hierarchical BPE
by: Dolga, Rares, et al.
Published: (2025)
AdaptBPE: From General Purpose to Specialized Tokenizers
by: Liyanage, Vijini, et al.
Published: (2026)
Stolen Subwords: Importance of Vocabularies for Machine Translation Model Stealing
by: Zouhar, Vilém
Published: (2024)
Leading Whitespaces of Language Models' Subword Vocabulary Pose a Confound for Calculating Word Probabilities
by: Oh, Byung-Doh, et al.
Published: (2024)
Can Pretrained Language Models Derive Correct Semantics from Corrupt Subwords under Noise?
by: Li, Xinzhe, et al.
Published: (2023)
Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal
by: Lian, Haoran, et al.
Published: (2024)
ByteSpan: Information-Driven Subword Tokenisation
by: Goriely, Zébulon, et al.
Published: (2025)
Subword Tokenization Strategies for Kurdish Word Embeddings
by: Salehi, Ali, et al.
Published: (2025)
BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
by: Land, Sander, et al.
Published: (2025)
Tomato, Tomahto, Tomate: Do Multilingual Language Models Understand Based on Subword-Level Semantic Concepts?
by: Zhang, Crystina, et al.
Published: (2024)
Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking
by: Deng, Iskar, et al.
Published: (2026)
A Typologically Grounded Evaluation Framework for Word Order and Morphology Sensitivity in Multilingual Masked LMs
by: Feldman, Anna, et al.
Published: (2026)