:: Library Catalog

Bibliographic Details
Main Authors:	Feng, Dawei, Zhang, Yihai, Xu, Zhixuan
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2405.09857

Similar Items

Small Vocabularies, Big Gains: Pretraining and Tokenization in Time Series Models
by: Roger, Alexis, et al.
Published: (2025)

Token Signature: Predicting Chain-of-Thought Gains with Token Decoding Feature in Large Language Models
by: Liu, Peijie, et al.
Published: (2025)

Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models
by: Zhang, Xiang, et al.
Published: (2025)

Adaptive Computation Pruning for the Forgetting Transformer
by: Lin, Zhixuan, et al.
Published: (2025)

IKnow: Instruction-Knowledge-Aware Continual Pretraining for Effective Domain Adaptation
by: Zhang, Tianyi, et al.
Published: (2025)

ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining
by: Kim, Seonwu, et al.
Published: (2025)

Incorporating Domain Knowledge into Materials Tokenization
by: Oh, Yerim, et al.
Published: (2025)

FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering
by: Xue, Siqiao, et al.
Published: (2024)

Optimizing Pretraining Data Mixtures with LLM-Estimated Utility
by: Held, William, et al.
Published: (2025)

Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
by: Jiang, Yuxuan, et al.
Published: (2026)

LLM-Oriented Token-Adaptive Knowledge Distillation
by: Xie, Xurong, et al.
Published: (2025)

Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets
by: Yang, Yuchen, et al.
Published: (2026)

Next Token Knowledge Tracing: Exploiting Pretrained LLM Representations to Decode Student Behaviour
by: Norris, Max, et al.
Published: (2025)

Better RAG using Relevant Information Gain
by: Pickett, Marc, et al.
Published: (2024)

WRAP++: Web discoveRy Amplified Pretraining
by: Zhou, Jiang, et al.
Published: (2026)

Token-level Direct Preference Optimization
by: Zeng, Yongcheng, et al.
Published: (2024)

FASTopic: Pretrained Transformer is a Fast, Adaptive, Stable, and Transferable Topic Model
by: Wu, Xiaobao, et al.
Published: (2024)

Exploring the Benefits of Domain-Pretraining of Generative Large Language Models for Chemistry
by: Acharya, Anurag, et al.
Published: (2024)

Unleashing Diverse Thinking Modes in LLMs through Multi-Agent Collaboration
by: He, Zhixuan, et al.
Published: (2025)

The Tokenization Bottleneck: How Vocabulary Extension Improves Chemistry Representation Learning in Pretrained Language Models
by: Kalamkar, Prathamesh, et al.
Published: (2025)

Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models
by: Sun, Zhouhao, et al.
Published: (2025)

Adaptive Token Biaser: Knowledge Editing via Biasing Key Entities
by: Bi, Baolong, et al.
Published: (2024)

SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning
by: Li, Zheng, et al.
Published: (2025)

MathPile: A Billion-Token-Scale Pretraining Corpus for Math
by: Wang, Zengzhi, et al.
Published: (2023)

Generating Pretraining Tokens from Organic Data for Data-Bound Scaling
by: Yu, Zichun, et al.
Published: (2026)

Backdoor Token Unlearning: Exposing and Defending Backdoors in Pretrained Language Models
by: Jiang, Peihai, et al.
Published: (2025)

AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation
by: Zhang, Songming, et al.
Published: (2025)

DLLMQuant: Quantizing Diffusion-based Large Language Models
by: Xu, Chen, et al.
Published: (2025)

APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning
by: Cao, Guiming, et al.
Published: (2024)

Less is More for RAG: Information Gain Pruning for Generator-Aligned Reranking and Evidence Selection
by: Song, Zhipeng, et al.
Published: (2026)

KnowledgeGain: Evaluating and Optimizing Science News Generation for Reader Learning
by: Soós, Dominik, et al.
Published: (2026)

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
by: Nguyen, Truong, et al.
Published: (2026)

KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications
by: Bommarito, Michael J, et al.
Published: (2025)

MambaQuant: Quantizing the Mamba Family with Variance Aligned Rotation Methods
by: Xu, Zukang, et al.
Published: (2025)

Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions
by: Liu, Quan, et al.
Published: (2024)

Training LLMs Beyond Next Token Prediction -- Filling the Mutual Information Gap
by: Yang, Chun-Hao, et al.
Published: (2025)

Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents
by: Wang, Guoqing, et al.
Published: (2025)

DPO Meets PPO: Reinforced Token Optimization for RLHF
by: Zhong, Han, et al.
Published: (2024)

CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning
by: Yu, Huimu, et al.
Published: (2024)

Adaptive Token Boundaries: Integrating Human Chunking Mechanisms into Multimodal LLMs
by: Yu, Dongxing
Published: (2025)