| Main Authors: | Huang, Ruizhe, Zhang, Kexuan, Fang, Yihao, Yu, Baifeng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.23862 |
Similar Items
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
by: Munkhdalai, Tsendsuren, et al.
Published: (2024)
Causal Abstraction in Model Interpretability: A Compact Survey
by: Zhang, Yihao
Published: (2024)
DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models
by: Zhang, Yuxuan, et al.
Published: (2024)
Generating Pretraining Tokens from Organic Data for Data-Bound Scaling
by: Yu, Zichun, et al.
Published: (2026)
To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining
by: Singh, Karan, et al.
Published: (2026)
SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
by: S, Santhosh G, et al.
Published: (2025)
MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification
by: Zheng, Bo, et al.
Published: (2026)
Pooling Attention: Evaluating Pretrained Transformer Embeddings for Deception Classification
by: Mamtani, Sumit, et al.
Published: (2025)
Small Vocabularies, Big Gains: Pretraining and Tokenization in Time Series Models
by: Roger, Alexis, et al.
Published: (2025)
Limitations of Normalization in Attention Mechanism
by: Mudarisov, Timur, et al.
Published: (2025)
Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions
by: Li, Ruizhe, et al.
Published: (2024)
MathPile: A Billion-Token-Scale Pretraining Corpus for Math
by: Wang, Zengzhi, et al.
Published: (2023)
Eigen Attention: Attention in Low-Rank Space for KV Cache Compression
by: Saxena, Utkarsh, et al.
Published: (2024)
Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study
by: Tan, Shawn, et al.
Published: (2024)
HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
by: Yang, Dongquan, et al.
Published: (2025)
Goal-Directed Search Outperforms Goal-Agnostic Memory Compression in Long-Context Memory Tasks
by: Zheng, Yicong, et al.
Published: (2025)
Scaling Reasoning without Attention
by: Zhao, Xueliang, et al.
Published: (2025)
Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
by: Li, Ruizhe, et al.
Published: (2025)
Clustering-driven Memory Compression for On-device Large Language Models
by: Bohdal, Ondrej, et al.
Published: (2026)
Ladder: A Model-Agnostic Framework Boosting LLM-based Machine Translation to the Next Level
by: Feng, Zhaopeng, et al.
Published: (2024)
Time and Memory Trade-off of KV-Cache Compression in Tensor Transformer Decoding
by: Chen, Yifang, et al.
Published: (2025)
Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset
by: Mahabadi, Rabeeh Karimi, et al.
Published: (2025)
MemRerank: Preference Memory for Personalized Product Reranking
by: Peng, Zhiyuan, et al.
Published: (2026)
Training-free Ultra Small Model for Universal Sparse Reconstruction in Compressed Sensing
by: Tang, Chaoqing, et al.
Published: (2025)
VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models
by: Zhang, Hanling, et al.
Published: (2025)
MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining
by: Chen, Zhixun, et al.
Published: (2025)
SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
by: Xu, Yifei, et al.
Published: (2026)
Small Language Models for Application Interactions: A Case Study
by: Li, Beibin, et al.
Published: (2024)
Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
by: Itzhak, Itay, et al.
Published: (2025)
Mixture of Chapters: Scaling Learnt Memory in Transformers
by: Tibrewal, Tasmay Pankaj, et al.
Published: (2026)
Pre-training Limited Memory Language Models with Internal and External Knowledge
by: Zhao, Linxi, et al.
Published: (2025)
Lost in the Prompt Order: Revealing the Limitations of Causal Attention in Language Models
by: Ok, Hyunjong, et al.
Published: (2026)
Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models
by: Zhang, Jun, et al.
Published: (2025)
FineZip : Pushing the Limits of Large Language Models for Practical Lossless Text Compression
by: Mittu, Fazal, et al.
Published: (2024)
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
by: Lin, Bill Yuchen, et al.
Published: (2025)
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
by: Alizadeh, Keivan, et al.
Published: (2023)
TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling
by: Lin, Weizhe, et al.
Published: (2025)
A Framework for Inference Inspired by Human Memory Mechanisms
by: Zeng, Xiangyu, et al.
Published: (2023)
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
by: Xiaomi, LLM-Core, et al.
Published: (2025)
Secure LLM Fine-Tuning via Safety-Aware Probing
by: Wu, Chengcan, et al.
Published: (2025)