| Main Authors: | Wei, Xiuying, Yadav, Anunay, Pascanu, Razvan, Gulcehre, Caglar |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.04416 |
Similar Items
Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis
by: Wei, Xiuying, et al.
Published: (2024)
Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers
by: Wei, Xiuying, et al.
Published: (2024)
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
by: Wei, Xiuying, et al.
Published: (2026)
Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity
by: Wei, Xiuying, et al.
Published: (2026)
Beyond Autoregression: Fast LLMs via Self-Distillation Through Time
by: Deschenaux, Justin, et al.
Published: (2024)
Promises, Outlooks and Challenges of Diffusion Language Modeling
by: Deschenaux, Justin, et al.
Published: (2024)
Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues
by: Orvieto, Antonio, et al.
Published: (2023)
The Emergence of Chunking Structures with Hierarchical RNN
by: Wu, Zijun, et al.
Published: (2023)
No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPO
by: Moalla, Skander, et al.
Published: (2024)
From Markov to Laplace: How Mamba In-Context Learns Markov Chains
by: Bondaschi, Marco, et al.
Published: (2025)
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
by: De, Soham, et al.
Published: (2024)
Aligning Large Language Models with Diverse Political Viewpoints
by: Stammbach, Dominik, et al.
Published: (2024)
Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models
by: Gu, Xiangming, et al.
Published: (2026)
Context-Aware Toxicity Detection in Multiplayer Games: Integrating Domain-Adaptive Pretraining and Match Metadata
by: Schurger-Foy, Adrien, et al.
Published: (2025)
BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers
by: Deschenaux, Justin, et al.
Published: (2026)
Fleet of Agents: Coordinated Problem Solving with Large Language Models
by: Klein, Lars, et al.
Published: (2024)
Self-Recognition in Language Models
by: Davidson, Tim R., et al.
Published: (2024)
Round and Round We Go! What makes Rotary Positional Encodings useful?
by: Barbero, Federico, et al.
Published: (2024)
Perplexity Cannot Always Tell Right from Wrong
by: Veličković, Petar, et al.
Published: (2026)
How do language models learn facts? Dynamics, curricula and hallucinations
by: Zucchet, Nicolas, et al.
Published: (2025)
The Illusion of Stochasticity in LLMs
by: Gu, Xiangming, et al.
Published: (2026)
CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling
by: Bai, Yu, et al.
Published: (2024)
Why do LLMs attend to the first token?
by: Barbero, Federico, et al.
Published: (2025)
Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions
by: He, Zhihao, et al.
Published: (2024)
LLM-as-RNN: A Recurrent Language Model for Memory Updates and Sequence Prediction
by: Lu, Yuxing, et al.
Published: (2026)
A Study of the Plausibility of Attention between RNN Encoders in Natural Language Inference
by: Nguyen, Duc Hau, et al.
Published: (2025)
ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
by: Yueyu, Lin, et al.
Published: (2025)
Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models
by: Rannen-Triki, Amal, et al.
Published: (2024)
CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection
by: Song, Jiwon, et al.
Published: (2026)
Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering
by: Maharjan, Anuj, et al.
Published: (2026)
Dynamic Chunking for Diffusion Language Models
by: Zhu, Yichen, et al.
Published: (2026)
MesaNet: Sequence Modeling by Locally Optimal Test-Time Training
by: von Oswald, Johannes, et al.
Published: (2025)
Transformers meet Neural Algorithmic Reasoners
by: Bounsi, Wilfried, et al.
Published: (2024)
Chunk-Distilled Language Modeling
by: Li, Yanhong, et al.
Published: (2024)
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
by: Ye, Lu, et al.
Published: (2024)
ChunkRAG: Novel LLM-Chunk Filtering Method for RAG Systems
by: Singh, Ishneet Sukhvinder, et al.
Published: (2024)
Learning Transductions and Alignments with RNN Seq2seq Models
by: Wang, Zhengxiang
Published: (2023)
Transformers need glasses! Information over-squashing in language tasks
by: Barbero, Federico, et al.
Published: (2024)
Beyond Chunk-Local Extraction: Cross-Chunk Graph Augmentation for GraphRAG
by: Zhang, Jiaming, et al.
Published: (2026)
GhostRNN: Reducing State Redundancy in RNN with Cheap Operations
by: Zhou, Hang, et al.
Published: (2024)