Library Catalog

Bibliographic Details
Main Authors:	Wei, Xiuying, Yadav, Anunay, Pascanu, Razvan, Gulcehre, Caglar
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2507.04416

Similar Items

Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis
by: Wei, Xiuying, et al.
Published: (2024)

Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers
by: Wei, Xiuying, et al.
Published: (2024)

RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
by: Wei, Xiuying, et al.
Published: (2026)

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity
by: Wei, Xiuying, et al.
Published: (2026)

Beyond Autoregression: Fast LLMs via Self-Distillation Through Time
by: Deschenaux, Justin, et al.
Published: (2024)

Promises, Outlooks and Challenges of Diffusion Language Modeling
by: Deschenaux, Justin, et al.
Published: (2024)

Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues
by: Orvieto, Antonio, et al.
Published: (2023)

The Emergence of Chunking Structures with Hierarchical RNN
by: Wu, Zijun, et al.
Published: (2023)

No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPO
by: Moalla, Skander, et al.
Published: (2024)

From Markov to Laplace: How Mamba In-Context Learns Markov Chains
by: Bondaschi, Marco, et al.
Published: (2025)

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
by: De, Soham, et al.
Published: (2024)

Aligning Large Language Models with Diverse Political Viewpoints
by: Stammbach, Dominik, et al.
Published: (2024)

Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models
by: Gu, Xiangming, et al.
Published: (2026)

Context-Aware Toxicity Detection in Multiplayer Games: Integrating Domain-Adaptive Pretraining and Match Metadata
by: Schurger-Foy, Adrien, et al.
Published: (2025)

BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers
by: Deschenaux, Justin, et al.
Published: (2026)

Fleet of Agents: Coordinated Problem Solving with Large Language Models
by: Klein, Lars, et al.
Published: (2024)

Self-Recognition in Language Models
by: Davidson, Tim R., et al.
Published: (2024)

Round and Round We Go! What makes Rotary Positional Encodings useful?
by: Barbero, Federico, et al.
Published: (2024)

Perplexity Cannot Always Tell Right from Wrong
by: Veličković, Petar, et al.
Published: (2026)

How do language models learn facts? Dynamics, curricula and hallucinations
by: Zucchet, Nicolas, et al.
Published: (2025)

The Illusion of Stochasticity in LLMs
by: Gu, Xiangming, et al.
Published: (2026)

CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling
by: Bai, Yu, et al.
Published: (2024)

Why do LLMs attend to the first token?
by: Barbero, Federico, et al.
Published: (2025)

Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions
by: He, Zhihao, et al.
Published: (2024)

LLM-as-RNN: A Recurrent Language Model for Memory Updates and Sequence Prediction
by: Lu, Yuxing, et al.
Published: (2026)

A Study of the Plausibility of Attention between RNN Encoders in Natural Language Inference
by: Nguyen, Duc Hau, et al.
Published: (2025)

ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
by: Yueyu, Lin, et al.
Published: (2025)

Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models
by: Rannen-Triki, Amal, et al.
Published: (2024)

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection
by: Song, Jiwon, et al.
Published: (2026)

Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering
by: Maharjan, Anuj, et al.
Published: (2026)

Dynamic Chunking for Diffusion Language Models
by: Zhu, Yichen, et al.
Published: (2026)

MesaNet: Sequence Modeling by Locally Optimal Test-Time Training
by: von Oswald, Johannes, et al.
Published: (2025)

Transformers meet Neural Algorithmic Reasoners
by: Bounsi, Wilfried, et al.
Published: (2024)

Chunk-Distilled Language Modeling
by: Li, Yanhong, et al.
Published: (2024)

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
by: Ye, Lu, et al.
Published: (2024)

ChunkRAG: Novel LLM-Chunk Filtering Method for RAG Systems
by: Singh, Ishneet Sukhvinder, et al.
Published: (2024)

Learning Transductions and Alignments with RNN Seq2seq Models
by: Wang, Zhengxiang
Published: (2023)

Transformers need glasses! Information over-squashing in language tasks
by: Barbero, Federico, et al.
Published: (2024)

Beyond Chunk-Local Extraction: Cross-Chunk Graph Augmentation for GraphRAG
by: Zhang, Jiaming, et al.
Published: (2026)

GhostRNN: Reducing State Redundancy in RNN with Cheap Operations
by: Zhou, Hang, et al.
Published: (2024)