| Main Authors: | Ikeda, Wataru, Yano, Kazuki, Takahashi, Ryosuke, Lee, Jaesung, Shibata, Keigo, Suzuki, Jun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2508.17734 |
Similar Items
Suppressing Final Layer Hidden State Jumps in Transformer Pretraining
by: Shibata, Keigo, et al.
Published: (2026)
STEP: Staged Parameter-Efficient Pre-training for Large Language Models
by: Yano, Kazuki, et al.
Published: (2025)
Adapting Text LLMs to Speech via Multimodal Depth Up-Scaling
by: Yano, Kazuki, et al.
Published: (2026)
TimeMachine-bench: A Benchmark for Evaluating Model Capabilities in Repository-Level Migration Tasks
by: Fujii, Ryo, et al.
Published: (2026)
Efficient Construction of Model Family through Progressive Training Using Model Expansion
by: Yano, Kazuki, et al.
Published: (2025)
Reconsidering Positional Supervision in Masked Diffusion Language Model Training
by: Ye, Mengyu, et al.
Published: (2026)
Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning
by: Yano, Kazuki, et al.
Published: (2026)
LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
by: Pan, Rui, et al.
Published: (2024)
Merging Feed-Forward Sublayers for Compressed Transformers
by: Verma, Neha, et al.
Published: (2025)
An Open and Reproducible Deep Research Agent for Long-Form Question Answering
by: Yamada, Ikuya, et al.
Published: (2025)
Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models
by: Min, Nay Myat, et al.
Published: (2026)
DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers
by: Langedijk, Anna, et al.
Published: (2023)
FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference
by: Liu, Zirui, et al.
Published: (2024)
Exploring Narrative Clustering in Large Language Models: A Layerwise Analysis of BERT
by: Banerjee, Awritrojit, et al.
Published: (2025)
Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps
by: Kobayashi, Goro, et al.
Published: (2023)
DLP: Dynamic Layerwise Pruning in Large Language Models
by: Chen, Yuli, et al.
Published: (2025)
Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers
by: Lin, Tzu-Quan, et al.
Published: (2025)
TopK Language Models
by: Takahashi, Ryosuke, et al.
Published: (2025)
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
by: Bozic, Vukasin, et al.
Published: (2023)
Layerwise Change of Knowledge in Neural Networks
by: Cheng, Xu, et al.
Published: (2024)
Flash Multi-Head Feed-Forward Network
by: Zhang, Minshen, et al.
Published: (2025)
TESS 2: A Large-Scale Generalist Diffusion Language Model
by: Tae, Jaesung, et al.
Published: (2025)
Layerwise Recurrent Router for Mixture-of-Experts
by: Qiu, Zihan, et al.
Published: (2024)
Adaptive Large Language Models By Layerwise Attention Shortcuts
by: Verma, Prateek, et al.
Published: (2024)
LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation
by: Wu, Haihang
Published: (2024)
The Curse of Popularity: Popular Entities have Catastrophic Side Effects when Deleting Knowledge from Language Models
by: Takahashi, Ryosuke, et al.
Published: (2024)
From Compression to Expression: A Layerwise Analysis of In-Context Learning
by: Jiang, Jiachen, et al.
Published: (2025)
Task Structure Reverses Layerwise State Encoding in Sequence Models
by: Jiang, Yuhang
Published: (2026)
Task-driven Layerwise Additive Activation Intervention
by: Nguyen, Hieu Trung, et al.
Published: (2025)
Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models
by: Gerber, Isaac
Published: (2025)
Theory of Hallucinations based on Equivariance
by: Shibata, Hisaichi
Published: (2023)
Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers
by: Cherilyn, Audrey, et al.
Published: (2026)
Fine-Tuning Language Models with Just Forward Passes
by: Malladi, Sadhika, et al.
Published: (2023)
Can Language Models Handle a Non-Gregorian Calendar? The Case of the Japanese wareki
by: Sasaki, Mutsumi, et al.
Published: (2025)
Layerwise Recall and the Geometry of Interwoven Knowledge in LLMs
by: Lei, Ge, et al.
Published: (2025)
LogitTrace: Detecting Benchmark Contamination via Layerwise Logit Trajectories
by: He, Zirui, et al.
Published: (2025)
Can Large Language Models Invent Algorithms to Improve Themselves?: Algorithm Discovery for Recursive Self-Improvement through Reinforcement Learning
by: Ishibashi, Yoichi, et al.
Published: (2024)
Stabilizing Reasoning in Medical LLMs with Continued Pretraining and Reasoning Preference Optimization
by: Kawakami, Wataru, et al.
Published: (2025)
LCES: Zero-shot Automated Essay Scoring via Pairwise Comparisons Using Large Language Models
by: Shibata, Takumi, et al.
Published: (2025)
Pruning Multilingual Large Language Models for Multilingual Inference
by: Kim, Hwichan, et al.
Published: (2024)