Bibliographic Details
Main Authors: Ren, Liliang, Chen, Congcong, Xu, Haoran, Kim, Young Jin, Atkinson, Adam, Zhan, Zheng, Sun, Jiankai, Peng, Baolin, Liu, Liyuan, Wang, Shuohang, Cheng, Hao, Gao, Jianfeng, Chen, Weizhu, Shen, Yelong
Format: Preprint
Published: 2025
Subjects: Computation and Language, Machine Learning
Online Access: https://arxiv.org/abs/2507.06607
_version_ 1866918170909999104
author Ren, Liliang
Chen, Congcong
Xu, Haoran
Kim, Young Jin
Atkinson, Adam
Zhan, Zheng
Sun, Jiankai
Peng, Baolin
Liu, Liyuan
Wang, Shuohang
Cheng, Hao
Gao, Jianfeng
Chen, Weizhu
Shen, Yelong
contents Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.
format Preprint
id arxiv_https___arxiv_org_abs_2507_06607
institution arXiv
publishDate 2025
record_format arxiv
title Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation
topic Computation and Language
Machine Learning
url https://arxiv.org/abs/2507.06607