MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Yueyu, Lin; Zhiyuan, Li; Yue, Peter; Xiao, Liu
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2501.15570

_version_	1866915123053985792
author	Yueyu, Lin; Zhiyuan, Li; Yue, Peter; Xiao, Liu
author_facet	Yueyu, Lin; Zhiyuan, Li; Yue, Peter; Xiao, Liu
contents	As is known, hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and Linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. For further research on expressiveness, we introduce our series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention, which aims to make RNNs more expressive and demonstrates state tracking ability beyond transformers. We work with QRWK 32B based on RWKV-6 architecture, another approach that reduces the entire knowledge processing time to just 8 hours using 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens. We will explain the detailed process and share our insights on building more powerful foundation models. Please note that this is ongoing work that will be updated continuously. The model checkpoints and source code are available at https://github.com/yynil/RWKVInside and https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1
format	Preprint
id	arxiv_https___arxiv_org_abs_2501_15570
institution	arXiv
publishDate	2025
record_format	arxiv
title	ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
topic	Computation and Language
url	https://arxiv.org/abs/2501.15570
