Bibliographic Details
Main Authors: Yang, Lei; Pan, Leiyu; Xiong, Bojian; Jin, Renren; Zhang, Shaowei; Chen, Yue; Shi, Ling; Zhou, Jiang; Wu, Junru; Wang, Zhen; Peng, Jianxiang; Xiao, Juesi; Dong, Tianyu; Han, Zhuowen; Chen, Zhuo; Ren, Yuqi; Xiong, Deyi
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2507.09205
author Yang, Lei
Pan, Leiyu
Xiong, Bojian
Jin, Renren
Zhang, Shaowei
Chen, Yue
Shi, Ling
Zhou, Jiang
Wu, Junru
Wang, Zhen
Peng, Jianxiang
Xiao, Juesi
Dong, Tianyu
Han, Zhuowen
Chen, Zhuo
Ren, Yuqi
Xiong, Deyi
contents Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their performance remains heavily biased toward high-resource languages. Tibetan, despite its cultural significance and large speaker population, is still substantially underrepresented. In this work, we present a comprehensive pipeline for advancing Tibetan language modeling through large-scale data curation and continual pre-training. We construct a 72 GB high-quality Tibetan corpus, the largest to date, and adapt Qwen2.5-7B through balanced multilingual continual pre-training with Tibetan, Chinese, and English, followed by multilingual instruction tuning. To further scale capacity efficiently, we extend the dense model to a 50B-A10B Mixture-of-Experts architecture. Due to the absence of standardized Tibetan benchmarks, we build multiple evaluation datasets via high-quality translation and human verification. Experimental results show that both dense and MoE models consistently outperform existing open-source and Tibetan-focused models of similar scale across diverse tasks. Our work advances Tibetan-centric LLM research and provides transferable insights for extending LLMs to other low-resource languages. We will release the model weights, evaluation benchmarks, and detailed data processing documentation in the follow-up.
format Preprint
id arxiv_https___arxiv_org_abs_2507_09205
institution arXiv
publishDate 2025
record_format arxiv
title From Curated Data to Scalable Models: Continual Pre-training of Dense and MoE Large Language Models for Tibetan
topic Computation and Language
url https://arxiv.org/abs/2507.09205