MARC View: Library Catalog

Bibliographic Details
Main Authors:	Cao, Di; Fu, Dongjie; Yu, Hai; Zheng, Siqi; Tan, Xu; Jin, Tao
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2603.24596

_version_	1866915898040778752
author	Cao, Di; Fu, Dongjie; Yu, Hai; Zheng, Siqi; Tan, Xu; Jin, Tao
author_facet	Cao, Di; Fu, Dongjie; Yu, Hai; Zheng, Siqi; Tan, Xu; Jin, Tao
contents	While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs with those of their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts; a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling the teacher's capabilities into the student's multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap on complex tasks while preserving the model's inherent capabilities.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_24596
institution	arXiv
publishDate	2026
record_format	arxiv
title	X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs
topic	Audio and Speech Processing; Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2603.24596
