Bibliographic Details
Main Authors: Cao, Di; Fu, Dongjie; Yu, Hai; Zheng, Siqi; Tan, Xu; Jin, Tao
Format: Preprint
Published: 2026
Subjects: Audio and Speech Processing; Artificial Intelligence; Computation and Language
Online Access: https://arxiv.org/abs/2603.24596
author Cao, Di
Fu, Dongjie
Yu, Hai
Zheng, Siqi
Tan, Xu
Jin, Tao
contents While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs with those of their text-based counterparts. X-OPD lets the Speech LLM explore its own distribution via on-policy rollouts; a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling the teacher's capabilities into the student's multimodal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap on complex tasks while preserving the model's inherent capabilities.
format Preprint
id arxiv_https___arxiv_org_abs_2603_24596
institution arXiv
publishDate 2026
record_format arxiv
title X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs
topic Audio and Speech Processing
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2603.24596
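
The abstract describes on-policy rollouts from a speech student that are scored token by token by a text-based teacher. The sketch below illustrates that general idea only; it is not the authors' implementation. Everything in it is an assumption made for illustration: the per-token reverse KL divergence as the feedback signal (the record does not state the loss), the teacher conditioning on a text transcript of the spoken prompt, and all function names (sample_rollout, token_level_reverse_kl) and toy logit functions.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_rollout(student_logits_fn, speech_prompt, max_len, rng):
    """Sample a response from the student's own distribution (on-policy)."""
    generated = []
    for _ in range(max_len):
        probs = softmax(student_logits_fn(speech_prompt, generated))
        token = rng.choices(range(len(probs)), weights=probs)[0]
        generated.append(token)
    return generated


def token_level_reverse_kl(student_logits_fn, teacher_logits_fn,
                           speech_prompt, text_prompt, rollout):
    """Return KL(student || teacher) at each position of a student rollout.

    Assumption: the student conditions on the speech prompt, the teacher on
    a text transcript of it, and both see the same student-generated prefix.
    """
    losses = []
    prefix = []
    for token in rollout:
        p_student = softmax(student_logits_fn(speech_prompt, prefix))
        p_teacher = softmax(teacher_logits_fn(text_prompt, prefix))
        kl = sum(s * (math.log(s) - math.log(t))
                 for s, t in zip(p_student, p_teacher) if s > 0)
        losses.append(kl)
        prefix.append(token)
    return losses


# Toy stand-ins for real models: fixed logits over a 3-token vocabulary.
def toy_student(prompt, prefix):
    return [0.0, 1.0, 2.0]


def toy_teacher(prompt, prefix):
    return [2.0, 1.0, 0.0]


rng = random.Random(0)
rollout = sample_rollout(toy_student, "<speech input>", 4, rng)
per_token_loss = token_level_reverse_kl(
    toy_student, toy_teacher, "<speech input>", "<text transcript>", rollout)
```

In a real training loop the per-token values would be averaged and minimized with respect to the student's parameters; here they are only computed, since the toy models have nothing to update.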