| Main Authors: | Cao, Di; Fu, Dongjie; Yu, Hai; Zheng, Siqi; Tan, Xu; Jin, Tao |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Audio and Speech Processing; Artificial Intelligence; Computation and Language |
| Online Access: | https://arxiv.org/abs/2603.24596 |
| _version_ | 1866915898040778752 |
|---|---|
| author | Cao, Di; Fu, Dongjie; Yu, Hai; Zheng, Siqi; Tan, Xu; Jin, Tao |
| author_facet | Cao, Di; Fu, Dongjie; Yu, Hai; Zheng, Siqi; Tan, Xu; Jin, Tao |
| contents | While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs to their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts, where a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling the teacher's capabilities into the student's multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap in complex tasks while preserving the model's inherent capabilities. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2603_24596 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs |
| topic | Audio and Speech Processing; Artificial Intelligence; Computation and Language |
| url | https://arxiv.org/abs/2603.24596 |