| Main Authors: | Li, Zhaoyang; Zhou, Haodong; Luo, Longjie; Li, Xiaoxiao; Chen, Yongxin; Li, Lin; Hong, Qingyang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Sound |
| Online Access: | https://arxiv.org/abs/2506.02621 |
| _version_ | 1866908391468695552 |
|---|---|
| author | Li, Zhaoyang; Zhou, Haodong; Luo, Longjie; Li, Xiaoxiao; Chen, Yongxin; Li, Lin; Hong, Qingyang |
| author_facet | Li, Zhaoyang; Zhou, Haodong; Luo, Longjie; Li, Xiaoxiao; Chen, Yongxin; Li, Lin; Hong, Qingyang |
| contents | This paper presents the system developed for Task 1 of the Multi-modal Information-based Speech Processing (MISP) 2025 Challenge. We introduce CASA-Net, an embedding fusion method designed for end-to-end audio-visual speaker diarization (AVSD) systems. CASA-Net incorporates a cross-attention (CA) module to effectively capture cross-modal interactions in audio-visual signals and employs a self-attention (SA) module to learn contextual relationships among audio-visual frames. To further enhance performance, we adopt a training strategy that integrates pseudo-label refinement and retraining, improving the accuracy of timestamp predictions. Additionally, median filtering and overlap averaging are applied as post-processing techniques to eliminate outliers and smooth prediction labels. Our system achieved a diarization error rate (DER) of 8.18% on the evaluation set, representing a relative improvement of 47.3% over the baseline DER of 15.52%. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2506_02621 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Cross-attention and Self-attention for Audio-visual Speaker Diarization in MISP-Meeting Challenge |
| topic | Sound |
| url | https://arxiv.org/abs/2506.02621 |
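
The abstract names median filtering and overlap averaging as post-processing steps but gives no parameters. The following is a minimal sketch of how such steps are commonly applied to per-frame speaker activity probabilities, not the authors' implementation. The function names, the kernel size, the chunk layout, and the hop length are all hypothetical choices made for illustration.

```python
from statistics import median


def median_filter(probs, kernel=5):
    """Sliding-window median over per-frame probabilities for one speaker.

    Assumption: the window shrinks at the sequence edges instead of padding.
    """
    half = kernel // 2
    return [
        median(probs[max(0, i - half): i + half + 1])
        for i in range(len(probs))
    ]


def overlap_average(chunks, hop):
    """Average predictions from equal-length chunks that overlap in time.

    Assumption: chunk k starts at frame k * hop, and hop <= chunk length,
    so every frame is covered by at least one chunk.
    """
    if not chunks:
        return []
    length = len(chunks[0])
    if hop < 1 or hop > length:
        raise ValueError("hop must be between 1 and the chunk length")
    total = hop * (len(chunks) - 1) + length
    sums = [0.0] * total
    counts = [0] * total
    for k, chunk in enumerate(chunks):
        for j, p in enumerate(chunk):
            sums[k * hop + j] += p
            counts[k * hop + j] += 1
    return [s / c for s, c in zip(sums, counts)]


# Hypothetical usage: merge two overlapping windows, then smooth.
merged = overlap_average([[1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]], hop=2)
smoothed = median_filter(merged, kernel=3)
```

Averaging first and filtering second is one plausible ordering; the abstract does not state which order the system uses. For reference, the reported relative improvement follows from (15.52 - 8.18) / 15.52 ≈ 0.473.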