| Main Authors: | Liu, Yang; Wan, Li; Huang, Yiteng; Xu, Yong; Shi, Yangyang; Adya, Saurabh; Sun, Ming; Metze, Florian |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Audio and Speech Processing |
| Online Access: | https://arxiv.org/abs/2507.05609 |
| _version_ | 1866911044937449472 |
|---|---|
| author | Liu, Yang; Wan, Li; Huang, Yiteng; Xu, Yong; Shi, Yangyang; Adya, Saurabh; Sun, Ming; Metze, Florian |
| author_facet | Liu, Yang; Wan, Li; Huang, Yiteng; Xu, Yong; Shi, Yangyang; Adya, Saurabh; Sun, Ming; Metze, Florian |
| contents | Smart glasses are increasingly positioned as the next-generation interface for ubiquitous access to large language models (LLMs). Nevertheless, achieving reliable interaction in real-world noisy environments remains a major challenge, particularly due to interference from side speech. In this work, we introduce a novel side-talk rejection multi-microphone Whisper (MMW) framework for smart glasses, incorporating three key innovations. First, we propose a Mix Block based on a Tri-Mamba architecture to effectively fuse multi-channel audio at the raw waveform level, while maintaining compatibility with streaming processing. Second, we design a Frame Diarization Mamba Layer to enhance frame-level side-talk suppression, facilitating more efficient fine-tuning of Whisper models. Third, we employ a Multi-Scale Group Relative Policy Optimization (GRPO) strategy to jointly optimize frame-level and utterance-level side speech suppression. Experimental evaluations demonstrate that the proposed MMW system can reduce the word error rate (WER) by 4.95% in noisy conditions. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2507_05609 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | MMW: Side Talk Rejection Multi-Microphone Whisper on Smart Glasses |
| topic | Audio and Speech Processing |
| url | https://arxiv.org/abs/2507.05609 |
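
Note on the reported metric: the abstract states a WER reduction of 4.95% but does not say in this record whether that figure is absolute (percentage points) or relative to a baseline. As a minimal sketch of how the metric is conventionally computed, the Python below derives WER from a word-level edit distance and shows both ways of expressing a reduction. The function names and the toy sentences are illustrative assumptions and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute, relative) reduction; baseline must be non-zero."""
    absolute = baseline - improved
    return absolute, absolute / baseline


# Toy data, purely illustrative (not results from the paper).
reference = "turn on the kitchen lights"
baseline_wer = word_error_rate(reference, "turn on the chicken lights please")
improved_wer = word_error_rate(reference, "turn on the chicken lights")
absolute, relative = wer_reduction(baseline_wer, improved_wer)
```

In this toy case the baseline hypothesis contains one substitution and one insertion against five reference words, and the improved hypothesis contains only the substitution, so the same improvement reads very differently depending on whether it is quoted as an absolute or a relative reduction.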