Staff View: MMW: Side Talk Rejection Multi-Microphone Whisper on Smart Glasses :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Yang; Wan, Li; Huang, Yiteng; Xu, Yong; Shi, Yangyang; Adya, Saurabh; Sun, Ming; Metze, Florian
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2507.05609

_version_	1866911044937449472
author	Liu, Yang; Wan, Li; Huang, Yiteng; Xu, Yong; Shi, Yangyang; Adya, Saurabh; Sun, Ming; Metze, Florian
author_facet	Liu, Yang; Wan, Li; Huang, Yiteng; Xu, Yong; Shi, Yangyang; Adya, Saurabh; Sun, Ming; Metze, Florian
contents	Smart glasses are increasingly positioned as the next-generation interface for ubiquitous access to large language models (LLMs). Nevertheless, achieving reliable interaction in real-world noisy environments remains a major challenge, particularly due to interference from side speech. In this work, we introduce a novel side-talk rejection multi-microphone Whisper (MMW) framework for smart glasses, incorporating three key innovations. First, we propose a Mix Block based on a Tri-Mamba architecture to effectively fuse multi-channel audio at the raw waveform level, while maintaining compatibility with streaming processing. Second, we design a Frame Diarization Mamba Layer to enhance frame-level side-talk suppression, facilitating more efficient fine-tuning of Whisper models. Third, we employ a Multi-Scale Group Relative Policy Optimization (GRPO) strategy to jointly optimize frame-level and utterance-level side speech suppression. Experimental evaluations demonstrate that the proposed MMW system can reduce the word error rate (WER) by 4.95% in noisy conditions.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_05609
institution	arXiv
publishDate	2025
record_format	arxiv
title	MMW: Side Talk Rejection Multi-Microphone Whisper on Smart Glasses
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2507.05609

Similar Items