Bibliographic Details
Main Authors: Liu, Yang; Wan, Li; Huang, Yiteng; Xu, Yong; Shi, Yangyang; Adya, Saurabh; Sun, Ming; Metze, Florian
Format: Preprint
Published: 2025
Subjects: Audio and Speech Processing
Online Access: https://arxiv.org/abs/2507.05609
author Liu, Yang
Wan, Li
Huang, Yiteng
Xu, Yong
Shi, Yangyang
Adya, Saurabh
Sun, Ming
Metze, Florian
contents Smart glasses are increasingly positioned as the next-generation interface for ubiquitous access to large language models (LLMs). Nevertheless, achieving reliable interaction in real-world noisy environments remains a major challenge, particularly due to interference from side speech. In this work, we introduce a novel side-talk rejection multi-microphone Whisper (MMW) framework for smart glasses, incorporating three key innovations. First, we propose a Mix Block based on a Tri-Mamba architecture to effectively fuse multi-channel audio at the raw waveform level, while maintaining compatibility with streaming processing. Second, we design a Frame Diarization Mamba Layer to enhance frame-level side-talk suppression, facilitating more efficient fine-tuning of Whisper models. Third, we employ a Multi-Scale Group Relative Policy Optimization (GRPO) strategy to jointly optimize frame-level and utterance-level side speech suppression. Experimental evaluations demonstrate that the proposed MMW system can reduce the word error rate (WER) by 4.95% in noisy conditions.
format Preprint
id arxiv_https___arxiv_org_abs_2507_05609
institution arXiv
publishDate 2025
record_format arxiv
title MMW: Side Talk Rejection Multi-Microphone Whisper on Smart Glasses
topic Audio and Speech Processing
url https://arxiv.org/abs/2507.05609