Staff View: Spatial Audio Processing with Large Language Model on Wearable Devices :: Library Catalog

Bibliographic Details
Main Authors:	Mishra, Ayushi; Bai, Yang; Narayanasamy, Priyadarshan; Garg, Nakul; Roy, Nirupam
Format:	Preprint
Published:	2025
Subjects:	Sound; Computation and Language; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2504.08907

author	Mishra, Ayushi; Bai, Yang; Narayanasamy, Priyadarshan; Garg, Nakul; Roy, Nirupam
contents	Integrating spatial context into large language models (LLMs) has the potential to revolutionize human-computer interaction, particularly in wearable devices. In this work, we present a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. Our approach leverages microstructure-based spatial sensing to extract precise Direction of Arrival (DoA) information using a monaural microphone. To address the lack of an existing dataset of microstructure-assisted speech recordings, we synthetically create a dataset called OmniTalk from the LibriSpeech dataset. The spatial information is fused with linguistic embeddings from OpenAI's Whisper model, allowing each modality to learn complementary contextual representations. The fused embeddings are aligned with the input space of the LLaMA-3.2 3B model and fine-tuned with the lightweight adaptation technique LoRA to optimize for on-device processing. The resulting system, SING, supports spatially-aware automatic speech recognition (ASR), achieving a mean error of $25.72^\circ$ (a substantial improvement over the $88.52^\circ$ median error in existing work) with a word error rate (WER) of 5.3. SING also supports soundscaping, for example, inferring how many people are talking and their directions, for up to 5 people and with a median DoA error of $16^\circ$. Our system demonstrates superior performance in spatial speech understanding while addressing the challenges of power efficiency, privacy, and hardware constraints, paving the way for advanced applications in augmented reality, accessibility, and immersive experiences.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_08907
institution	arXiv
publishDate	2025
record_format	arxiv
title	Spatial Audio Processing with Large Language Model on Wearable Devices
topic	Sound; Computation and Language; Audio and Speech Processing
url	https://arxiv.org/abs/2504.08907

Similar Items