Staff View: VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Daiqi; Enk, Johannes; Stone, Maureen; Xing, Fangxu; Arias-Vergara, Tomás; Prince, Jerry L.; Hutter, Jana; Woo, Jonghye; Maier, Andreas; Pérez-Toro, Paula Andrea
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2509.13767

_version_	1866917361303420928
author	Liu, Daiqi; Enk, Johannes; Stone, Maureen; Xing, Fangxu; Arias-Vergara, Tomás; Prince, Jerry L.; Hutter, Jana; Woo, Jonghye; Maier, Andreas; Pérez-Toro, Paula Andrea
author_facet	Liu, Daiqi; Enk, Johannes; Stone, Maureen; Xing, Fangxu; Arias-Vergara, Tomás; Prince, Jerry L.; Hutter, Jana; Woo, Jonghye; Maier, Andreas; Pérez-Toro, Paula Andrea
contents	Accurate segmentation of articulatory structures in real-time MRI (rtMRI) remains challenging, as existing methods rely primarily on visual cues and overlook complementary information from synchronized speech signals. We propose VocSegMRI, a multimodal framework integrating video, audio, and phonological inputs via cross-attention fusion and a contrastive learning objective that improves cross-modal alignment and segmentation precision. Evaluated on USC-75 and further validated via zero-shot transfer on USC-TIMIT, VocSegMRI outperforms unimodal and multimodal baselines, with ablations confirming the contribution of each component.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_13767
institution	arXiv
publishDate	2025
record_format	arxiv
title	VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2509.13767

Similar Items