Bibliographic Details
Main Authors: Liu, Daiqi, Enk, Johannes, Stone, Maureen, Xing, Fangxu, Arias-Vergara, Tomás, Prince, Jerry L., Hutter, Jana, Woo, Jonghye, Maier, Andreas, Pérez-Toro, Paula Andrea
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2509.13767
author Liu, Daiqi
Enk, Johannes
Stone, Maureen
Xing, Fangxu
Arias-Vergara, Tomás
Prince, Jerry L.
Hutter, Jana
Woo, Jonghye
Maier, Andreas
Pérez-Toro, Paula Andrea
contents Accurate segmentation of articulatory structures in real-time MRI (rtMRI) remains challenging, as existing methods rely primarily on visual cues and overlook complementary information from synchronized speech signals. We propose VocSegMRI, a multimodal framework integrating video, audio, and phonological inputs via cross-attention fusion and a contrastive learning objective that improves cross-modal alignment and segmentation precision. Evaluated on USC-75 and further validated via zero-shot transfer on USC-TIMIT, VocSegMRI outperforms unimodal and multimodal baselines, with ablations confirming the contribution of each component.
format Preprint
id arxiv_https___arxiv_org_abs_2509_13767
institution arXiv
publishDate 2025
record_format arxiv
title VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2509.13767