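| Main Authors: | Liu, Daiqi; Enk, Johannes; Stone, Maureen; Xing, Fangxu; Arias-Vergara, Tomás; Prince, Jerry L.; Hutter, Jana; Woo, Jonghye; Maier, Andreas; Pérez-Toro, Paula Andrea |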
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2509.13767 |
| _version_ | 1866917361303420928 |
|---|---|
| author | Liu, Daiqi; Enk, Johannes; Stone, Maureen; Xing, Fangxu; Arias-Vergara, Tomás; Prince, Jerry L.; Hutter, Jana; Woo, Jonghye; Maier, Andreas; Pérez-Toro, Paula Andrea |
| contents | Accurate segmentation of articulatory structures in real-time MRI (rtMRI) remains challenging, as existing methods rely primarily on visual cues and overlook complementary information from synchronized speech signals. We propose VocSegMRI, a multimodal framework integrating video, audio, and phonological inputs via cross-attention fusion and a contrastive learning objective that improves cross-modal alignment and segmentation precision. Evaluated on USC-75 and further validated via zero-shot transfer on USC-TIMIT, VocSegMRI outperforms unimodal and multimodal baselines, with ablations confirming the contribution of each component. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2509_13767 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2509.13767 |