| Main Authors: | Ali, Mai, Lucasius, Christopher, Patel, Tanmay P., Aitken, Madison, Vorstman, Jacob, Szatmari, Peter, Battaglia, Marco, Kundur, Deepa |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2505.23822 |
Similar Items
Interpreting Multimodal Communication at Scale in Short-Form Video: Visual, Audio, and Textual Mental Health Discourse on TikTok
by: Zha, Mingyue, et al.
Published: (2026)
VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task
by: Wang, Yuyue, et al.
Published: (2025)
Differential Mental Disorder Detection with Psychology-Inspired Multimodal Stimuli
by: Zhou, Zhiyuan, et al.
Published: (2026)
Multimodal Interaction Modeling via Self-Supervised Multi-Task Learning for Review Helpfulness Prediction
by: Gong, HongLin, et al.
Published: (2024)
SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation
by: Huynh, Ngoc Dung, et al.
Published: (2025)
Voices, Faces, and Feelings: Multi-modal Emotion-Cognition Captioning for Mental Health Understanding
by: Zhou, Zhiyuan, et al.
Published: (2026)
MLLM-based Speech Recognition: When and How is Multimodality Beneficial?
by: Guan, Yiwen, et al.
Published: (2025)
Ada2I: Enhancing Modality Balance for Multimodal Conversational Emotion Recognition
by: Nguyen, Cam-Van Thi, et al.
Published: (2024)
MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks
by: Yang, Xiaocui, et al.
Published: (2024)
Virbo: Multimodal Multilingual Avatar Video Generation in Digital Marketing
by: Zhang, Juan, et al.
Published: (2024)
Multimodal LLM-based Query Paraphrasing for Video Search
by: Wu, Jiaxin, et al.
Published: (2024)
Target Speech Diarization with Multimodal Prompts
by: Jiang, Yidi, et al.
Published: (2024)
Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark
by: Zhang, Han, et al.
Published: (2025)
One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning
by: Sun, Hao, et al.
Published: (2024)
Is One-Shot In-Context Learning Helpful for Data Selection in Task-Specific Fine-Tuning of Multimodal LLMs?
by: An, Xiao, et al.
Published: (2026)
Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation
by: Kim, Sungnyun, et al.
Published: (2025)
UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts
by: Cheng, Zhi-Qi, et al.
Published: (2024)
SpeechEE: A Novel Benchmark for Speech Event Extraction
by: Wang, Bin, et al.
Published: (2024)
DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems
by: Guo, Qi, et al.
Published: (2026)
MMESGBench: Pioneering Multimodal Understanding and Complex Reasoning Benchmark for ESG Tasks
by: Zhang, Lei, et al.
Published: (2025)
SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing
by: Ma, Ziyang, et al.
Published: (2026)
Annotation Techniques for Judo Combat Phase Classification from Tournament Footage
by: Miyaguchi, Anthony, et al.
Published: (2024)
AsCL: An Asymmetry-sensitive Contrastive Learning Method for Image-Text Retrieval with Cross-Modal Fusion
by: Gong, Ziyu, et al.
Published: (2024)
WorldGPT: Empowering LLM as Multimodal World Model
by: Ge, Zhiqi, et al.
Published: (2024)
MIDI-LLaMA: An Instruction-Following Multimodal LLM for Symbolic Music Understanding
by: Yang, Meng, et al.
Published: (2026)
MoTAS: MoE-Guided Feature Selection from TTS-Augmented Speech for Enhanced Multimodal Alzheimer's Early Screening
by: Shao, Yongqi, et al.
Published: (2025)
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
by: Yeo, Jeong Hun, et al.
Published: (2025)
HistLLM: A Unified Framework for LLM-Based Multimodal Recommendation with User History Encoding and Compression
by: Zhang, Chen, et al.
Published: (2025)
Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment
by: Hong, Joanna, et al.
Published: (2025)
When Drawing Is Not Enough: Exploring Spontaneous Speech with Sketch for Intent Alignment in Multimodal LLMs
by: Shi, Weiyan, et al.
Published: (2026)
Computational Analysis of Stress, Depression and Engagement in Mental Health: A Survey
by: Kumar, Puneet, et al.
Published: (2024)
Multimodal Classification and Out-of-distribution Detection for Multimodal Intent Understanding
by: Zhang, Hanlei, et al.
Published: (2024)
Multimodal Speech Enhancement Using Burst Propagation
by: Raza, Mohsin, et al.
Published: (2022)
Multi-source Multimodal Progressive Domain Adaption for Audio-Visual Deception Detection
by: Lin, Ronghao, et al.
Published: (2025)
Dark Side of Modalities: Reinforced Multimodal Distillation for Multimodal Knowledge Graph Reasoning
by: Zhao, Yu, et al.
Published: (2025)
MInD: Improving Multimodal Sentiment Analysis via Multimodal Information Disentanglement
by: Dai, Weichen, et al.
Published: (2024)
HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation
by: Cheng, Hongye, et al.
Published: (2025)
TalkSketch: Multimodal Generative AI for Real-time Sketch Ideation with Speech
by: Shi, Weiyan, et al.
Published: (2025)
Hyperbolic Multimodal Generative Representation Learning for Generalized Zero-Shot Multimodal Information Extraction
by: Zhou, Baohang, et al.
Published: (2026)
HyperFusion: Hierarchical Multimodal Ensemble Learning for Social Media Popularity Prediction
by: Ye, Liliang, et al.
Published: (2025)