:: Library Catalog

Bibliographic Details
Main Authors:	Ali, Mai; Lucasius, Christopher; Patel, Tanmay P.; Aitken, Madison; Vorstman, Jacob; Szatmari, Peter; Battaglia, Marco; Kundur, Deepa
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Multimedia
Online Access:	https://arxiv.org/abs/2505.23822

Similar Items

Interpreting Multimodal Communication at Scale in Short-Form Video: Visual, Audio, and Textual Mental Health Discourse on TikTok
by: Zha, Mingyue, et al.
Published: (2026)

VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task
by: Wang, Yuyue, et al.
Published: (2025)

Differential Mental Disorder Detection with Psychology-Inspired Multimodal Stimuli
by: Zhou, Zhiyuan, et al.
Published: (2026)

Multimodal Interaction Modeling via Self-Supervised Multi-Task Learning for Review Helpfulness Prediction
by: Gong, HongLin, et al.
Published: (2024)

SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation
by: Huynh, Ngoc Dung, et al.
Published: (2025)

Voices, Faces, and Feelings: Multi-modal Emotion-Cognition Captioning for Mental Health Understanding
by: Zhou, Zhiyuan, et al.
Published: (2026)

MLLM-based Speech Recognition: When and How is Multimodality Beneficial?
by: Guan, Yiwen, et al.
Published: (2025)

Ada2I: Enhancing Modality Balance for Multimodal Conversational Emotion Recognition
by: Nguyen, Cam-Van Thi, et al.
Published: (2024)

MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks
by: Yang, Xiaocui, et al.
Published: (2024)

Virbo: Multimodal Multilingual Avatar Video Generation in Digital Marketing
by: Zhang, Juan, et al.
Published: (2024)

Multimodal LLM-based Query Paraphrasing for Video Search
by: Wu, Jiaxin, et al.
Published: (2024)

Target Speech Diarization with Multimodal Prompts
by: Jiang, Yidi, et al.
Published: (2024)

Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark
by: Zhang, Han, et al.
Published: (2025)

One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning
by: Sun, Hao, et al.
Published: (2024)

Is One-Shot In-Context Learning Helpful for Data Selection in Task-Specific Fine-Tuning of Multimodal LLMs?
by: An, Xiao, et al.
Published: (2026)

Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation
by: Kim, Sungnyun, et al.
Published: (2025)

UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts
by: Cheng, Zhi-Qi, et al.
Published: (2024)

SpeechEE: A Novel Benchmark for Speech Event Extraction
by: Wang, Bin, et al.
Published: (2024)

DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems
by: Guo, Qi, et al.
Published: (2026)

MMESGBench: Pioneering Multimodal Understanding and Complex Reasoning Benchmark for ESG Tasks
by: Zhang, Lei, et al.
Published: (2025)

SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing
by: Ma, Ziyang, et al.
Published: (2026)

Annotation Techniques for Judo Combat Phase Classification from Tournament Footage
by: Miyaguchi, Anthony, et al.
Published: (2024)

AsCL: An Asymmetry-sensitive Contrastive Learning Method for Image-Text Retrieval with Cross-Modal Fusion
by: Gong, Ziyu, et al.
Published: (2024)

WorldGPT: Empowering LLM as Multimodal World Model
by: Ge, Zhiqi, et al.
Published: (2024)

MIDI-LLaMA: An Instruction-Following Multimodal LLM for Symbolic Music Understanding
by: Yang, Meng, et al.
Published: (2026)

MoTAS: MoE-Guided Feature Selection from TTS-Augmented Speech for Enhanced Multimodal Alzheimer's Early Screening
by: Shao, Yongqi, et al.
Published: (2025)

MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
by: Yeo, Jeong Hun, et al.
Published: (2025)

HistLLM: A Unified Framework for LLM-Based Multimodal Recommendation with User History Encoding and Compression
by: Zhang, Chen, et al.
Published: (2025)

Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment
by: Hong, Joanna, et al.
Published: (2025)

When Drawing Is Not Enough: Exploring Spontaneous Speech with Sketch for Intent Alignment in Multimodal LLMs
by: Shi, Weiyan, et al.
Published: (2026)

Computational Analysis of Stress, Depression and Engagement in Mental Health: A Survey
by: Kumar, Puneet, et al.
Published: (2024)

Multimodal Classification and Out-of-distribution Detection for Multimodal Intent Understanding
by: Zhang, Hanlei, et al.
Published: (2024)

Multimodal Speech Enhancement Using Burst Propagation
by: Raza, Mohsin, et al.
Published: (2022)

Multi-source Multimodal Progressive Domain Adaption for Audio-Visual Deception Detection
by: Lin, Ronghao, et al.
Published: (2025)

Dark Side of Modalities: Reinforced Multimodal Distillation for Multimodal Knowledge Graph Reasoning
by: Zhao, Yu, et al.
Published: (2025)

MInD: Improving Multimodal Sentiment Analysis via Multimodal Information Disentanglement
by: Dai, Weichen, et al.
Published: (2024)

HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation
by: Cheng, Hongye, et al.
Published: (2025)

TalkSketch: Multimodal Generative AI for Real-time Sketch Ideation with Speech
by: Shi, Weiyan, et al.
Published: (2025)

Hyperbolic Multimodal Generative Representation Learning for Generalized Zero-Shot Multimodal Information Extraction
by: Zhou, Baohang, et al.
Published: (2026)

HyperFusion: Hierarchical Multimodal Ensemble Learning for Social Media Popularity Prediction
by: Ye, Liliang, et al.
Published: (2025)