:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Cheng, Hongye, Wang, Tianyu, Shi, Guangsi, Zhao, Zexing, Fu, Yanwei
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition Multimedia
Online Access:	https://arxiv.org/abs/2503.01175
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

PersonaGest: Personalized Co-Speech Gesture Generation with Semantic-Guided Hierarchical Motion Representation
by: Zhao, Junchuan, et al.
Published: (2026)

EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation
by: Qi, Xingqun, et al.
Published: (2023)

I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue
by: Ghaleb, Esam, et al.
Published: (2025)

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model
by: He, Xu, et al.
Published: (2024)

Towards Unified Co-Speech Gesture Generation via Hierarchical Implicit Periodicity Learning
by: Guo, Xin, et al.
Published: (2025)

DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder
by: Du, Chenpeng, et al.
Published: (2023)

ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context
by: Zheng, Sixiao, et al.
Published: (2024)

Talking Head Generation Driven by Speech-Related Facial Action Units and Audio- Based on Multimodal Representation Fusion
by: Chen, Sen, et al.
Published: (2022)

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
by: Zhong, Zhisheng, et al.
Published: (2024)

Intelligent Director: An Automatic Framework for Dynamic Visual Composition using ChatGPT
by: Zheng, Sixiao, et al.
Published: (2024)

MagicAnime: A Hierarchically Annotated, Multimodal and Multitasking Dataset with Benchmarks for Cartoon Animation Generation
by: Xu, Shuolin, et al.
Published: (2025)

Co-Reinforcement Learning for Unified Multimodal Understanding and Generation
by: Jiang, Jingjing, et al.
Published: (2025)

Unveiling Deep Shadows: A Survey and Benchmark on Image and Video Shadow Detection, Removal, and Generation in the Deep Learning Era
by: Hu, Xiaowei, et al.
Published: (2024)

PP-Motion: Physical-Perceptual Fidelity Evaluation for Human Motion Generation
by: Zhao, Sihan, et al.
Published: (2025)

MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization
by: Fernandez-Lopez, Adriana, et al.
Published: (2024)

AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation
by: Choi, Jeongsoo, et al.
Published: (2025)

Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation
by: Li, Yunxin, et al.
Published: (2024)

Embedded Heterogeneous Attention Transformer for Cross-lingual Image Captioning
by: Song, Zijie, et al.
Published: (2023)

MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces
by: E, Shaojun, et al.
Published: (2025)

GAIA: Zero-shot Talking Avatar Generation
by: He, Tianyu, et al.
Published: (2023)

A Light-weight Transformer-based Self-supervised Matching Network for Heterogeneous Images
by: Zhang, Wang, et al.
Published: (2024)

ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding
by: Zhang, Zhenxing, et al.
Published: (2024)

Memories are One-to-Many Mapping Alleviators in Talking Face Generation
by: Tang, Anni, et al.
Published: (2022)

HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter
by: Zhao, Yumiao, et al.
Published: (2024)

Towards Training-free Multimodal Hate Localisation with Large Language Models
by: Sun, Yueming, et al.
Published: (2026)

ForestHG-Trace: Traceable Long-Horizon Ecological Reasoning over Large-Scale Forest Scenes
by: Cheng, Zihang, et al.
Published: (2026)

Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis
by: Chen, Shuang, et al.
Published: (2026)

URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection
by: Wang, Zhenyu, et al.
Published: (2026)

G3R: Generating Rich and Fine-grained mmWave Radar Data from 2D Videos for Generalized Gesture Recognition
by: Deng, Kaikai, et al.
Published: (2024)

Taming Modality Entanglement in Continual Audio-Visual Segmentation
by: Hong, Yuyang, et al.
Published: (2025)

A Multimodal Transformer for Live Streaming Highlight Prediction
by: Deng, Jiaxin, et al.
Published: (2024)

LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition
by: Hao, Bowen, et al.
Published: (2025)

Latent Reconstruction from Generated Data for Multimodal Misinformation Detection
by: Papadopoulos, Stefanos-Iordanis, et al.
Published: (2025)

AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control
by: Guo, Xinyue, et al.
Published: (2025)

Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing
by: Zhao, Pengcheng, et al.
Published: (2024)

AsyReC: A Multimodal Graph-based Framework for Spatio-Temporal Asymmetric Dyadic Relationship Classification
by: Tang, Wang, et al.
Published: (2025)

SpecFLASH: A Latent-Guided Semi-autoregressive Speculative Decoding Framework for Efficient Multimodal Generation
by: Wang, Zihua, et al.
Published: (2025)

Federated Prompt-Tuning with Heterogeneous and Incomplete Multimodal Client Data
by: Phung, Thu Hang, et al.
Published: (2026)

TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection
by: Yan, Zehong, et al.
Published: (2025)

MHAD: Multimodal Home Activity Dataset with Multi-Angle Videos and Synchronized Physiological Signals
by: Yu, Lei, et al.
Published: (2024)