| Main Authors: | Cheng, Hongye, Wang, Tianyu, Shi, Guangsi, Zhao, Zexing, Fu, Yanwei |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2503.01175 |
Similar Items
PersonaGest: Personalized Co-Speech Gesture Generation with Semantic-Guided Hierarchical Motion Representation
by: Zhao, Junchuan, et al.
Published: (2026)
EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation
by: Qi, Xingqun, et al.
Published: (2023)
I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue
by: Ghaleb, Esam, et al.
Published: (2025)
Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model
by: He, Xu, et al.
Published: (2024)
Towards Unified Co-Speech Gesture Generation via Hierarchical Implicit Periodicity Learning
by: Guo, Xin, et al.
Published: (2025)
DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder
by: Du, Chenpeng, et al.
Published: (2023)
ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context
by: Zheng, Sixiao, et al.
Published: (2024)
Talking Head Generation Driven by Speech-Related Facial Action Units and Audio- Based on Multimodal Representation Fusion
by: Chen, Sen, et al.
Published: (2022)
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
by: Zhong, Zhisheng, et al.
Published: (2024)
Intelligent Director: An Automatic Framework for Dynamic Visual Composition using ChatGPT
by: Zheng, Sixiao, et al.
Published: (2024)
MagicAnime: A Hierarchically Annotated, Multimodal and Multitasking Dataset with Benchmarks for Cartoon Animation Generation
by: Xu, Shuolin, et al.
Published: (2025)
Co-Reinforcement Learning for Unified Multimodal Understanding and Generation
by: Jiang, Jingjing, et al.
Published: (2025)
Unveiling Deep Shadows: A Survey and Benchmark on Image and Video Shadow Detection, Removal, and Generation in the Deep Learning Era
by: Hu, Xiaowei, et al.
Published: (2024)
PP-Motion: Physical-Perceptual Fidelity Evaluation for Human Motion Generation
by: Zhao, Sihan, et al.
Published: (2025)
MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization
by: Fernandez-Lopez, Adriana, et al.
Published: (2024)
AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation
by: Choi, Jeongsoo, et al.
Published: (2025)
Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation
by: Li, Yunxin, et al.
Published: (2024)
Embedded Heterogeneous Attention Transformer for Cross-lingual Image Captioning
by: Song, Zijie, et al.
Published: (2023)
MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces
by: E, Shaojun, et al.
Published: (2025)
GAIA: Zero-shot Talking Avatar Generation
by: He, Tianyu, et al.
Published: (2023)
A Light-weight Transformer-based Self-supervised Matching Network for Heterogeneous Images
by: Zhang, Wang, et al.
Published: (2024)
ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding
by: Zhang, Zhenxing, et al.
Published: (2024)
Memories are One-to-Many Mapping Alleviators in Talking Face Generation
by: Tang, Anni, et al.
Published: (2022)
HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter
by: Zhao, Yumiao, et al.
Published: (2024)
Towards Training-free Multimodal Hate Localisation with Large Language Models
by: Sun, Yueming, et al.
Published: (2026)
ForestHG-Trace: Traceable Long-Horizon Ecological Reasoning over Large-Scale Forest Scenes
by: Cheng, Zihang, et al.
Published: (2026)
Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis
by: Chen, Shuang, et al.
Published: (2026)
URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection
by: Wang, Zhenyu, et al.
Published: (2026)
G3R: Generating Rich and Fine-grained mmWave Radar Data from 2D Videos for Generalized Gesture Recognition
by: Deng, Kaikai, et al.
Published: (2024)
Taming Modality Entanglement in Continual Audio-Visual Segmentation
by: Hong, Yuyang, et al.
Published: (2025)
A Multimodal Transformer for Live Streaming Highlight Prediction
by: Deng, Jiaxin, et al.
Published: (2024)
LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition
by: Hao, Bowen, et al.
Published: (2025)
Latent Reconstruction from Generated Data for Multimodal Misinformation Detection
by: Papadopoulos, Stefanos-Iordanis, et al.
Published: (2025)
AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control
by: Guo, Xinyue, et al.
Published: (2025)
Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing
by: Zhao, Pengcheng, et al.
Published: (2024)
AsyReC: A Multimodal Graph-based Framework for Spatio-Temporal Asymmetric Dyadic Relationship Classification
by: Tang, Wang, et al.
Published: (2025)
SpecFLASH: A Latent-Guided Semi-autoregressive Speculative Decoding Framework for Efficient Multimodal Generation
by: Wang, Zihua, et al.
Published: (2025)
Federated Prompt-Tuning with Heterogeneous and Incomplete Multimodal Client Data
by: Phung, Thu Hang, et al.
Published: (2026)
TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection
by: Yan, Zehong, et al.
Published: (2025)
MHAD: Multimodal Home Activity Dataset with Multi-Angle Videos and Synchronized Physiological Signals
by: Yu, Lei, et al.
Published: (2024)