| Main Authors: | Zhao, Jinghua, Jia, Yuhang, Wang, Shiyao, Zhou, Jiaming, Wang, Hui, Qin, Yong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2504.15066 |
Similar Items
EmotionTalk: An Interactive Chinese Multimodal Emotion Dataset With Rich Annotations
by: Sun, Haoqin, et al.
Published: (2025)
LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition
by: Kwak, Doyeop, et al.
Published: (2026)
Versatile audio-visual learning for emotion recognition
by: Goncalves, Lucas, et al.
Published: (2023)
An automatic mixing speech enhancement system for multi-track audio
by: Liu, Xiaojing, et al.
Published: (2024)
VCEMO: Multi-Modal Emotion Recognition for Chinese Voiceprints
by: Tang, Jinghua, et al.
Published: (2024)
SlideTailor: Personalized Presentation Slide Generation for Scientific Papers
by: Zeng, Wenzheng, et al.
Published: (2025)
Index-MSR: A high-efficiency multimodal fusion framework for speech recognition
by: Chen, Jinming, et al.
Published: (2025)
TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs
by: Xu, Pengju, et al.
Published: (2025)
ConCLVD: Controllable Chinese Landscape Video Generation via Diffusion Model
by: Liu, Dingming, et al.
Published: (2024)
LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition
by: Hao, Bowen, et al.
Published: (2025)
PoEmotion: Can AI Utilize Chinese Calligraphy to Express Emotion from Poems?
by: Liu, Tiancheng, et al.
Published: (2025)
A vector quantized masked autoencoder for audiovisual speech emotion recognition
by: Sadok, Samir, et al.
Published: (2023)
Enhancing Neural Adaptive Wireless Video Streaming via Lower-Layer Information Exposure and Online Tuning
by: Zhao, Lingzhi, et al.
Published: (2025)
AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines
by: Li, Cancan, et al.
Published: (2025)
RealBench: A Chinese Multi-image Understanding Benchmark Close to Real-world Scenarios
by: Zhao, Fei, et al.
Published: (2025)
Archiving Body Movements: Collective Generation of Chinese Calligraphy
by: Zhou, Aven Le, et al.
Published: (2023)
A Multimodal Transformer for Live Streaming Highlight Prediction
by: Deng, Jiaxin, et al.
Published: (2024)
Chinese for beginners [conjunto] = Ru men han yu / Sun Jun
by: Jun Sun
From Natural Alignment to Conditional Controllability in Multimodal Dialogue
by: Jin, Zeyu, et al.
Published: (2026)
Semi-supervised Chinese Poem-to-Painting Generation via Cycle-consistent Adversarial Networks
by: Lu, Zhengyang, et al.
Published: (2024)
Visual-based spatial audio generation system for multi-speaker environments
by: Liu, Xiaojing, et al.
Published: (2025)
GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation
by: Yang, Quanwei, et al.
Published: (2025)
Shorter Is Different: Characterizing the Dynamics of Short-Form Video Platforms
by: Chen, Zhilong, et al.
Published: (2024)
CPCLDETECTOR: Knowledge Enhancement and Alignment Selection for Chinese Patronizing and Condescending Language Detection
by: Yang, Jiaxun, et al.
Published: (2025)
MMPKUBase: A Comprehensive and High-quality Chinese Multi-modal Knowledge Graph
by: Yi, Xuan, et al.
Published: (2024)
AFL-Net: Integrating Audio, Facial, and Lip Modalities with a Two-step Cross-attention for Robust Speaker Diarization in the Wild
by: Yin, Yongkang, et al.
Published: (2023)
SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description
by: Jin, Zeyu, et al.
Published: (2024)
SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation
by: Ling, Zeyu, et al.
Published: (2025)
Facilitating Daily Practice in Intangible Cultural Heritage through Virtual Reality: A Case Study of Traditional Chinese Flower Arrangement
by: Wang, Yingna, et al.
Published: (2025)
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning
by: He, Zheqi, et al.
Published: (2024)
Look, Listen and Segment: Towards Weakly Supervised Audio-visual Semantic Segmentation
by: Li, Chengzhi, et al.
Published: (2026)
Enhancing Emotion Recognition in Incomplete Data: A Novel Cross-Modal Alignment, Reconstruction, and Refinement Framework
by: Sun, Haoqin, et al.
Published: (2024)
Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes
by: Liu, Huijie, et al.
Published: (2025)
Chinese idioms [juego] = Cheng yu gu shi
Towards Automatic Soccer Commentary Generation with Knowledge-Enhanced Visual Reasoning
by: Jin, Zeyu, et al.
Published: (2026)
Fully Automatic Content-Aware Tiling Pipeline for Pathology Whole Slide Images
by: Jabar, Falah, et al.
Published: (2024)
Resolution deficits drive simulator sickness and compromise reading performance in virtual environments
by: Wang, Jialin, et al.
Published: (2026)
DiffBrush: Just Painting the Art by Your Hands
by: Chu, Jiaming, et al.
Published: (2025)
MG-Former: A Transformer-Based Framework for Music-Driven 3D Conducting Gesture Generation
by: Qiu, Ke, et al.
Published: (2026)
Task Presentation and Human Perception in Interactive Video Retrieval
by: Willis, Nina, et al.
Published: (2024)