:: Library Catalog

Bibliographic Details
Main Authors:	Zhao, Jinghua; Jia, Yuhang; Wang, Shiyao; Zhou, Jiaming; Wang, Hui; Qin, Yong
Format:	Preprint
Published:	2025
Subjects:	Multimedia; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2504.15066

Similar Items

EmotionTalk: An Interactive Chinese Multimodal Emotion Dataset With Rich Annotations
by: Sun, Haoqin, et al.
Published: (2025)

LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition
by: Kwak, Doyeop, et al.
Published: (2026)

Versatile audio-visual learning for emotion recognition
by: Goncalves, Lucas, et al.
Published: (2023)

An automatic mixing speech enhancement system for multi-track audio
by: Liu, Xiaojing, et al.
Published: (2024)

VCEMO: Multi-Modal Emotion Recognition for Chinese Voiceprints
by: Tang, Jinghua, et al.
Published: (2024)

SlideTailor: Personalized Presentation Slide Generation for Scientific Papers
by: Zeng, Wenzheng, et al.
Published: (2025)

Index-MSR: A high-efficiency multimodal fusion framework for speech recognition
by: Chen, Jinming, et al.
Published: (2025)

TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs
by: Xu, Pengju, et al.
Published: (2025)

ConCLVD: Controllable Chinese Landscape Video Generation via Diffusion Model
by: Liu, Dingming, et al.
Published: (2024)

LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition
by: Hao, Bowen, et al.
Published: (2025)

PoEmotion: Can AI Utilize Chinese Calligraphy to Express Emotion from Poems?
by: Liu, Tiancheng, et al.
Published: (2025)

A vector quantized masked autoencoder for audiovisual speech emotion recognition
by: Sadok, Samir, et al.
Published: (2023)

Enhancing Neural Adaptive Wireless Video Streaming via Lower-Layer Information Exposure and Online Tuning
by: Zhao, Lingzhi, et al.
Published: (2025)

AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines
by: Li, Cancan, et al.
Published: (2025)

RealBench: A Chinese Multi-image Understanding Benchmark Close to Real-world Scenarios
by: Zhao, Fei, et al.
Published: (2025)

Archiving Body Movements: Collective Generation of Chinese Calligraphy
by: Zhou, Aven Le, et al.
Published: (2023)

A Multimodal Transformer for Live Streaming Highlight Prediction
by: Deng, Jiaxin, et al.
Published: (2024)

Chinese for beginners [kit] = Ru men han yu / Sun Jun
by: Jun Sun

From Natural Alignment to Conditional Controllability in Multimodal Dialogue
by: Jin, Zeyu, et al.
Published: (2026)

Semi-supervised Chinese Poem-to-Painting Generation via Cycle-consistent Adversarial Networks
by: Lu, Zhengyang, et al.
Published: (2024)

Visual-based spatial audio generation system for multi-speaker environments
by: Liu, Xiaojing, et al.
Published: (2025)

GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation
by: Yang, Quanwei, et al.
Published: (2025)

Shorter Is Different: Characterizing the Dynamics of Short-Form Video Platforms
by: Chen, Zhilong, et al.
Published: (2024)

CPCLDETECTOR: Knowledge Enhancement and Alignment Selection for Chinese Patronizing and Condescending Language Detection
by: Yang, Jiaxun, et al.
Published: (2025)

MMPKUBase: A Comprehensive and High-quality Chinese Multi-modal Knowledge Graph
by: Yi, Xuan, et al.
Published: (2024)

AFL-Net: Integrating Audio, Facial, and Lip Modalities with a Two-step Cross-attention for Robust Speaker Diarization in the Wild
by: Yin, Yongkang, et al.
Published: (2023)

SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description
by: Jin, Zeyu, et al.
Published: (2024)

SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation
by: Ling, Zeyu, et al.
Published: (2025)

Facilitating Daily Practice in Intangible Cultural Heritage through Virtual Reality: A Case Study of Traditional Chinese Flower Arrangement
by: Wang, Yingna, et al.
Published: (2025)

CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning
by: He, Zheqi, et al.
Published: (2024)

Look, Listen and Segment: Towards Weakly Supervised Audio-visual Semantic Segmentation
by: Li, Chengzhi, et al.
Published: (2026)

Enhancing Emotion Recognition in Incomplete Data: A Novel Cross-Modal Alignment, Reconstruction, and Refinement Framework
by: Sun, Haoqin, et al.
Published: (2024)

Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes
by: Liu, Huijie, et al.
Published: (2025)

Chinese idioms [game] = Cheng yu gu shi

Towards Automatic Soccer Commentary Generation with Knowledge-Enhanced Visual Reasoning
by: Jin, Zeyu, et al.
Published: (2026)

Fully Automatic Content-Aware Tiling Pipeline for Pathology Whole Slide Images
by: Jabar, Falah, et al.
Published: (2024)

Resolution deficits drive simulator sickness and compromise reading performance in virtual environments
by: Wang, Jialin, et al.
Published: (2026)

DiffBrush: Just Painting the Art by Your Hands
by: Chu, Jiaming, et al.
Published: (2025)

MG-Former: A Transformer-Based Framework for Music-Driven 3D Conducting Gesture Generation
by: Qiu, Ke, et al.
Published: (2026)

Task Presentation and Human Perception in Interactive Video Retrieval
by: Willis, Nina, et al.
Published: (2024)