:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Ai, Zhiqi, Bao, Meixuan, Chen, Zhiyong, Yang, Zhi, Li, Xinnuo, Xu, Shugong
Format:	Preprint
Published:	2025
Subjects:	Sound Artificial Intelligence Computer Vision and Pattern Recognition Multimedia
Online Access:	https://arxiv.org/abs/2505.21445
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

Enhancing Open-Set Speaker Identification through Rapid Tuning with Speaker Reciprocal Points and Negative Sample
by: Chen, Zhiyong, et al.
Published: (2024)

StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis
by: Chen, Zhiyong, et al.
Published: (2024)

AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines
by: Li, Cancan, et al.
Published: (2025)

VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling
by: Marmor, Yanir, et al.
Published: (2026)

EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities
by: Qian, Xinyuan, et al.
Published: (2026)

Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
by: Zhao, Jinzheng, et al.
Published: (2023)

Effective User-defined Keyword Spotting with Dual-stage Matching, Multi-modal Enrollment, and Continual Adaptation
by: Ai, Zhiqi, et al.
Published: (2026)

Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment
by: Rinaldi, Ivan, et al.
Published: (2026)

Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis
by: Huang, Zhiqi, et al.
Published: (2024)

LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition
by: Kwak, Doyeop, et al.
Published: (2026)

Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
by: Sun, Luoyi, et al.
Published: (2023)

Robust Active Speaker Detection in Noisy Environments
by: Vasireddy, Siva Sai Nagender, et al.
Published: (2024)

MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting
by: Ai, Zhiqi, et al.
Published: (2024)

Ensembling Synchronisation-based and Face-Voice Association Paradigms for Robust Active Speaker Detection in Egocentric Recordings
by: Clarke, Jason, et al.
Published: (2025)

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV
by: Liu, Tengfei, et al.
Published: (2026)

MeMo: Attentional Momentum for Real-time Audio-visual Speaker Extraction under Impaired Visual Conditions
by: Li, Junjie, et al.
Published: (2025)

Jamendo-QA: A Large-Scale Music Question Answering Dataset
by: Koh, Junyoung, et al.
Published: (2025)

ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
by: Yang, Jianxuan, et al.
Published: (2026)

OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset
by: Park, Jeongkyun, et al.
Published: (2023)

MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation
by: Gupta, Prerit, et al.
Published: (2025)

Tempo as the Stable Cue: Hierarchical Mixture of Tempo and Beat Experts for Music to 3D Dance Generation
by: Lyu, Guangtao, et al.
Published: (2025)

AudioStory: Generating Long-Form Narrative Audio with Large Language Models
by: Guo, Yuxin, et al.
Published: (2025)

MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time Scenarios with Impaired Visual Cues
by: Li, Junjie, et al.
Published: (2024)

AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control
by: Guo, Xinyue, et al.
Published: (2025)

MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
by: Yang, Jianxuan, et al.
Published: (2025)

UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
by: Li, Hebeizi, et al.
Published: (2026)

Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness
by: Yang, Sicheng, et al.
Published: (2024)

Aligned Better, Listen Better for Audio-Visual Large Language Models
by: Guo, Yuxin, et al.
Published: (2025)

MMED: A Multimodal Micro-Expression Dataset based on Audio-Visual Fusion
by: Wang, Junbo, et al.
Published: (2025)

READ-Net: Clarifying Emotional Ambiguity via Adaptive Feature Recalibration for Audio-Visual Depression Detection
by: Chen, Chenglizhao, et al.
Published: (2026)

ReactMotion: Generating Reactive Listener Motions from Speaker Utterance
by: Luo, Cheng, et al.
Published: (2026)

Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal
by: Xu, Weihan, et al.
Published: (2025)

VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs
by: Zhang, Hezhao, et al.
Published: (2026)

AMuSE: Adaptive Multimodal Analysis for Speaker Emotion Recognition in Group Conversations
by: Devulapally, Naresh Kumar, et al.
Published: (2024)

Noise-Conditioned Mixture-of-Experts Framework for Robust Speaker Verification
by: Gu, Bin, et al.
Published: (2025)

G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition
by: Peng, Jing, et al.
Published: (2026)

MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation
by: Wu, Shih-Lun, et al.
Published: (2025)

DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility
by: Liu, Yifan, et al.
Published: (2025)

MusicWeaver: Composer-Style Structural Editing and Minute-Scale Coherent Music Generation
by: Wang, Xuanchen, et al.
Published: (2025)

Face-Voice Association for Audiovisual Active Speaker Detection in Egocentric Recordings
by: Clarke, Jason, et al.
Published: (2025)