Library Catalog

Bibliographic Details
Main Authors:	Han, Zhiyuan; Zhu, Beier; Xu, Yanlong; Song, Peipei; Yang, Xun
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence; Computer Vision and Pattern Recognition; Multimedia; Sound; Audio and Speech Processing; 68; I.2.10
Online Access:	https://arxiv.org/abs/2508.01181

Similar Items

SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering
by: Melechovsky, Jan, et al.
Published: (2025)

Coarse-to-Fine Proposal Refinement Framework for Audio Temporal Forgery Detection and Localization
by: Wu, Junyan, et al.
Published: (2024)

Rethinking Masking Strategies for Masked Prediction-based Audio Self-supervised Learning
by: Niizumi, Daisuke, et al.
Published: (2026)

A Survey on Multimodal Music Emotion Recognition
by: Liyanarachchi, Rashini, et al.
Published: (2025)

Multimodal Emotion Recognition from Raw Audio with Sinc-convolution
by: Zhang, Xiaohui, et al.
Published: (2024)

Can Sound Replace Vision in LLaVA With Token Substitution?
by: Vosoughi, Ali, et al.
Published: (2025)

EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing
by: Cong, Gaoxiang, et al.
Published: (2024)

Emotion-Aligned Contrastive Learning Between Images and Music
by: Stewart, Shanti, et al.
Published: (2023)

Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment
by: Roy, Abhinaba, et al.
Published: (2025)

A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction
by: Cheripally, Sowmya
Published: (2024)

Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Immersive Audiobook Generation
by: Rong, Yan, et al.
Published: (2025)

M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation
by: Niizumi, Daisuke, et al.
Published: (2024)

AI-based Drone Assisted Human Rescue in Disaster Environments: Challenges and Opportunities
by: Papyan, Narek, et al.
Published: (2024)

ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
by: Zhang, Yu, et al.
Published: (2025)

FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation
by: Behera, Swarup Ranjan, et al.
Published: (2024)

Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
by: Vosoughi, Ali, et al.
Published: (2025)

Revise, Reason, and Recognize: LLM-Based Emotion Recognition via Emotion-Specific Prompts and ASR Error Correction
by: Li, Yuanchao, et al.
Published: (2024)

ChordSync: Conformer-Based Alignment of Chord Annotations to Music Audio
by: Poltronieri, Andrea, et al.
Published: (2024)

Multimodal Emotion Coupling via Speech-to-Facial and Bodily Gestures in Dyadic Interaction
by: Herbuela, Von Ralph Dane Marquez, et al.
Published: (2025)

Addressing Emotion Bias in Music Emotion Recognition and Generation with Frechet Audio Distance
by: Li, Yuanchao, et al.
Published: (2024)

MATER: Multi-level Acoustic and Textual Emotion Representation for Interpretable Speech Emotion Recognition
by: Jon, Hyo Jin, et al.
Published: (2025)

Efficient Feature Extraction and Late Fusion Strategy for Audiovisual Emotional Mimicry Intensity Estimation
by: Yu, Jun, et al.
Published: (2024)

Faked Speech Detection with Zero Prior Knowledge
by: Ajmi, Sahar Al, et al.
Published: (2022)

Sync-TVA: A Graph-Attention Framework for Multimodal Emotion Recognition with Cross-Modal Fusion
by: Deng, Zeyu, et al.
Published: (2025)

Revisiting SSL for sound event detection: complementary fusion and adaptive post-processing
by: Cui, Hanfang, et al.
Published: (2025)

Non-Verbal Vocalisations and their Challenges: Emotion, Privacy, Sparseness, and Real Life
by: Batliner, Anton, et al.
Published: (2025)

MusFlow: Multimodal Music Generation via Conditional Flow Matching
by: Song, Jiahao, et al.
Published: (2025)

MusER: Musical Element-Based Regularization for Generating Symbolic Music with Emotion
by: Ji, Shulei, et al.
Published: (2023)

Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better
by: Ge, Mengying, et al.
Published: (2024)

MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction
by: He, Jiajun, et al.
Published: (2024)

Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
by: Zhao, Jinzheng, et al.
Published: (2023)

SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition
by: Cheng, Zebang, et al.
Published: (2024)

MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition
by: Pan, Yu, et al.
Published: (2023)

CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection
by: Zang, Yongyi, et al.
Published: (2024)

Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model
by: Ren, Yong, et al.
Published: (2025)

Emotion-Aware Speech Generation with Character-Specific Voices for Comics
by: Qian, Zhiwen, et al.
Published: (2025)

Stimulus Modality Matters: Impact of Perceptual Evaluations from Different Modalities on Speech Emotion Recognition System Performance
by: Chou, Huang-Cheng, et al.
Published: (2024)

SeQuiFi: Mitigating Catastrophic Forgetting in Speech Emotion Recognition with Sequential Class-Finetuning
by: Jain, Sarthak, et al.
Published: (2024)

MART: Learning Hierarchical Music Audio Representations with Part-Whole Transformer
by: Yao, Dong, et al.
Published: (2023)

VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs
by: Zhang, Hezhao, et al.
Published: (2026)