| Main Authors: | Han, Zhiyuan, Zhu, Beier, Xu, Yanlong, Song, Peipei, Yang, Xun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2508.01181 |
Similar Items
SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering
by: Melechovsky, Jan, et al.
Published: (2025)
Coarse-to-Fine Proposal Refinement Framework for Audio Temporal Forgery Detection and Localization
by: Wu, Junyan, et al.
Published: (2024)
Rethinking Masking Strategies for Masked Prediction-based Audio Self-supervised Learning
by: Niizumi, Daisuke, et al.
Published: (2026)
A Survey on Multimodal Music Emotion Recognition
by: Liyanarachchi, Rashini, et al.
Published: (2025)
Multimodal Emotion Recognition from Raw Audio with Sinc-convolution
by: Zhang, Xiaohui, et al.
Published: (2024)
Can Sound Replace Vision in LLaVA With Token Substitution?
by: Vosoughi, Ali, et al.
Published: (2025)
EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing
by: Cong, Gaoxiang, et al.
Published: (2024)
Emotion-Aligned Contrastive Learning Between Images and Music
by: Stewart, Shanti, et al.
Published: (2023)
Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment
by: Roy, Abhinaba, et al.
Published: (2025)
A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction
by: Cheripally, Sowmya
Published: (2024)
Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Immersive Audiobook Generation
by: Rong, Yan, et al.
Published: (2025)
M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation
by: Niizumi, Daisuke, et al.
Published: (2024)
AI-based Drone Assisted Human Rescue in Disaster Environments: Challenges and Opportunities
by: Papyan, Narek, et al.
Published: (2024)
ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
by: Zhang, Yu, et al.
Published: (2025)
FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation
by: Behera, Swarup Ranjan, et al.
Published: (2024)
Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
by: Vosoughi, Ali, et al.
Published: (2025)
Revise, Reason, and Recognize: LLM-Based Emotion Recognition via Emotion-Specific Prompts and ASR Error Correction
by: Li, Yuanchao, et al.
Published: (2024)
ChordSync: Conformer-Based Alignment of Chord Annotations to Music Audio
by: Poltronieri, Andrea, et al.
Published: (2024)
Multimodal Emotion Coupling via Speech-to-Facial and Bodily Gestures in Dyadic Interaction
by: Herbuela, Von Ralph Dane Marquez, et al.
Published: (2025)
Addressing Emotion Bias in Music Emotion Recognition and Generation with Frechet Audio Distance
by: Li, Yuanchao, et al.
Published: (2024)
MATER: Multi-level Acoustic and Textual Emotion Representation for Interpretable Speech Emotion Recognition
by: Jon, Hyo Jin, et al.
Published: (2025)
Efficient Feature Extraction and Late Fusion Strategy for Audiovisual Emotional Mimicry Intensity Estimation
by: Yu, Jun, et al.
Published: (2024)
Faked Speech Detection with Zero Prior Knowledge
by: Ajmi, Sahar Al, et al.
Published: (2022)
Sync-TVA: A Graph-Attention Framework for Multimodal Emotion Recognition with Cross-Modal Fusion
by: Deng, Zeyu, et al.
Published: (2025)
Revisiting SSL for sound event detection: complementary fusion and adaptive post-processing
by: Cui, Hanfang, et al.
Published: (2025)
Non-Verbal Vocalisations and their Challenges: Emotion, Privacy, Sparseness, and Real Life
by: Batliner, Anton, et al.
Published: (2025)
MusFlow: Multimodal Music Generation via Conditional Flow Matching
by: Song, Jiahao, et al.
Published: (2025)
MusER: Musical Element-Based Regularization for Generating Symbolic Music with Emotion
by: Ji, Shulei, et al.
Published: (2023)
Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better
by: Ge, Mengying, et al.
Published: (2024)
MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction
by: He, Jiajun, et al.
Published: (2024)
Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
by: Zhao, Jinzheng, et al.
Published: (2023)
SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition
by: Cheng, Zebang, et al.
Published: (2024)
MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition
by: Pan, Yu, et al.
Published: (2023)
CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection
by: Zang, Yongyi, et al.
Published: (2024)
Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model
by: Ren, Yong, et al.
Published: (2025)
Emotion-Aware Speech Generation with Character-Specific Voices for Comics
by: Qian, Zhiwen, et al.
Published: (2025)
Stimulus Modality Matters: Impact of Perceptual Evaluations from Different Modalities on Speech Emotion Recognition System Performance
by: Chou, Huang-Cheng, et al.
Published: (2024)
SeQuiFi: Mitigating Catastrophic Forgetting in Speech Emotion Recognition with Sequential Class-Finetuning
by: Jain, Sarthak, et al.
Published: (2024)
MART: Learning Hierarchical Music Audio Representations with Part-Whole Transformer
by: Yao, Dong, et al.
Published: (2023)
VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs
by: Zhang, Hezhao, et al.
Published: (2026)