Bibliographic Details
Main Authors:	Zhang, Bingyuan; Zhang, Xulong; Cheng, Ning; Yu, Jun; Xiao, Jing; Wang, Jianzong
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2401.08049

Similar Items

ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis
by: Tang, Haobin, et al.
Published: (2024)

RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis
by: Shi, Haoxiang, et al.
Published: (2024)

EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization
by: Wang, Jianzong, et al.
Published: (2024)

Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation
by: Deng, Yimin, et al.
Published: (2024)

CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition
by: Wang, Jianzong, et al.
Published: (2024)

DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation
by: Wang, Jianzong, et al.
Published: (2023)

EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning
by: Liang, Ziqi, et al.
Published: (2024)

Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval
by: Deng, Yimin, et al.
Published: (2024)

ESARM: 3D Emotional Speech-to-Animation via Reward Model from Automatically-Ranked Demonstrations
by: Zhang, Xulong, et al.
Published: (2024)

Attention-weighted Centered Kernel Alignment for Knowledge Distillation in Large Audio-Language Models Applied to Speech Emotion Recognition
by: Yang, Qingran, et al.
Published: (2026)

MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion
by: Li, Pengcheng, et al.
Published: (2024)

CycleFlow: Leveraging Cycle Consistency in Flow Matching for Speaker Style Adaptation
by: Liang, Ziqi, et al.
Published: (2025)

Improving Controllability and Editability for Pretrained Text-to-Music Generation Models
by: Zhang, Yixiao
Published: (2024)

EmoFake: An Initial Dataset for Emotion Fake Audio Detection
by: Zhao, Yan, et al.
Published: (2022)

EMO-RL: Emotion-Rule-Based Reinforcement Learning Enhanced Audio-Language Model for Generalized Speech Emotion Recognition
by: Li, Pengcheng, et al.
Published: (2025)

EmoQ: Speech Emotion Recognition via Speech-Aware Q-Former and Large Language Model
by: Yang, Yiqing, et al.
Published: (2025)

SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR
by: Fan, Zhiyun, et al.
Published: (2024)

Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization
by: Gao, Xiaoxue, et al.
Published: (2024)

EmoOmni: Bridging Emotional Understanding and Expression in Omni-Modal LLMs
by: Tian, Wenjie, et al.
Published: (2026)

IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding
by: Li, Pengcheng, et al.
Published: (2024)

FastTalker: Jointly Generating Speech and Conversational Gestures from Text
by: Guo, Zixin, et al.
Published: (2024)

Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy
by: Zhao, Botao, et al.
Published: (2025)

A Unified Neural Codec Language Model for Selective Editable Text to Speech Generation
by: Pei, Hanchen, et al.
Published: (2026)

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
by: Ye, Zhen, et al.
Published: (2026)

Semi-Supervised Self-Learning Enhanced Music Emotion Recognition
by: Sun, Yifu, et al.
Published: (2024)

EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech
by: Cho, Deok-Hyeon, et al.
Published: (2024)

Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling
by: Cao, Junjie, et al.
Published: (2025)

EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing
by: Cong, Gaoxiang, et al.
Published: (2024)

EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations
by: Bian, Weizhen, et al.
Published: (2024)

EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector
by: Cho, Deok-Hyeon, et al.
Published: (2024)

EmoFormer: A Text-Independent Speech Emotion Recognition using a Hybrid Transformer-CNN model
by: Hasan, Rashedul, et al.
Published: (2025)

Scaling Multi-Talker ASR with Speaker-Agnostic Activity Streams
by: He, Xiluo, et al.
Published: (2025)

EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering
by: Xie, Tianxin, et al.
Published: (2025)

Warm Chat: Diffuse Emotion-aware Interactive Talking Head Avatar with Tree-Structured Guidance
by: Yang, Haijie, et al.
Published: (2025)

Rare Word Recognition and Translation Without Fine-Tuning via Task Vector in Speech Models
by: Jing, Ruihao, et al.
Published: (2025)

Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis
by: Hu, Yifan, et al.
Published: (2025)

EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion
by: Gudmalwar, Ashishkumar, et al.
Published: (2024)

VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs
by: Zhang, Hezhao, et al.
Published: (2026)

EmoTech: A Multi-modal Speech Emotion Recognition Using Multi-source Low-level Information with Hybrid Recurrent Network
by: Avro, Shamin Bin Habib, et al.
Published: (2025)

DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech
by: Cho, Deok-Hyeon, et al.
Published: (2025)