Library Catalog

Bibliographic Details
Main Authors:	Gan, Lindy; Huang, Yifan; Gao, Xiaoyang; Tan, Jiaming; Zhao, Fujun; Yang, Tao
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2501.16813

Similar Items

Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
by: Manakul, Potsawee, et al.
Published: (2026)

AudioEval: Automatic Dual-Perspective and Multi-Dimensional Evaluation of Text-to-Audio-Generation
by: Wang, Hui, et al.
Published: (2025)

ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection
by: Gu, Hao, et al.
Published: (2025)

Bridging Language Gaps in Audio-Text Retrieval
by: Yan, Zhiyong, et al.
Published: (2024)

MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model
by: Huang, Hsiao-Ying, et al.
Published: (2025)

Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment
by: Wang, Xuechen, et al.
Published: (2024)

SAFE-QAQ: End-to-End Slow-Thinking Audio-Text Fraud Detection via Reinforcement Learning
by: Wang, Peidong, et al.
Published: (2026)

AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation
by: Manakul, Potsawee, et al.
Published: (2025)

SceneFake: An Initial Dataset and Benchmarks for Scene Fake Audio Detection
by: Yi, Jiangyan, et al.
Published: (2022)

StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling
by: Wang, Hui, et al.
Published: (2025)

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
by: Xue, Jinlong, et al.
Published: (2024)

Probing Audio-Generation Capabilities of Text-Based Language Models
by: Anbazhagan, Arjun Prasaath, et al.
Published: (2025)

BATON: Aligning Text-to-Audio Model with Human Preference Feedback
by: Liao, Huan, et al.
Published: (2024)

Leveraging Large Language Models for Spontaneous Speech-Based Suicide Risk Detection
by: Gao, Yifan, et al.
Published: (2025)

From Contrast to Commonality: Audio Commonality Captioning for Enhanced Audio-Text Cross-modal Understanding in Multimodal LLMs
by: Jia, Yuhang, et al.
Published: (2025)

Unified Audio Event Detection
by: Jiang, Yidi, et al.
Published: (2024)

Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy
by: Zhao, Botao, et al.
Published: (2025)

MiMo-Audio: Audio Language Models are Few-Shot Learners
by: Core Team, et al.
Published: (2025)

TTA-Bench: A Comprehensive Benchmark for Evaluating Text-to-Audio Models
by: Wang, Hui, et al.
Published: (2025)

Audios Don't Lie: Multi-Frequency Channel Attention Mechanism for Audio Deepfake Detection
by: Feng, Yangguang
Published: (2024)

AudioLCM: Text-to-Audio Generation with Latent Consistency Models
by: Liu, Huadai, et al.
Published: (2024)

Audio-Guided Fusion Techniques for Multimodal Emotion Analysis
by: Shi, Pujin, et al.
Published: (2024)

ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling
by: Jiang, Yuxuan, et al.
Published: (2025)

Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models
by: Yang, Hao, et al.
Published: (2024)

Guided by the Plan: Enhancing Faithful Autoregressive Text-to-Audio Generation with Guided Decoding
by: Wang, Juncheng, et al.
Published: (2026)

WavMark: Watermarking for Audio Generation
by: Chen, Guangyu, et al.
Published: (2023)

A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks
by: Ishikawa, Takehiro, et al.
Published: (2026)

AUDDT: Audio Unified Deepfake Detection Benchmark Toolkit
by: Zhu, Yi, et al.
Published: (2025)

STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
by: Liu, Zihan, et al.
Published: (2025)

FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation
by: Liu, Huadai, et al.
Published: (2024)

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
by: Mei, Xinhao, et al.
Published: (2023)

ICLAD: In-Context Learning with Comparison-Guidance for Audio Deepfake Detection
by: Chou, Benjamin, et al.
Published: (2026)

Step-Audio 2 Technical Report
by: Wu, Boyong, et al.
Published: (2025)

Covo-Audio Technical Report
by: Wang, Wenfu, et al.
Published: (2026)

Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
by: Gao, Kuofeng, et al.
Published: (2024)

FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion
by: Chen, Shunian, et al.
Published: (2025)

Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition
by: Zhao, Ruoyu, et al.
Published: (2025)

On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
by: Tian, Jinchuan, et al.
Published: (2024)

Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model
by: Huang, Ailin, et al.
Published: (2025)

UniAudio: An Audio Foundation Model Toward Universal Audio Generation
by: Yang, Dongchao, et al.
Published: (2023)