| Main Authors: | Gan, Lindy, Huang, Yifan, Gao, Xiaoyang, Tan, Jiaming, Zhao, Fujun, Yang, Tao |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2501.16813 |
Similar Items
Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
by: Manakul, Potsawee, et al.
Published: (2026)
AudioEval: Automatic Dual-Perspective and Multi-Dimensional Evaluation of Text-to-Audio-Generation
by: Wang, Hui, et al.
Published: (2025)
ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection
by: Gu, Hao, et al.
Published: (2025)
Bridging Language Gaps in Audio-Text Retrieval
by: Yan, Zhiyong, et al.
Published: (2024)
MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model
by: Huang, Hsiao-Ying, et al.
Published: (2025)
Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment
by: Wang, Xuechen, et al.
Published: (2024)
SAFE-QAQ: End-to-End Slow-Thinking Audio-Text Fraud Detection via Reinforcement Learning
by: Wang, Peidong, et al.
Published: (2026)
AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation
by: Manakul, Potsawee, et al.
Published: (2025)
SceneFake: An Initial Dataset and Benchmarks for Scene Fake Audio Detection
by: Yi, Jiangyan, et al.
Published: (2022)
StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling
by: Wang, Hui, et al.
Published: (2025)
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
by: Xue, Jinlong, et al.
Published: (2024)
Probing Audio-Generation Capabilities of Text-Based Language Models
by: Anbazhagan, Arjun Prasaath, et al.
Published: (2025)
BATON: Aligning Text-to-Audio Model with Human Preference Feedback
by: Liao, Huan, et al.
Published: (2024)
Leveraging Large Language Models for Spontaneous Speech-Based Suicide Risk Detection
by: Gao, Yifan, et al.
Published: (2025)
From Contrast to Commonality: Audio Commonality Captioning for Enhanced Audio-Text Cross-modal Understanding in Multimodal LLMs
by: Jia, Yuhang, et al.
Published: (2025)
Unified Audio Event Detection
by: Jiang, Yidi, et al.
Published: (2024)
Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy
by: Zhao, Botao, et al.
Published: (2025)
MiMo-Audio: Audio Language Models are Few-Shot Learners
by: Core Team, et al.
Published: (2025)
TTA-Bench: A Comprehensive Benchmark for Evaluating Text-to-Audio Models
by: Wang, Hui, et al.
Published: (2025)
Audios Don't Lie: Multi-Frequency Channel Attention Mechanism for Audio Deepfake Detection
by: Feng, Yangguang
Published: (2024)
AudioLCM: Text-to-Audio Generation with Latent Consistency Models
by: Liu, Huadai, et al.
Published: (2024)
Audio-Guided Fusion Techniques for Multimodal Emotion Analysis
by: Shi, Pujin, et al.
Published: (2024)
ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling
by: Jiang, Yuxuan, et al.
Published: (2025)
Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models
by: Yang, Hao, et al.
Published: (2024)
Guided by the Plan: Enhancing Faithful Autoregressive Text-to-Audio Generation with Guided Decoding
by: Wang, Juncheng, et al.
Published: (2026)
WavMark: Watermarking for Audio Generation
by: Chen, Guangyu, et al.
Published: (2023)
A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks
by: Ishikawa, Takehiro, et al.
Published: (2026)
AUDDT: Audio Unified Deepfake Detection Benchmark Toolkit
by: Zhu, Yi, et al.
Published: (2025)
STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
by: Liu, Zihan, et al.
Published: (2025)
FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation
by: Liu, Huadai, et al.
Published: (2024)
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
by: Mei, Xinhao, et al.
Published: (2023)
ICLAD: In-Context Learning with Comparison-Guidance for Audio Deepfake Detection
by: Chou, Benjamin, et al.
Published: (2026)
Step-Audio 2 Technical Report
by: Wu, Boyong, et al.
Published: (2025)
Covo-Audio Technical Report
by: Wang, Wenfu, et al.
Published: (2026)
Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
by: Gao, Kuofeng, et al.
Published: (2024)
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion
by: Chen, Shunian, et al.
Published: (2025)
Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition
by: Zhao, Ruoyu, et al.
Published: (2025)
On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
by: Tian, Jinchuan, et al.
Published: (2024)
Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model
by: Huang, Ailin, et al.
Published: (2025)
UniAudio: An Audio Foundation Model Toward Universal Audio Generation
by: Yang, Dongchao, et al.
Published: (2023)