| Main Author: | Kafentzis, George P. |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2401.01255 |
Similar Items
AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining
by: Liu, Haohe, et al.
Published: (2023)
Learning Temporal Resolution in Spectrogram for Audio Classification
by: Liu, Haohe, et al.
Published: (2022)
SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
by: Liu, Haohe, et al.
Published: (2024)
Recent Advances in Discrete Speech Tokens: A Review
by: Guo, Yiwei, et al.
Published: (2025)
Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion
by: Lim, DongHoon, et al.
Published: (2025)
Analyzable Chain-of-Musical-Thought Prompting for High-Fidelity Music Generation
by: Lam, Max W. Y., et al.
Published: (2025)
StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech
by: Lou, Haowei, et al.
Published: (2024)
FoVNet: Configurable Field-of-View Speech Enhancement with Low Computation and Distortion for Smart Glasses
by: Xu, Zhongweiyang, et al.
Published: (2024)
Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech?
by: Kumar, Lokesh, et al.
Published: (2026)
Embedding Alignment in Code Generation for Audio
by: Kouteili, Sam, et al.
Published: (2025)
Stimulus Modality Matters: Impact of Perceptual Evaluations from Different Modalities on Speech Emotion Recognition System Performance
by: Chou, Huang-Cheng, et al.
Published: (2024)
Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition
by: Liu, Rui, et al.
Published: (2025)
Analyzing the Impact of Splicing Artifacts in Partially Fake Speech Signals
by: Negroni, Viola, et al.
Published: (2024)
AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models
by: Li, Wenyu, et al.
Published: (2025)
OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
by: Li, Maomao, et al.
Published: (2026)
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
by: Ji, Shengpeng, et al.
Published: (2024)
Speak the Art: A Direct Speech to Image Generation Framework
by: Saeed, Mariam, et al.
Published: (2025)
Tuberculosis Screening from Cough Audio: Baseline Models, Clinical Variables, and Uncertainty Quantification
by: Kafentzis, George P., et al.
Published: (2026)
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation
by: Jiang, Yuxuan, et al.
Published: (2025)
SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering
by: Yang, Zhe, et al.
Published: (2024)
Retrieval-Augmented Text-to-Audio Generation
by: Yuan, Yi, et al.
Published: (2023)
Neural Style Transfer for Audio Spectograms
by: Verma, Prateek, et al.
Published: (2018)
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
by: Choi, Jeongsoo, et al.
Published: (2023)
Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation
by: Zhang, Kang, et al.
Published: (2025)
MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition
by: Pan, Yu, et al.
Published: (2023)
ArrayDPS: Unsupervised Blind Speech Separation with a Diffusion Prior
by: Xu, Zhongweiyang, et al.
Published: (2025)
Integrating IP Broadcasting with Audio Tags: Workflow and Challenges
by: Burchett-Vass, Rhys, et al.
Published: (2024)
Unveiling Visual Biases in Audio-Visual Localization Benchmarks
by: Chen, Liangyu, et al.
Published: (2024)
PIAST: A Multimodal Piano Dataset with Audio, Symbolic and Text
by: Bang, Hayeon, et al.
Published: (2024)
Towards Generating Diverse Audio Captions via Adversarial Training
by: Mei, Xinhao, et al.
Published: (2022)
Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio
by: Chen, Guangke, et al.
Published: (2025)
Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition
by: Wu, Linzhi, et al.
Published: (2026)
Leveraging Pre-Trained Autoencoders for Interpretable Prototype Learning of Music Audio
by: Alonso-Jiménez, Pablo, et al.
Published: (2024)
DeepFake Doctor: Diagnosing and Treating Audio-Video Fake Detection
by: Klemt, Marcel, et al.
Published: (2025)
LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport
by: Rho, Kyeongha, et al.
Published: (2025)
On the Design of Diffusion-based Neural Speech Codecs
by: Foti, Pietro, et al.
Published: (2025)
Leveraging Pre-trained AudioLDM for Sound Generation: A Benchmark Study
by: Yuan, Yi, et al.
Published: (2023)
Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms
by: Wen, Penghui, et al.
Published: (2023)
Towards Assessing Data Replication in Music Generation with Music Similarity Metrics on Raw Audio
by: Batlle-Roca, Roser, et al.
Published: (2024)
Zero-Shot Cognitive Impairment Detection from Speech Using AudioLLM
by: Shahin, Mostafa, et al.
Published: (2025)