| Main Authors: | Géré, Léo, Rigaux, Philippe, Audebert, Nicolas |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2407.17536 |
Similar Items
musif: a Python package for symbolic music feature extraction
by: Llorens, Ana, et al.
Published: (2023)
Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions
by: Yuan, Yi, et al.
Published: (2024)
Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer
by: Li, Jizhen, et al.
Published: (2024)
MidiTok Visualizer: a tool for visualization and analysis of tokenized MIDI symbolic music
by: Wiszenko, Michał, et al.
Published: (2024)
Conformer-based Ultrasound-to-Speech Conversion
by: Ibrahimov, Ibrahim, et al.
Published: (2025)
Dance-to-Music Generation with Encoder-based Textual Inversion
by: Li, Sifei, et al.
Published: (2024)
Exploring compressibility of transformer based text-to-music (TTM) models
by: Moschopoulos, Vasileios, et al.
Published: (2024)
Attentive-based Multi-level Feature Fusion for Voice Disorder Diagnosis
by: Shen, Lipeng, et al.
Published: (2024)
Visual-based spatial audio generation system for multi-speaker environments
by: Liu, Xiaojing, et al.
Published: (2025)
REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion
by: Biyani, Ishan D., et al.
Published: (2025)
Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement
by: Su, Fei, et al.
Published: (2026)
A multimodal dynamical variational autoencoder for audiovisual speech representation learning
by: Sadok, Samir, et al.
Published: (2023)
LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition
by: Yu, Fan, et al.
Published: (2024)
Listening Between the Lines: Synthetic Speech Detection Disregarding Verbal Content
by: Salvi, Davide, et al.
Published: (2024)
M6: Multi-generator, Multi-domain, Multi-lingual and cultural, Multi-genres, Multi-instrument Machine-Generated Music Detection Databases
by: Li, Yupei, et al.
Published: (2024)
Multimodal Emotion Recognition from Raw Audio with Sinc-convolution
by: Zhang, Xiaohui, et al.
Published: (2024)
Intelligent Text-Conditioned Music Generation
by: Xie, Zhouyao, et al.
Published: (2024)
Zero-Shot Fake Video Detection by Audio-Visual Consistency
by: Li, Xiaolou, et al.
Published: (2024)
STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment
by: Ren, Yong, et al.
Published: (2024)
HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts
by: Niu, Xinlei, et al.
Published: (2024)
EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing
by: Cong, Gaoxiang, et al.
Published: (2024)
Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection
by: Huang, Lian, et al.
Published: (2024)
SonicVisionLM: Playing Sound with Vision Language Models
by: Xie, Zhifeng, et al.
Published: (2024)
Robust Wake Word Spotting With Frame-Level Cross-Modal Attention Based Audio-Visual Conformer
by: Wang, Haoxu, et al.
Published: (2024)
POLIPHONE: A Dataset for Smartphone Model Identification from Audio Recordings
by: Salvi, Davide, et al.
Published: (2024)
CoheDancers: Enhancing Interactive Group Dance Generation through Music-Driven Coherence Decomposition
by: Yang, Kaixing, et al.
Published: (2024)
X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion
by: Sun, Chang, et al.
Published: (2024)
Flexible Control in Symbolic Music Generation via Musical Metadata
by: Han, Sangjun, et al.
Published: (2024)
SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text
by: Liu, Haohe, et al.
Published: (2024)
FastTalker: Jointly Generating Speech and Conversational Gestures from Text
by: Guo, Zixin, et al.
Published: (2024)
MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
by: Liu, Shansong, et al.
Published: (2024)
Human-Inspired Computing for Robust and Efficient Audio-Visual Speech Recognition
by: Liu, Qianhui, et al.
Published: (2024)
MusicAOG: an Energy-Based Model for Learning and Sampling a Hierarchical Representation of Symbolic Music
by: Qian, Yikai, et al.
Published: (2024)
Semi-Supervised Contrastive Learning for Controllable Video-to-Music Retrieval
by: Stewart, Shanti, et al.
Published: (2024)
MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence
by: You, Fuming, et al.
Published: (2024)
Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation
by: Guo, Hongming, et al.
Published: (2024)
LoVA: Long-form Video-to-Audio Generation
by: Cheng, Xin, et al.
Published: (2024)
CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection
by: Zang, Yongyi, et al.
Published: (2024)
DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training
by: Liu, Shengqiang, et al.
Published: (2024)
Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis
by: Huang, Zhiqi, et al.
Published: (2024)