:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Géré, Léo, Rigaux, Philippe, Audebert, Nicolas
Format:	Preprint
Published:	2024
Subjects:	Sound Multimedia Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2407.17536
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

musif: a Python package for symbolic music feature extraction
by: Llorens, Ana, et al.
Published: (2023)

Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions
by: Yuan, Yi, et al.
Published: (2024)

Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer
by: Li, Jizhen, et al.
Published: (2024)

MidiTok Visualizer: a tool for visualization and analysis of tokenized MIDI symbolic music
by: Wiszenko, Michał, et al.
Published: (2024)

Conformer-based Ultrasound-to-Speech Conversion
by: Ibrahimov, Ibrahim, et al.
Published: (2025)

Dance-to-Music Generation with Encoder-based Textual Inversion
by: Li, Sifei, et al.
Published: (2024)

Exploring compressibility of transformer based text-to-music (TTM) models
by: Moschopoulos, Vasileios, et al.
Published: (2024)

Attentive-based Multi-level Feature Fusion for Voice Disorder Diagnosis
by: Shen, Lipeng, et al.
Published: (2024)

Visual-based spatial audio generation system for multi-speaker environments
by: Liu, Xiaojing, et al.
Published: (2025)

REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion
by: Biyani, Ishan D., et al.
Published: (2025)

Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement
by: Su, Fei, et al.
Published: (2026)

A multimodal dynamical variational autoencoder for audiovisual speech representation learning
by: Sadok, Samir, et al.
Published: (2023)

LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition
by: Yu, Fan, et al.
Published: (2024)

Listening Between the Lines: Synthetic Speech Detection Disregarding Verbal Content
by: Salvi, Davide, et al.
Published: (2024)

M6: Multi-generator, Multi-domain, Multi-lingual and cultural, Multi-genres, Multi-instrument Machine-Generated Music Detection Databases
by: Li, Yupei, et al.
Published: (2024)

Multimodal Emotion Recognition from Raw Audio with Sinc-convolution
by: Zhang, Xiaohui, et al.
Published: (2024)

Intelligent Text-Conditioned Music Generation
by: Xie, Zhouyao, et al.
Published: (2024)

Zero-Shot Fake Video Detection by Audio-Visual Consistency
by: Li, Xiaolou, et al.
Published: (2024)

STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment
by: Ren, Yong, et al.
Published: (2024)

HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts
by: Niu, Xinlei, et al.
Published: (2024)

EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing
by: Cong, Gaoxiang, et al.
Published: (2024)

Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection
by: Huang, Lian, et al.
Published: (2024)

SonicVisionLM: Playing Sound with Vision Language Models
by: Xie, Zhifeng, et al.
Published: (2024)

Robust Wake Word Spotting With Frame-Level Cross-Modal Attention Based Audio-Visual Conformer
by: Wang, Haoxu, et al.
Published: (2024)

POLIPHONE: A Dataset for Smartphone Model Identification from Audio Recordings
by: Salvi, Davide, et al.
Published: (2024)

CoheDancers: Enhancing Interactive Group Dance Generation through Music-Driven Coherence Decomposition
by: Yang, Kaixing, et al.
Published: (2024)

X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion
by: Sun, Chang, et al.
Published: (2024)

Flexible Control in Symbolic Music Generation via Musical Metadata
by: Han, Sangjun, et al.
Published: (2024)

SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text
by: Liu, Haohe, et al.
Published: (2024)

FastTalker: Jointly Generating Speech and Conversational Gestures from Text
by: Guo, Zixin, et al.
Published: (2024)

MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
by: Liu, Shansong, et al.
Published: (2024)

Human-Inspired Computing for Robust and Efficient Audio-Visual Speech Recognition
by: Liu, Qianhui, et al.
Published: (2024)

MusicAOG: an Energy-Based Model for Learning and Sampling a Hierarchical Representation of Symbolic Music
by: Qian, Yikai, et al.
Published: (2024)

Semi-Supervised Contrastive Learning for Controllable Video-to-Music Retrieval
by: Stewart, Shanti, et al.
Published: (2024)

MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence
by: You, Fuming, et al.
Published: (2024)

Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation
by: Guo, Hongming, et al.
Published: (2024)

LoVA: Long-form Video-to-Audio Generation
by: Cheng, Xin, et al.
Published: (2024)

CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection
by: Zang, Yongyi, et al.
Published: (2024)

DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training
by: Liu, Shengqiang, et al.
Published: (2024)

Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis
by: Huang, Zhiqi, et al.
Published: (2024)