:: Library Catalog

Bibliographic Details
Main Authors:	Hirvonen, Toni; Namazi, Mahmoud
Format:	Preprint
Published:	2024
Subjects:	Sound; Machine Learning; Multimedia; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2411.12008

Similar Items

HARP: A Large-Scale Higher-Order Ambisonic Room Impulse Response Dataset
by: Saini, Shivam, et al.
Published: (2024)

Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement
by: Bandyopadhyay, Tathagata
Published: (2024)

Siamese Residual Neural Network for Musical Shape Evaluation in Piano Performance Assessment
by: Li, Xiaoquan, et al.
Published: (2024)

Efficient Feature Extraction and Late Fusion Strategy for Audiovisual Emotional Mimicry Intensity Estimation
by: Yu, Jun, et al.
Published: (2024)

A Recurrent Neural Network Approach to the Answering Machine Detection Problem
by: Altwlkany, Kemal, et al.
Published: (2024)

Source Separation of Multi-source Raw Music using a Residual Quantized Variational Autoencoder
by: Berti, Leonardo
Published: (2024)

CHORDONOMICON: A Dataset of 666,000 Songs and their Chord Progressions
by: Kantarelis, Spyridon, et al.
Published: (2024)

Audiopedia: Audio QA with Knowledge
by: Penamakuri, Abhirama Subramanyam, et al.
Published: (2024)

Leveraging LLM Embeddings for Cross Dataset Label Alignment and Zero Shot Music Emotion Prediction
by: Liu, Renhang, et al.
Published: (2024)

LSTMSE-Net: Long Short Term Speech Enhancement Network for Audio-visual Speech Enhancement
by: Jain, Arnav, et al.
Published: (2024)

Unified Microphone Conversion: Many-to-Many Device Mapping via Feature-wise Linear Modulation
by: Ryu, Myeonghoon, et al.
Published: (2024)

A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation
by: Ishii, Masato, et al.
Published: (2024)

MidiCaps: A large-scale MIDI dataset with text captions
by: Melechovsky, Jan, et al.
Published: (2024)

Microphone Conversion: Mitigating Device Variability in Sound Event Classification
by: Ryu, Myeonghoon, et al.
Published: (2024)

Just Label the Repeats for In-The-Wild Audio-to-Score Alignment
by: Bukey, Irmak, et al.
Published: (2024)

Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment
by: Luo, Weiliang
Published: (2024)

Network Bending of Diffusion Models for Audio-Visual Generation
by: Dzwonczyk, Luke, et al.
Published: (2024)

Speech Separation with Pretrained Frontend to Minimize Domain Mismatch
by: Wang, Wupeng, et al.
Published: (2024)

AWARE: Audio Watermarking with Adversarial Resistance to Edits
by: Pavlović, Kosta, et al.
Published: (2025)

Bridging The Multi-Modality Gaps of Audio, Visual and Linguistic for Speech Enhancement
by: Lin, Meng-Ping, et al.
Published: (2025)

Multimodal Emotion Coupling via Speech-to-Facial and Bodily Gestures in Dyadic Interaction
by: Herbuela, Von Ralph Dane Marquez, et al.
Published: (2025)

Revisit Modality Imbalance at the Decision Layer
by: Ma, Xiaoyu, et al.
Published: (2025)

Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction
by: Wu, Wenxuan, et al.
Published: (2025)

ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
by: Bai, Yatong, et al.
Published: (2023)

Multimodal Speech Enhancement Using Burst Propagation
by: Raza, Mohsin, et al.
Published: (2022)

$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction
by: Wu, Wenxuan, et al.
Published: (2025)

BERT-like Pre-training for Symbolic Piano Music Classification Tasks
by: Chou, Yi-Hui, et al.
Published: (2021)

MIDI-GPT: A Controllable Generative Model for Computer-Assisted Multitrack Music Composition
by: Pasquier, Philippe, et al.
Published: (2025)

Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers
by: Wang, Juncheng, et al.
Published: (2025)

A Traditional Approach to Symbolic Piano Continuation
by: Zhou-Zheng, Christian, et al.
Published: (2025)

Versatile audio-visual learning for emotion recognition
by: Goncalves, Lucas, et al.
Published: (2023)

Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation
by: Kim, Sungnyun, et al.
Published: (2025)

IML-Spikeformer: Input-aware Multi-Level Spiking Transformer for Speech Processing
by: Song, Zeyang, et al.
Published: (2025)

A multimodal dynamical variational autoencoder for audiovisual speech representation learning
by: Sadok, Samir, et al.
Published: (2023)

A vector quantized masked autoencoder for audiovisual speech emotion recognition
by: Sadok, Samir, et al.
Published: (2023)

Leveraging Pre-Trained Models for Multimodal Class-Incremental Learning under Adaptive Fusion
by: Chen, Yukun, et al.
Published: (2025)

Improving BERT for Symbolic Music Understanding Using Token Denoising and Pianoroll Prediction
by: Wang, Jun-You, et al.
Published: (2025)

Multimodal Dataset Normalization and Perceptual Validation for Music-Taste Correspondences
by: Spanio, Matteo, et al.
Published: (2026)

Tiny is not small enough: High-quality, low-resource facial animation models through hybrid knowledge distillation
by: Han, Zhen, et al.
Published: (2025)

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
by: Ji, Shengpeng, et al.
Published: (2024)