| Main Authors: | Hirvonen, Toni, Namazi, Mahmoud |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2411.12008 |
Similar Items
HARP: A Large-Scale Higher-Order Ambisonic Room Impulse Response Dataset
by: Saini, Shivam, et al.
Published: (2024)
Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement
by: Bandyopadhyay, Tathagata
Published: (2024)
Siamese Residual Neural Network for Musical Shape Evaluation in Piano Performance Assessment
by: Li, Xiaoquan, et al.
Published: (2024)
Efficient Feature Extraction and Late Fusion Strategy for Audiovisual Emotional Mimicry Intensity Estimation
by: Yu, Jun, et al.
Published: (2024)
A Recurrent Neural Network Approach to the Answering Machine Detection Problem
by: Altwlkany, Kemal, et al.
Published: (2024)
Source Separation of Multi-source Raw Music using a Residual Quantized Variational Autoencoder
by: Berti, Leonardo
Published: (2024)
CHORDONOMICON: A Dataset of 666,000 Songs and their Chord Progressions
by: Kantarelis, Spyridon, et al.
Published: (2024)
Audiopedia: Audio QA with Knowledge
by: Penamakuri, Abhirama Subramanyam, et al.
Published: (2024)
Leveraging LLM Embeddings for Cross Dataset Label Alignment and Zero Shot Music Emotion Prediction
by: Liu, Renhang, et al.
Published: (2024)
LSTMSE-Net: Long Short Term Speech Enhancement Network for Audio-visual Speech Enhancement
by: Jain, Arnav, et al.
Published: (2024)
Unified Microphone Conversion: Many-to-Many Device Mapping via Feature-wise Linear Modulation
by: Ryu, Myeonghoon, et al.
Published: (2024)
A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation
by: Ishii, Masato, et al.
Published: (2024)
MidiCaps: A large-scale MIDI dataset with text captions
by: Melechovsky, Jan, et al.
Published: (2024)
Microphone Conversion: Mitigating Device Variability in Sound Event Classification
by: Ryu, Myeonghoon, et al.
Published: (2024)
Just Label the Repeats for In-The-Wild Audio-to-Score Alignment
by: Bukey, Irmak, et al.
Published: (2024)
Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment
by: Luo, Weiliang
Published: (2024)
Network Bending of Diffusion Models for Audio-Visual Generation
by: Dzwonczyk, Luke, et al.
Published: (2024)
Speech Separation with Pretrained Frontend to Minimize Domain Mismatch
by: Wang, Wupeng, et al.
Published: (2024)
AWARE: Audio Watermarking with Adversarial Resistance to Edits
by: Pavlović, Kosta, et al.
Published: (2025)
Bridging The Multi-Modality Gaps of Audio, Visual and Linguistic for Speech Enhancement
by: Lin, Meng-Ping, et al.
Published: (2025)
Multimodal Emotion Coupling via Speech-to-Facial and Bodily Gestures in Dyadic Interaction
by: Herbuela, Von Ralph Dane Marquez, et al.
Published: (2025)
Revisit Modality Imbalance at the Decision Layer
by: Ma, Xiaoyu, et al.
Published: (2025)
Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction
by: Wu, Wenxuan, et al.
Published: (2025)
ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
by: Bai, Yatong, et al.
Published: (2023)
Multimodal Speech Enhancement Using Burst Propagation
by: Raza, Mohsin, et al.
Published: (2022)
$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction
by: Wu, Wenxuan, et al.
Published: (2025)
BERT-like Pre-training for Symbolic Piano Music Classification Tasks
by: Chou, Yi-Hui, et al.
Published: (2021)
MIDI-GPT: A Controllable Generative Model for Computer-Assisted Multitrack Music Composition
by: Pasquier, Philippe, et al.
Published: (2025)
Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers
by: Wang, Juncheng, et al.
Published: (2025)
A Traditional Approach to Symbolic Piano Continuation
by: Zhou-Zheng, Christian, et al.
Published: (2025)
Versatile audio-visual learning for emotion recognition
by: Goncalves, Lucas, et al.
Published: (2023)
Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation
by: Kim, Sungnyun, et al.
Published: (2025)
IML-Spikeformer: Input-aware Multi-Level Spiking Transformer for Speech Processing
by: Song, Zeyang, et al.
Published: (2025)
A multimodal dynamical variational autoencoder for audiovisual speech representation learning
by: Sadok, Samir, et al.
Published: (2023)
A vector quantized masked autoencoder for audiovisual speech emotion recognition
by: Sadok, Samir, et al.
Published: (2023)
Leveraging Pre-Trained Models for Multimodal Class-Incremental Learning under Adaptive Fusion
by: Chen, Yukun, et al.
Published: (2025)
Improving BERT for Symbolic Music Understanding Using Token Denoising and Pianoroll Prediction
by: Wang, Jun-You, et al.
Published: (2025)
Multimodal Dataset Normalization and Perceptual Validation for Music-Taste Correspondences
by: Spanio, Matteo, et al.
Published: (2026)
Tiny is not small enough: High-quality, low-resource facial animation models through hybrid knowledge distillation
by: Han, Zhen, et al.
Published: (2025)
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
by: Ji, Shengpeng, et al.
Published: (2024)