| Main Authors: | Elgiriyewithana, Nidula; Kodikara, N. D. |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2409.04949 |
Similar Items
Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification
by: Zhu, Wentao
Published: (2024)
Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows
by: Mo, Shentong, et al.
Published: (2026)
Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification
by: Zhu, Wentao
Published: (2024)
Synthesizing Audio from Silent Video using Sequence to Sequence Modeling
by: Belinchon, Hugo Garrido-Lestache, et al.
Published: (2024)
Spectral and Rhythm Features for Audio Classification with Deep Convolutional Neural Networks
by: Wolf-Monheim, Friedrich
Published: (2024)
From Vision to Sound: Advancing Audio Anomaly Detection with Vision-Based Algorithms
by: Barusco, Manuel, et al.
Published: (2025)
Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition
by: Gungor, Cagri, et al.
Published: (2024)
GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining
by: Mo, Shentong, et al.
Published: (2026)
DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis
by: Ahmad, Zeeshan, et al.
Published: (2025)
Tell What You Hear From What You See -- Video to Audio Generation Through Text
by: Liu, Xiulong, et al.
Published: (2024)
Spectral and Rhythm Feature Performance Evaluation for Category and Class Level Audio Classification with Deep Convolutional Neural Networks
by: Wolf-Monheim, Friedrich
Published: (2025)
3DFacePolicy: Audio-Driven 3D Facial Animation Based on Action Control
by: Sha, Xuanmeng, et al.
Published: (2024)
Text-to-Audio Generation Synchronized with Videos
by: Mo, Shentong, et al.
Published: (2024)
AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection
by: Hashmi, Ammarah, et al.
Published: (2023)
Diffusion-based Unsupervised Audio-visual Speech Enhancement
by: Ayilo, Jean-Eudes, et al.
Published: (2024)
Unified Video-Language Pre-training with Synchronized Audio
by: Mo, Shentong, et al.
Published: (2024)
Aligning Audio-Visual Joint Representations with an Agentic Workflow
by: Mo, Shentong, et al.
Published: (2024)
SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing
by: Chen, Mingfei, et al.
Published: (2025)
Dynamic Cross Attention for Audio-Visual Person Verification
by: Praveen, R. Gnana, et al.
Published: (2024)
Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection
by: Joo, Seohyun, et al.
Published: (2026)
Continual Audio-Visual Sound Separation
by: Pian, Weiguo, et al.
Published: (2024)
Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation
by: Zhou, Jinxing, et al.
Published: (2026)
Investigating the Invertibility of Multimodal Latent Spaces: Limitations of Optimization-Based Methods
by: Park, Siwoo
Published: (2025)
WhaleNet: a Novel Deep Learning Architecture for Marine Mammals Vocalizations on Watkins Marine Mammal Sound Database
by: Licciardi, Alessandro, et al.
Published: (2024)
MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
by: Takahashi, Akira, et al.
Published: (2025)
What's Making That Sound Right Now? Video-centric Audio-Visual Localization
by: Choi, Hahyeon, et al.
Published: (2025)
Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer
by: Wang, Yaoting, et al.
Published: (2023)
Sounding that Object: Interactive Object-Aware Image to Audio Generation
by: Li, Tingle, et al.
Published: (2025)
Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization
by: Katamneni, Vinaya Sree, et al.
Published: (2024)
A Comprehensive Survey of Hallucination in Large Language, Image, Video and Audio Foundation Models
by: Sahoo, Pranab, et al.
Published: (2024)
Hearing Anywhere in Any Environment
by: Liu, Xiulong, et al.
Published: (2025)
Data Augmentation Using Neural Acoustic Fields With Retrieval-Augmented Pre-training
by: Ick, Christopher, et al.
Published: (2025)
Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models
by: Lin, Zhiqiu, et al.
Published: (2023)
InfantCryNet: A Data-driven Framework for Intelligent Analysis of Infant Cries
by: Hong, Mengze, et al.
Published: (2024)
SONICS: Synthetic Or Not -- Identifying Counterfeit Songs
by: Rahman, Md Awsafur, et al.
Published: (2024)
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
by: Xie, Zhifei, et al.
Published: (2024)
The Effect of Perceptual Metrics on Music Representation Learning for Genre Classification
by: Namgyal, Tashi, et al.
Published: (2024)
Direction-Aware Neural Acoustic Fields for Few-Shot Interpolation of Ambisonic Impulse Responses
by: Ick, Christopher, et al.
Published: (2025)
NOTA: Multimodal Music Notation Understanding for Visual Large Language Model
by: Tang, Mingni, et al.
Published: (2025)
VisionScores -- A system-segmented image score dataset for deep learning tasks
by: Amezcua, Alejandro Romero, et al.
Published: (2025)