:: Library Catalog

Bibliographic Details
Main Authors:	Elgiriyewithana, Nidula; Kodikara, N. D.
Format:	Preprint
Published:	2024
Subjects:	Sound; Artificial Intelligence; Computer Vision and Pattern Recognition; Machine Learning; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2409.04949

Similar Items

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification
by: Zhu, Wentao
Published: (2024)

Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows
by: Mo, Shentong, et al.
Published: (2026)

Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification
by: Zhu, Wentao
Published: (2024)

Synthesizing Audio from Silent Video using Sequence to Sequence Modeling
by: Belinchon, Hugo Garrido-Lestache, et al.
Published: (2024)

Spectral and Rhythm Features for Audio Classification with Deep Convolutional Neural Networks
by: Wolf-Monheim, Friedrich
Published: (2024)

From Vision to Sound: Advancing Audio Anomaly Detection with Vision-Based Algorithms
by: Barusco, Manuel, et al.
Published: (2025)

Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition
by: Gungor, Cagri, et al.
Published: (2024)

GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining
by: Mo, Shentong, et al.
Published: (2026)

DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis
by: Ahmad, Zeeshan, et al.
Published: (2025)

Tell What You Hear From What You See -- Video to Audio Generation Through Text
by: Liu, Xiulong, et al.
Published: (2024)

Spectral and Rhythm Feature Performance Evaluation for Category and Class Level Audio Classification with Deep Convolutional Neural Networks
by: Wolf-Monheim, Friedrich
Published: (2025)

3DFacePolicy: Audio-Driven 3D Facial Animation Based on Action Control
by: Sha, Xuanmeng, et al.
Published: (2024)

Text-to-Audio Generation Synchronized with Videos
by: Mo, Shentong, et al.
Published: (2024)

AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection
by: Hashmi, Ammarah, et al.
Published: (2023)

Diffusion-based Unsupervised Audio-visual Speech Enhancement
by: Ayilo, Jean-Eudes, et al.
Published: (2024)

Unified Video-Language Pre-training with Synchronized Audio
by: Mo, Shentong, et al.
Published: (2024)

Aligning Audio-Visual Joint Representations with an Agentic Workflow
by: Mo, Shentong, et al.
Published: (2024)

SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing
by: Chen, Mingfei, et al.
Published: (2025)

Dynamic Cross Attention for Audio-Visual Person Verification
by: Praveen, R. Gnana, et al.
Published: (2024)

Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection
by: Joo, Seohyun, et al.
Published: (2026)

Continual Audio-Visual Sound Separation
by: Pian, Weiguo, et al.
Published: (2024)

Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation
by: Zhou, Jinxing, et al.
Published: (2026)

Investigating the Invertibility of Multimodal Latent Spaces: Limitations of Optimization-Based Methods
by: Park, Siwoo
Published: (2025)

WhaleNet: a Novel Deep Learning Architecture for Marine Mammals Vocalizations on Watkins Marine Mammal Sound Database
by: Licciardi, Alessandro, et al.
Published: (2024)

MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
by: Takahashi, Akira, et al.
Published: (2025)

What's Making That Sound Right Now? Video-centric Audio-Visual Localization
by: Choi, Hahyeon, et al.
Published: (2025)

Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer
by: Wang, Yaoting, et al.
Published: (2023)

Sounding that Object: Interactive Object-Aware Image to Audio Generation
by: Li, Tingle, et al.
Published: (2025)

Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization
by: Katamneni, Vinaya Sree, et al.
Published: (2024)

A Comprehensive Survey of Hallucination in Large Language, Image, Video and Audio Foundation Models
by: Sahoo, Pranab, et al.
Published: (2024)

Hearing Anywhere in Any Environment
by: Liu, Xiulong, et al.
Published: (2025)

Data Augmentation Using Neural Acoustic Fields With Retrieval-Augmented Pre-training
by: Ick, Christopher, et al.
Published: (2025)

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models
by: Lin, Zhiqiu, et al.
Published: (2023)

InfantCryNet: A Data-driven Framework for Intelligent Analysis of Infant Cries
by: Hong, Mengze, et al.
Published: (2024)

SONICS: Synthetic Or Not -- Identifying Counterfeit Songs
by: Rahman, Md Awsafur, et al.
Published: (2024)

Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
by: Xie, Zhifei, et al.
Published: (2024)

The Effect of Perceptual Metrics on Music Representation Learning for Genre Classification
by: Namgyal, Tashi, et al.
Published: (2024)

Direction-Aware Neural Acoustic Fields for Few-Shot Interpolation of Ambisonic Impulse Responses
by: Ick, Christopher, et al.
Published: (2025)

NOTA: Multimodal Music Notation Understanding for Visual Large Language Model
by: Tang, Mingni, et al.
Published: (2025)

VisionScores -- A system-segmented image score dataset for deep learning tasks
by: Amezcua, Alejandro Romero, et al.
Published: (2025)