:: Library Catalog

Bibliographic Details
Main Author:	Park, Siwoo
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence; Computer Vision and Pattern Recognition; Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2507.23010

Similar Items

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models
by: Lin, Zhiqiu, et al.
Published: (2023)

NOTA: Multimodal Music Notation Understanding for Visual Large Language Model
by: Tang, Mingni, et al.
Published: (2025)

Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition
by: Gungor, Cagri, et al.
Published: (2024)

DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap
by: Mo, Shentong, et al.
Published: (2025)

Attention-Based Efficient Breath Sound Removal in Studio Audio Recordings
by: Elgiriyewithana, Nidula, et al.
Published: (2024)

Multimodal Spatial Language Maps for Robot Navigation and Manipulation
by: Huang, Chenguang, et al.
Published: (2025)

Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification
by: Zhu, Wentao
Published: (2024)

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification
by: Zhu, Wentao
Published: (2024)

AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Deepfake Detection of Frontal Face Videos
by: Shahzad, Sahibzada Adil, et al.
Published: (2023)

MMFformer: Multimodal Fusion Transformer Network for Depression Detection
by: Haque, Md Rezwanul, et al.
Published: (2025)

Ming-Omni: A Unified Multimodal Model for Perception and Generation
by: Inclusion AI, et al.
Published: (2025)

Modality-Inconsistent Continual Learning of Multimodal Large Language Models
by: Pian, Weiguo, et al.
Published: (2024)

Hearing Anywhere in Any Environment
by: Liu, Xiulong, et al.
Published: (2025)

Data Augmentation Using Neural Acoustic Fields With Retrieval-Augmented Pre-training
by: Ick, Christopher, et al.
Published: (2025)

Spectral and Rhythm Feature Performance Evaluation for Category and Class Level Audio Classification with Deep Convolutional Neural Networks
by: Wolf-Monheim, Friedrich
Published: (2025)

Direction-Aware Neural Acoustic Fields for Few-Shot Interpolation of Ambisonic Impulse Responses
by: Ick, Christopher, et al.
Published: (2025)

VisionScores -- A system-segmented image score dataset for deep learning tasks
by: Amezcua, Alejandro Romero, et al.
Published: (2025)

Chirp Localization via Fine-Tuned Transformer Model: A Proof-of-Concept Study
by: Bahador, Nooshin, et al.
Published: (2025)

DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis
by: Ahmad, Zeeshan, et al.
Published: (2025)

Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows
by: Mo, Shentong, et al.
Published: (2026)

Tell What You Hear From What You See -- Video to Audio Generation Through Text
by: Liu, Xiulong, et al.
Published: (2024)

InfantCryNet: A Data-driven Framework for Intelligent Analysis of Infant Cries
by: Hong, Mengze, et al.
Published: (2024)

SONICS: Synthetic Or Not -- Identifying Counterfeit Songs
by: Rahman, Md Awsafur, et al.
Published: (2024)

Synthesizing Audio from Silent Video using Sequence to Sequence Modeling
by: Belinchon, Hugo Garrido-Lestache, et al.
Published: (2024)

Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
by: Xie, Zhifei, et al.
Published: (2024)

The Effect of Perceptual Metrics on Music Representation Learning for Genre Classification
by: Namgyal, Tashi, et al.
Published: (2024)

Spectral and Rhythm Features for Audio Classification with Deep Convolutional Neural Networks
by: Wolf-Monheim, Friedrich
Published: (2024)

GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining
by: Mo, Shentong, et al.
Published: (2026)

3DFacePolicy: Audio-Driven 3D Facial Animation Based on Action Control
by: Sha, Xuanmeng, et al.
Published: (2024)

AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection
by: Hashmi, Ammarah, et al.
Published: (2023)

Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion
by: Sun, Yu, et al.
Published: (2025)

PianoVAM: A Multimodal Piano Performance Dataset
by: Kim, Yonghyun, et al.
Published: (2025)

Diffusion-based Unsupervised Audio-visual Speech Enhancement
by: Ayilo, Jean-Eudes, et al.
Published: (2024)

WhaleNet: a Novel Deep Learning Architecture for Marine Mammals Vocalizations on Watkins Marine Mammal Sound Database
by: Licciardi, Alessandro, et al.
Published: (2024)

SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing
by: Chen, Mingfei, et al.
Published: (2025)

Unified Video-Language Pre-training with Synchronized Audio
by: Mo, Shentong, et al.
Published: (2024)

Aligning Audio-Visual Joint Representations with an Agentic Workflow
by: Mo, Shentong, et al.
Published: (2024)

Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook
by: Croitoru, Florinel-Alin, et al.
Published: (2024)

Text-to-Audio Generation Synchronized with Videos
by: Mo, Shentong, et al.
Published: (2024)

Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation
by: Zhou, Jinxing, et al.
Published: (2026)