| Main Author: | Park, Siwoo |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.23010 |
Similar Items
Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models
by: Lin, Zhiqiu, et al.
Published: (2023)
NOTA: Multimodal Music Notation Understanding for Visual Large Language Model
by: Tang, Mingni, et al.
Published: (2025)
Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition
by: Gungor, Cagri, et al.
Published: (2024)
DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap
by: Mo, Shentong, et al.
Published: (2025)
Attention-Based Efficient Breath Sound Removal in Studio Audio Recordings
by: Elgiriyewithana, Nidula, et al.
Published: (2024)
Multimodal Spatial Language Maps for Robot Navigation and Manipulation
by: Huang, Chenguang, et al.
Published: (2025)
Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification
by: Zhu, Wentao
Published: (2024)
Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification
by: Zhu, Wentao
Published: (2024)
AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Deepfake Detection of Frontal Face Videos
by: Shahzad, Sahibzada Adil, et al.
Published: (2023)
MMFformer: Multimodal Fusion Transformer Network for Depression Detection
by: Haque, Md Rezwanul, et al.
Published: (2025)
Ming-Omni: A Unified Multimodal Model for Perception and Generation
by: AI, Inclusion, et al.
Published: (2025)
Modality-Inconsistent Continual Learning of Multimodal Large Language Models
by: Pian, Weiguo, et al.
Published: (2024)
Hearing Anywhere in Any Environment
by: Liu, Xiulong, et al.
Published: (2025)
Data Augmentation Using Neural Acoustic Fields With Retrieval-Augmented Pre-training
by: Ick, Christopher, et al.
Published: (2025)
Spectral and Rhythm Feature Performance Evaluation for Category and Class Level Audio Classification with Deep Convolutional Neural Networks
by: Wolf-Monheim, Friedrich
Published: (2025)
Direction-Aware Neural Acoustic Fields for Few-Shot Interpolation of Ambisonic Impulse Responses
by: Ick, Christopher, et al.
Published: (2025)
VisionScores -- A system-segmented image score dataset for deep learning tasks
by: Amezcua, Alejandro Romero, et al.
Published: (2025)
Chirp Localization via Fine-Tuned Transformer Model: A Proof-of-Concept Study
by: Bahador, Nooshin, et al.
Published: (2025)
DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis
by: Ahmad, Zeeshan, et al.
Published: (2025)
Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows
by: Mo, Shentong, et al.
Published: (2026)
Tell What You Hear From What You See -- Video to Audio Generation Through Text
by: Liu, Xiulong, et al.
Published: (2024)
InfantCryNet: A Data-driven Framework for Intelligent Analysis of Infant Cries
by: Hong, Mengze, et al.
Published: (2024)
SONICS: Synthetic Or Not -- Identifying Counterfeit Songs
by: Rahman, Md Awsafur, et al.
Published: (2024)
Synthesizing Audio from Silent Video using Sequence to Sequence Modeling
by: Belinchon, Hugo Garrido-Lestache, et al.
Published: (2024)
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
by: Xie, Zhifei, et al.
Published: (2024)
The Effect of Perceptual Metrics on Music Representation Learning for Genre Classification
by: Namgyal, Tashi, et al.
Published: (2024)
Spectral and Rhythm Features for Audio Classification with Deep Convolutional Neural Networks
by: Wolf-Monheim, Friedrich
Published: (2024)
GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining
by: Mo, Shentong, et al.
Published: (2026)
3DFacePolicy: Audio-Driven 3D Facial Animation Based on Action Control
by: Sha, Xuanmeng, et al.
Published: (2024)
AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection
by: Hashmi, Ammarah, et al.
Published: (2023)
Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion
by: Sun, Yu, et al.
Published: (2025)
PianoVAM: A Multimodal Piano Performance Dataset
by: Kim, Yonghyun, et al.
Published: (2025)
Diffusion-based Unsupervised Audio-visual Speech Enhancement
by: Ayilo, Jean-Eudes, et al.
Published: (2024)
WhaleNet: a Novel Deep Learning Architecture for Marine Mammals Vocalizations on Watkins Marine Mammal Sound Database
by: Licciardi, Alessandro, et al.
Published: (2024)
SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing
by: Chen, Mingfei, et al.
Published: (2025)
Unified Video-Language Pre-training with Synchronized Audio
by: Mo, Shentong, et al.
Published: (2024)
Aligning Audio-Visual Joint Representations with an Agentic Workflow
by: Mo, Shentong, et al.
Published: (2024)
Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook
by: Croitoru, Florinel-Alin, et al.
Published: (2024)
Text-to-Audio Generation Synchronized with Videos
by: Mo, Shentong, et al.
Published: (2024)
Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation
by: Zhou, Jinxing, et al.
Published: (2026)