| Main Author: | Sulun, Serkan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.07063 |
Similar Items
V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
by: Lin, Yan-Bo, et al.
Published: (2026)
Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment
by: Hong, Jiaying, et al.
Published: (2025)
VEMOCLAP: A video emotion classification web application
by: Sulun, Serkan, et al.
Published: (2024)
Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries
by: Sulun, Serkan, et al.
Published: (2025)
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
by: Tian, Zeyue, et al.
Published: (2024)
Text-to-Audio Generation Synchronized with Videos
by: Mo, Shentong, et al.
Published: (2024)
ChoreoMuse: Robust Music-to-Dance Video Generation with Style Transfer and Beat-Adherent Motion
by: Wang, Xuanchen, et al.
Published: (2025)
SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment
by: Shahzad, Sahibzada Adil, et al.
Published: (2026)
Multimodal Real-Time Anomaly Detection and Industrial Applications
by: Verma, Aman, et al.
Published: (2025)
Diffusion Models for Joint Audio-Video Generation
by: La Torre, Alejandro Paredes
Published: (2026)
Apollo: Unified Multi-Task Audio-Video Joint Generation
by: Wang, Jun, et al.
Published: (2026)
Do Joint Audio-Video Generation Models Understand Physics?
by: Cui, Zijun, et al.
Published: (2026)
FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation
by: Tan, Weiting, et al.
Published: (2026)
TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation
by: Wang, Zhenzhi, et al.
Published: (2025)
Unified Video-Language Pre-training with Synchronized Audio
by: Mo, Shentong, et al.
Published: (2024)
AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models
by: Yang, Jialiang, et al.
Published: (2026)
Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification
by: Zhu, Wentao
Published: (2024)
Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification
by: Zhu, Wentao
Published: (2024)
OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset
by: Park, Jeongkyun, et al.
Published: (2023)
AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection
by: Hashmi, Ammarah, et al.
Published: (2023)
Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook
by: Croitoru, Florinel-Alin, et al.
Published: (2024)
AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Deepfake Detection of Frontal Face Videos
by: Shahzad, Sahibzada Adil, et al.
Published: (2023)
V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
by: Su, Kun, et al.
Published: (2023)
Movie Trailer Genre Classification Using Multimodal Pretrained Features
by: Sulun, Serkan, et al.
Published: (2024)
Vision-to-Music Generation: A Survey
by: Wang, Zhaokai, et al.
Published: (2025)
DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization
by: Nguyen, Ngoc-Son, et al.
Published: (2026)
EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation
by: Izzati, Fathinah, et al.
Published: (2025)
Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
by: Tian, Zeyue, et al.
Published: (2026)
Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation
by: Zhou, Jinxing, et al.
Published: (2026)
Aligning Audio-Visual Joint Representations with an Agentic Workflow
by: Mo, Shentong, et al.
Published: (2024)
Understanding Audiovisual Deepfake Detection: Techniques, Challenges, Human Factors and Perceptual Insights
by: Hashmi, Ammarah, et al.
Published: (2024)
3DFacePolicy: Audio-Driven 3D Facial Animation Based on Action Control
by: Sha, Xuanmeng, et al.
Published: (2024)
SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing
by: Chen, Mingfei, et al.
Published: (2025)
JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music Generation
by: Yao, Yao, et al.
Published: (2023)
Multimodal Sentiment Analysis based on Video and Audio Inputs
by: Fernandez, Antonio, et al.
Published: (2024)
PF-D2M: A Pose-free Diffusion Model for Universal Dance-to-Music Generation
by: Im, Jaekwon, et al.
Published: (2026)
Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning
by: Zeng, Donghuo, et al.
Published: (2026)
DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression
by: Li, Bingzhou, et al.
Published: (2026)
EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control
by: Weng, Yuzhe, et al.
Published: (2026)
VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and Mandarin
by: Ai, Zhiqi, et al.
Published: (2025)