Saved in:
| Main Authors: | Zhou, Yang-Hao, Li, Haitian, Lin, Rexar, Huang, Heyan, Zhou, Jinxing, Yuan, Changsen, Lan, Tian, Zhou, Ziqin, Li, Yudong, Xu, Jiajun, Liao, Jingyun, Cheng, Yi-Ming, Chen, Xuefeng, Mao, Xian-Ling, Feng, Yousheng |
|---|---|
| Format: | Preprint |
| Published: |
2026
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.00607 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Similar Items
MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation
by: Li, Haitian, et al.
Published: (2026)
by: Li, Haitian, et al.
Published: (2026)
Look, Listen and Segment: Towards Weakly Supervised Audio-visual Semantic Segmentation
by: Li, Chengzhi, et al.
Published: (2026)
by: Li, Chengzhi, et al.
Published: (2026)
CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization
by: Zhou, Jinxing, et al.
Published: (2025)
by: Zhou, Jinxing, et al.
Published: (2025)
Patch-level Sounding Object Tracking for Audio-Visual Question Answering
by: Li, Zhangbin, et al.
Published: (2024)
by: Li, Zhangbin, et al.
Published: (2024)
Label-anticipated Event Disentanglement for Audio-Visual Video Parsing
by: Zhou, Jinxing, et al.
Published: (2024)
by: Zhou, Jinxing, et al.
Published: (2024)
Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling
by: Zhou, Jinxing, et al.
Published: (2024)
by: Zhou, Jinxing, et al.
Published: (2024)
Towards Open-Vocabulary Audio-Visual Event Localization
by: Zhou, Jinxing, et al.
Published: (2024)
by: Zhou, Jinxing, et al.
Published: (2024)
Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing
by: Zhao, Pengcheng, et al.
Published: (2024)
by: Zhao, Pengcheng, et al.
Published: (2024)
DreamFoley: Scalable VLMs for High-Fidelity Video-to-Audio Generation
by: Li, Fu, et al.
Published: (2025)
by: Li, Fu, et al.
Published: (2025)
Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes
by: Feng, Zhou, et al.
Published: (2025)
by: Feng, Zhou, et al.
Published: (2025)
Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation
by: Zhou, Jinxing, et al.
Published: (2025)
by: Zhou, Jinxing, et al.
Published: (2025)
MusiCRS: Benchmarking Audio-Centric Conversational Recommendation
by: Surana, Rohan, et al.
Published: (2025)
by: Surana, Rohan, et al.
Published: (2025)
LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
by: Han, ZhaoYang, et al.
Published: (2025)
by: Han, ZhaoYang, et al.
Published: (2025)
From Natural Alignment to Conditional Controllability in Multimodal Dialogue
by: Jin, Zeyu, et al.
Published: (2026)
by: Jin, Zeyu, et al.
Published: (2026)
TAVGBench: Benchmarking Text to Audible-Video Generation
by: Mao, Yuxin, et al.
Published: (2024)
by: Mao, Yuxin, et al.
Published: (2024)
Audio-Visual Instance Segmentation
by: Guo, Ruohao, et al.
Published: (2023)
by: Guo, Ruohao, et al.
Published: (2023)
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
by: Sun, Luoyi, et al.
Published: (2026)
by: Sun, Luoyi, et al.
Published: (2026)
GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting
by: Cho, Kyusun, et al.
Published: (2024)
by: Cho, Kyusun, et al.
Published: (2024)
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
by: Ye, Zhen, et al.
Published: (2026)
by: Ye, Zhen, et al.
Published: (2026)
Audio-Language Models for Audio-Centric Tasks: A Systematic Survey
by: Su, Yi, et al.
Published: (2025)
by: Su, Yi, et al.
Published: (2025)
SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing
by: Xiong, Lingyu, et al.
Published: (2024)
by: Xiong, Lingyu, et al.
Published: (2024)
Listen, Look, Drive: Coupling Audio Instructions for User-aware VLA-based Autonomous Driving
by: Guo, Ziang, et al.
Published: (2026)
by: Guo, Ziang, et al.
Published: (2026)
TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs
by: Xu, Pengju, et al.
Published: (2025)
by: Xu, Pengju, et al.
Published: (2025)
Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation
by: Zhou, Jinxing, et al.
Published: (2026)
by: Zhou, Jinxing, et al.
Published: (2026)
OpFlowTalker: Realistic and Natural Talking Face Generation via Optical Flow Guidance
by: Ge, Shuheng, et al.
Published: (2024)
by: Ge, Shuheng, et al.
Published: (2024)
EEmo-Bench: A Benchmark for Multi-modal Large Language Models on Image Evoked Emotion Assessment
by: Gao, Lancheng, et al.
Published: (2025)
by: Gao, Lancheng, et al.
Published: (2025)
FastTalker: Jointly Generating Speech and Conversational Gestures from Text
by: Guo, Zixin, et al.
Published: (2024)
by: Guo, Zixin, et al.
Published: (2024)
Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis
by: Huang, Zhiqi, et al.
Published: (2024)
by: Huang, Zhiqi, et al.
Published: (2024)
Two-stage dynamic creative optimization under sparse ambiguous samples for e-commerce advertising
by: Li, Guandong, et al.
Published: (2023)
by: Li, Guandong, et al.
Published: (2023)
Autoregressive Image Generation with Linear Complexity: A Spatial-Aware Decay Perspective
by: Mao, Yuxin, et al.
Published: (2025)
by: Mao, Yuxin, et al.
Published: (2025)
MOS-FAD: Improving Fake Audio Detection Via Automatic Mean Opinion Score Prediction
by: Zhou, Wangjin, et al.
Published: (2024)
by: Zhou, Wangjin, et al.
Published: (2024)
Trusted Fake Audio Detection Based on Dirichlet Distribution
by: Ding, Chi, et al.
Published: (2025)
by: Ding, Chi, et al.
Published: (2025)
Towards Multimodal Emotional Support Conversation Systems
by: Chu, Yuqi, et al.
Published: (2024)
by: Chu, Yuqi, et al.
Published: (2024)
Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer
by: Lei, Ke, et al.
Published: (2026)
by: Lei, Ke, et al.
Published: (2026)
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
by: Cheng, Shihao, et al.
Published: (2026)
by: Cheng, Shihao, et al.
Published: (2026)
CLIPRerank: An Extremely Simple Method for Improving Ad-hoc Video Search
by: Chen, Aozhu, et al.
Published: (2024)
by: Chen, Aozhu, et al.
Published: (2024)
Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding
by: Pan, Zhaoyan, et al.
Published: (2026)
by: Pan, Zhaoyan, et al.
Published: (2026)
DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder
by: Du, Chenpeng, et al.
Published: (2023)
by: Du, Chenpeng, et al.
Published: (2023)
Audio Matters Too! Enhancing Markerless Motion Capture with Audio Signals for String Performance Capture
by: Jin, Yitong, et al.
Published: (2024)
by: Jin, Yitong, et al.
Published: (2024)
Exploring the Role of Audio in Multimodal Misinformation Detection
by: Liu, Moyang, et al.
Published: (2024)
by: Liu, Moyang, et al.
Published: (2024)
Similar Items
-
MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation
by: Li, Haitian, et al.
Published: (2026) -
Look, Listen and Segment: Towards Weakly Supervised Audio-visual Semantic Segmentation
by: Li, Chengzhi, et al.
Published: (2026) -
CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization
by: Zhou, Jinxing, et al.
Published: (2025) -
Patch-level Sounding Object Tracking for Audio-Visual Question Answering
by: Li, Zhangbin, et al.
Published: (2024) -
Label-anticipated Event Disentanglement for Audio-Visual Video Parsing
by: Zhou, Jinxing, et al.
Published: (2024)