| Main Authors: | Liu, Xiaochuan, Cheng, Xin, Sun, Yuchong, Wu, Xiaoxue, Song, Ruihua, Sun, Hao, Zhang, Denghao |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.20858 |
Similar Items
JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation
by: Chao, Jianghan, et al.
Published: (2025)
LoVA: Long-form Video-to-Audio Generation
by: Cheng, Xin, et al.
Published: (2024)
VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task
by: Wang, Yuyue, et al.
Published: (2025)
ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation
by: Sun, Jiahui, et al.
Published: (2025)
3MDiT: Unified Tri-Modal Diffusion Transformer for Text-Driven Synchronized Audio-Video Generation
by: Li, Yaoru, et al.
Published: (2025)
Routing Experts: Learning to Route Dynamic Experts in Multi-modal Large Language Models
by: Wu, Qiong, et al.
Published: (2024)
SRA: Semantic Relation-Aware Flowchart Question Answering
by: Li, Xinyu, et al.
Published: (2026)
DiffCL: A Diffusion-Based Contrastive Learning Framework with Semantic Alignment for Multimodal Recommendations
by: Song, Qiya, et al.
Published: (2025)
AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction
by: Chen, Zixuan, et al.
Published: (2026)
FGAS: Fixed Decoder Network-Based Audio Steganography with Adversarial Perturbation Generation
by: Yan, Jialin, et al.
Published: (2025)
SentiAvatar: Towards Expressive and Interactive Digital Humans
by: Jin, Chuhao, et al.
Published: (2026)
Audio Matters Too! Enhancing Markerless Motion Capture with Audio Signals for String Performance Capture
by: Jin, Yitong, et al.
Published: (2024)
Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
by: Liao, Junchao, et al.
Published: (2026)
Unsupervised Ego- and Exo-centric Dense Procedural Activity Captioning via Gaze Consensus Adaptation
by: Shi, Zhaofeng, et al.
Published: (2025)
Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer
by: Lei, Ke, et al.
Published: (2026)
MixFake: Benchmarking and Enhancing Audio Deepfake Detection in Diverse Real-world Mixed Audio
by: Li, Qingcao, et al.
Published: (2026)
Human-Inspired Computing for Robust and Efficient Audio-Visual Speech Recognition
by: Liu, Qianhui, et al.
Published: (2024)
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
by: Sun, Luoyi, et al.
Published: (2026)
Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models
by: Chen, Yiming, et al.
Published: (2024)
A 3D-Cascading Crossing Coupling Framework for Hyperchaotic Map Construction and Its Application to Color Image Encryption
by: Sun, Jilei, et al.
Published: (2025)
Trusted Fake Audio Detection Based on Dirichlet Distribution
by: Ding, Chi, et al.
Published: (2025)
Sec2Sec Co-attention for Video-Based Apparent Affective Prediction
by: Sun, Mingwei, et al.
Published: (2024)
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
by: Sun, Luoyi, et al.
Published: (2023)
Robust Latent Representation Tuning for Image-text Classification
by: Sun, Hao, et al.
Published: (2024)
Beyond Audio and Pose: A General-Purpose Framework for Video Synchronization
by: Shin, Yosub, et al.
Published: (2025)
XGC-AVis: Towards Audio-Visual Content Understanding with a Multi-Agent Collaborative System
by: Cao, Yuqin, et al.
Published: (2025)
Compressed Deepfake Video Detection Based on 3D Spatiotemporal Trajectories
by: Chen, Zongmei, et al.
Published: (2024)
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
by: Cheng, Shihao, et al.
Published: (2026)
Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning
by: Wu, Shu, et al.
Published: (2025)
Manipulated Regions Localization For Partially Deepfake Audio: A Survey
by: He, Jiayi, et al.
Published: (2025)
PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks
by: Wu, Jianyu, et al.
Published: (2025)
Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline
by: Yang, Dingyi, et al.
Published: (2024)
Spatial-Temporal Human-Object Interaction Detection
by: Sun, Xu, et al.
Published: (2025)
EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos
by: Rai, Aashish, et al.
Published: (2024)
EmpathyEar: An Open-source Avatar Multimodal Empathetic Chatbot
by: Fei, Hao, et al.
Published: (2024)
AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers
by: Guan, Jiazhi, et al.
Published: (2025)
Iterative Residual Cross-Attention Mechanism: An Integrated Approach for Audio-Visual Navigation Tasks
by: Zhang, Hailong, et al.
Published: (2025)
Audio-Guided Visual Perception for Audio-Visual Navigation
by: Wang, Yi, et al.
Published: (2025)
ChronusOmni: Improving Time Awareness of Omni Large Language Models
by: Chen, Yijing, et al.
Published: (2025)
Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning
by: Xu, Xinmeng, et al.
Published: (2026)