| Main Authors: | Tang, Yolo Y., Liu, Pinxin, Tan, Zhangyun, Feng, Mingqian, Mao, Rui, Huang, Chao, Bi, Jing, Xiao, Yunzhong, Liang, Susan, Hua, Hang, Vosoughi, Ali, Song, Luchuan, Zhang, Zeliang, Xu, Chenliang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.20426 |
Similar Items
Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning
by: Tan, Zhangyun, et al.
Published: (2026)
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
by: Tang, Yolo Y., et al.
Published: (2024)
Generative AI for Cel-Animation: A Survey
by: Tang, Yolo Y., et al.
Published: (2025)
Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding
by: Tang, Yolo Yunlong, et al.
Published: (2024)
Video Understanding with Large Language Models: A Survey
by: Tang, Yolo Y., et al.
Published: (2023)
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
by: Tang, Yunlong, et al.
Published: (2025)
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
by: Tang, Yolo Y., et al.
Published: (2025)
TDMM-LM: Bridging Facial Understanding and Animation via Language Models
by: Song, Luchuan, et al.
Published: (2026)
Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?
by: Liang, Susan, et al.
Published: (2026)
Adaptive Super Resolution For One-Shot Talking-Head Generation
by: Song, Luchuan, et al.
Published: (2024)
Can Sound Replace Vision in LLaVA With Token Substitution?
by: Vosoughi, Ali, et al.
Published: (2025)
Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning?
by: Feng, Mingqian, et al.
Published: (2024)
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
by: Bi, Jing, et al.
Published: (2025)
Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
by: Tang, Yolo Y., et al.
Published: (2025)
GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling
by: Liu, Pinxin, et al.
Published: (2025)
Tri$^{2}$-plane: Thinking Head Avatar via Feature Pyramid
by: Song, Luchuan, et al.
Published: (2024)
TextToon: Real-Time Text Toonify Head Avatar from Single Video
by: Song, Luchuan, et al.
Published: (2024)
$I^2G$: Generating Instructional Illustrations via Text-Conditioned Diffusion
by: Bi, Jing, et al.
Published: (2025)
Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives
by: Zhang, Zeliang, et al.
Published: (2025)
EAGLE: Egocentric AGgregated Language-video Engine
by: Bi, Jing, et al.
Published: (2024)
GaussianStyle: Gaussian Head Avatar via StyleGAN
by: Liu, Pinxin, et al.
Published: (2024)
Will the Inclusion of Generated Data Amplify Bias Across Generations in Future Image Classification Models?
by: Zhang, Zeliang, et al.
Published: (2024)
Intentional Gesture: Deliver Your Intentions with Gestures for Speech
by: Liu, Pinxin, et al.
Published: (2025)
OPENXRD: A Comprehensive Benchmark Framework for LLM/MLLM XRD Question Answering
by: Vosoughi, Ali, et al.
Published: (2025)
Discover and Mitigate Multiple Biased Subgroups in Image Classifiers
by: Zhang, Zeliang, et al.
Published: (2024)
Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP
by: Zhang, Zeliang, et al.
Published: (2024)
V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
by: Hua, Hang, et al.
Published: (2024)
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)
by: Bi, Jing, et al.
Published: (2025)
Forward Learning with Differential Privacy
by: Feng, Mingqian, et al.
Published: (2025)
When to Think and When to Look: Uncertainty-Guided Lookback
by: Bi, Jing, et al.
Published: (2025)
Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward
by: Bi, Jing, et al.
Published: (2025)
Forward Learning for Gradient-based Black-box Saliency Map Generation
by: Zhang, Zeliang, et al.
Published: (2024)
CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion
by: Tang, Yolo Yunlong, et al.
Published: (2024)
Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision
by: Li, Chentao, et al.
Published: (2026)
What to Do Next? Memorizing skills from Egocentric Instructional Video
by: Bi, Jing, et al.
Published: (2025)
Learning to Transform Dynamically for Better Adversarial Transferability
by: Zhu, Rongyi, et al.
Published: (2024)
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
by: Yuan, Jiakang, et al.
Published: (2025)
Harnessing the Computation Redundancy in ViTs to Boost Adversarial Transferability
by: Liu, Jiani, et al.
Published: (2025)
OSCaR: Object State Captioning and State Change Representation
by: Nguyen, Nguyen, et al.
Published: (2024)
Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
by: Chen, Yuangong, et al.
Published: (2026)