:: Library Catalog

Bibliographic Details
Main Authors:	Tang, Yolo Y.; Liu, Pinxin; Tan, Zhangyun; Feng, Mingqian; Mao, Rui; Huang, Chao; Bi, Jing; Xiao, Yunzhong; Liang, Susan; Hua, Hang; Vosoughi, Ali; Song, Luchuan; Zhang, Zeliang; Xu, Chenliang
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2505.20426

Similar Items

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning
by: Tan, Zhangyun, et al.
Published: (2026)

VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
by: Tang, Yolo Y., et al.
Published: (2024)

Generative AI for Cel-Animation: A Survey
by: Tang, Yolo Y., et al.
Published: (2025)

Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding
by: Tang, Yolo Yunlong, et al.
Published: (2024)

Video Understanding with Large Language Models: A Survey
by: Tang, Yolo Y., et al.
Published: (2023)

Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
by: Tang, Yunlong, et al.
Published: (2025)

Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
by: Tang, Yolo Y., et al.
Published: (2025)

TDMM-LM: Bridging Facial Understanding and Animation via Language Models
by: Song, Luchuan, et al.
Published: (2026)

Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?
by: Liang, Susan, et al.
Published: (2026)

Adaptive Super Resolution For One-Shot Talking-Head Generation
by: Song, Luchuan, et al.
Published: (2024)

Can Sound Replace Vision in LLaVA With Token Substitution?
by: Vosoughi, Ali, et al.
Published: (2025)

Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning?
by: Feng, Mingqian, et al.
Published: (2024)

VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
by: Bi, Jing, et al.
Published: (2025)

Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
by: Tang, Yolo Y., et al.
Published: (2025)

GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling
by: Liu, Pinxin, et al.
Published: (2025)

Tri$^{2}$-plane: Thinking Head Avatar via Feature Pyramid
by: Song, Luchuan, et al.
Published: (2024)

TextToon: Real-Time Text Toonify Head Avatar from Single Video
by: Song, Luchuan, et al.
Published: (2024)

$I^2G$: Generating Instructional Illustrations via Text-Conditioned Diffusion
by: Bi, Jing, et al.
Published: (2025)

Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives
by: Zhang, Zeliang, et al.
Published: (2025)

EAGLE: Egocentric AGgregated Language-video Engine
by: Bi, Jing, et al.
Published: (2024)

GaussianStyle: Gaussian Head Avatar via StyleGAN
by: Liu, Pinxin, et al.
Published: (2024)

Will the Inclusion of Generated Data Amplify Bias Across Generations in Future Image Classification Models?
by: Zhang, Zeliang, et al.
Published: (2024)

Intentional Gesture: Deliver Your Intentions with Gestures for Speech
by: Liu, Pinxin, et al.
Published: (2025)

OPENXRD: A Comprehensive Benchmark Framework for LLM/MLLM XRD Question Answering
by: Vosoughi, Ali, et al.
Published: (2025)

Discover and Mitigate Multiple Biased Subgroups in Image Classifiers
by: Zhang, Zeliang, et al.
Published: (2024)

Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP
by: Zhang, Zeliang, et al.
Published: (2024)

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
by: Hua, Hang, et al.
Published: (2024)

Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)
by: Bi, Jing, et al.
Published: (2025)

Forward Learning with Differential Privacy
by: Feng, Mingqian, et al.
Published: (2025)

When to Think and When to Look: Uncertainty-Guided Lookback
by: Bi, Jing, et al.
Published: (2025)

Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward
by: Bi, Jing, et al.
Published: (2025)

Forward Learning for Gradient-based Black-box Saliency Map Generation
by: Zhang, Zeliang, et al.
Published: (2024)

CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion
by: Tang, Yolo Yunlong, et al.
Published: (2024)

Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision
by: Li, Chentao, et al.
Published: (2026)

What to Do Next? Memorizing skills from Egocentric Instructional Video
by: Bi, Jing, et al.
Published: (2025)

Learning to Transform Dynamically for Better Adversarial Transferability
by: Zhu, Rongyi, et al.
Published: (2024)

MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
by: Yuan, Jiakang, et al.
Published: (2025)

Harnessing the Computation Redundancy in ViTs to Boost Adversarial Transferability
by: Liu, Jiani, et al.
Published: (2025)

OSCaR: Object State Captioning and State Change Representation
by: Nguyen, Nguyen, et al.
Published: (2024)

Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
by: Chen, Yuangong, et al.
Published: (2026)