:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Xingjian; Wen, Siwei; Wu, Wenjun; Huang, Lei
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2504.09641

Similar Items

TinyLLaVA-Video: Towards Smaller LMMs for Video Understanding with Group Resampler
by: Zhang, Xingjian, et al.
Published: (2025)

EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use
by: Wen, Siwei, et al.
Published: (2026)

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
by: Shu, Fangxun, et al.
Published: (2024)

3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer
by: Deng, Jiajun, et al.
Published: (2025)

LLaVA-Video: Video Instruction Tuning With Synthetic Data
by: Zhang, Yuanhan, et al.
Published: (2024)

Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs
by: Zhang, Zicheng, et al.
Published: (2024)

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
by: Zhu, Chenming, et al.
Published: (2024)

LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding
by: Zhou, Hanyu, et al.
Published: (2025)

How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs
by: Khattak, Muhammad Uzair, et al.
Published: (2024)

PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
by: Xu, Lin, et al.
Published: (2024)

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
by: Yuan, Haobo, et al.
Published: (2025)

LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs
by: Shen, Leqi, et al.
Published: (2025)

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
by: Lin, Bin, et al.
Published: (2023)

LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval
by: Lu, Weiheng, et al.
Published: (2024)

Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos
by: Zhang, Jianrui, et al.
Published: (2024)

Video-R1: Reinforcing Video Reasoning in MLLMs
by: Feng, Kaituo, et al.
Published: (2025)

LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning
by: Li, Jiajie, et al.
Published: (2024)

ViLLa: Video Reasoning Segmentation with Large Language Model
by: Zheng, Rongkun, et al.
Published: (2024)

Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought
by: Huang, Chao, et al.
Published: (2025)

MMSearch-R1: Incentivizing LMMs to Search
by: Wu, Jinming, et al.
Published: (2025)

VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs
by: Bharadwaj, Rohit, et al.
Published: (2024)

LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs
by: Zhou, Hanyu, et al.
Published: (2025)

ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos
by: Vuong, Trinh T. L., et al.
Published: (2025)

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
by: Zhao, Xiangyu, et al.
Published: (2024)

TinyLLaVA: A Framework of Small-scale Large Multimodal Models
by: Zhou, Baichuan, et al.
Published: (2024)

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
by: Xu, Mingze, et al.
Published: (2025)

Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner
by: Shang, Yuzhang, et al.
Published: (2024)

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
by: Xu, Guowei, et al.
Published: (2024)

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
by: An, Xiang, et al.
Published: (2026)

AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs
by: Xia, Shuhan, et al.
Published: (2025)

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
by: Xu, Mingze, et al.
Published: (2024)

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
by: Zhang, Tao, et al.
Published: (2024)

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
by: Zhang, Shaolei, et al.
Published: (2025)

LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
by: Sun, Boyuan, et al.
Published: (2025)

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations
by: Gao, Mingze, et al.
Published: (2024)

TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs
by: Wang, Juntong, et al.
Published: (2025)

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
by: Zhang, Boqiang, et al.
Published: (2025)

TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings
by: Yan, Dawei, et al.
Published: (2024)

Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
by: Zhang, Yi-Fan, et al.
Published: (2024)

VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs
by: Li, Can, et al.
Published: (2025)