:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Zhang, Xingjian, Weng, Xi, Yue, Yihao, Fan, Zhaoxin, Wu, Wenjun, Huang, Lei
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2501.15513
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
by: Zhang, Xingjian, et al.
Published: (2025)

EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use
by: Wen, Siwei, et al.
Published: (2026)

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
by: Shu, Fangxun, et al.
Published: (2024)

Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs
by: Zhang, Zicheng, et al.
Published: (2024)

3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer
by: Deng, Jiajun, et al.
Published: (2025)

LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding
by: Zhou, Hanyu, et al.
Published: (2025)

TinyLLaVA: A Framework of Small-scale Large Multimodal Models
by: Zhou, Baichuan, et al.
Published: (2024)

TinyLLaVA Factory: A Modularized Codebase for Small-scale Large Multimodal Models
by: Jia, Junlong, et al.
Published: (2024)

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
by: Yuan, Haobo, et al.
Published: (2025)

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations
by: Gao, Mingze, et al.
Published: (2024)

LLaVA-Video: Video Instruction Tuning With Synthetic Data
by: Zhang, Yuanhan, et al.
Published: (2024)

LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs
by: Zhou, Hanyu, et al.
Published: (2025)

Text-Conditioned Resampler For Long Form Video Understanding
by: Korbar, Bruno, et al.
Published: (2023)

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
by: Xu, Lin, et al.
Published: (2024)

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
by: Zhu, Chenming, et al.
Published: (2024)

LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
by: Sun, Boyuan, et al.
Published: (2025)

LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs
by: Shen, Leqi, et al.
Published: (2025)

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
by: Zhang, Boqiang, et al.
Published: (2025)

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
by: Xu, Mingze, et al.
Published: (2025)

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
by: Lin, Bin, et al.
Published: (2023)

LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval
by: Lu, Weiheng, et al.
Published: (2024)

LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning
by: Li, Jiajie, et al.
Published: (2024)

TennisExpert: Towards Expert-Level Analytical Sports Video Understanding
by: Liu, Zhaoyu, et al.
Published: (2026)

CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario
by: Duan, Zhizhao, et al.
Published: (2024)

IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs
by: Yamao, Sosuke, et al.
Published: (2024)

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
by: Fan, Yue, et al.
Published: (2024)

How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs
by: Khattak, Muhammad Uzair, et al.
Published: (2024)

VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges
by: Wang, Yuxuan, et al.
Published: (2024)

VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs
by: Bharadwaj, Rohit, et al.
Published: (2024)

ViLLa: Video Reasoning Segmentation with Large Language Model
by: Zheng, Rongkun, et al.
Published: (2024)

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
by: Zhao, Xiangyu, et al.
Published: (2024)

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
by: Fan, Yue, et al.
Published: (2024)

End-to-End Training for Autoregressive Video Diffusion via Self-Resampling
by: Guo, Yuwei, et al.
Published: (2025)

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
by: Xu, Mingze, et al.
Published: (2024)

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
by: Cheng, Zesen, et al.
Published: (2024)

LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
by: Lou, Haoran, et al.
Published: (2025)

TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs
by: Wang, Juntong, et al.
Published: (2025)

LLaVA-Critic: Learning to Evaluate Multimodal Models
by: Xiong, Tianyi, et al.
Published: (2024)

Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models
by: Jin, Juseong, et al.
Published: (2024)

Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner
by: Shang, Yuzhang, et al.
Published: (2024)