| Main Authors: | Zhang, Xingjian, Wen, Siwei, Wu, Wenjun, Huang, Lei |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.09641 |
Similar Items
TinyLLaVA-Video: Towards Smaller LMMs for Video Understanding with Group Resampler
by: Zhang, Xingjian, et al.
Published: (2025)
EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use
by: Wen, Siwei, et al.
Published: (2026)
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
by: Shu, Fangxun, et al.
Published: (2024)
3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer
by: Deng, Jiajun, et al.
Published: (2025)
LLaVA-Video: Video Instruction Tuning With Synthetic Data
by: Zhang, Yuanhan, et al.
Published: (2024)
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs
by: Zhang, Zicheng, et al.
Published: (2024)
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
by: Zhu, Chenming, et al.
Published: (2024)
LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding
by: Zhou, Hanyu, et al.
Published: (2025)
How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs
by: Khattak, Muhammad Uzair, et al.
Published: (2024)
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
by: Xu, Lin, et al.
Published: (2024)
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
by: Yuan, Haobo, et al.
Published: (2025)
LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs
by: Shen, Leqi, et al.
Published: (2025)
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
by: Lin, Bin, et al.
Published: (2023)
LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval
by: Lu, Weiheng, et al.
Published: (2024)
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos
by: Zhang, Jianrui, et al.
Published: (2024)
Video-R1: Reinforcing Video Reasoning in MLLMs
by: Feng, Kaituo, et al.
Published: (2025)
LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning
by: Li, Jiajie, et al.
Published: (2024)
ViLLa: Video Reasoning Segmentation with Large Language Model
by: Zheng, Rongkun, et al.
Published: (2024)
Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought
by: Huang, Chao, et al.
Published: (2025)
MMSearch-R1: Incentivizing LMMs to Search
by: Wu, Jinming, et al.
Published: (2025)
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs
by: Bharadwaj, Rohit, et al.
Published: (2024)
LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs
by: Zhou, Hanyu, et al.
Published: (2025)
ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos
by: Vuong, Trinh T. L., et al.
Published: (2025)
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
by: Zhao, Xiangyu, et al.
Published: (2024)
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
by: Zhou, Baichuan, et al.
Published: (2024)
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
by: Xu, Mingze, et al.
Published: (2025)
Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner
by: Shang, Yuzhang, et al.
Published: (2024)
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
by: Xu, Guowei, et al.
Published: (2024)
LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
by: An, Xiang, et al.
Published: (2026)
AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs
by: Xia, Shuhan, et al.
Published: (2025)
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
by: Xu, Mingze, et al.
Published: (2024)
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
by: Zhang, Tao, et al.
Published: (2024)
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
by: Zhang, Shaolei, et al.
Published: (2025)
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
by: Sun, Boyuan, et al.
Published: (2025)
TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations
by: Gao, Mingze, et al.
Published: (2024)
TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs
by: Wang, Juntong, et al.
Published: (2025)
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
by: Zhang, Boqiang, et al.
Published: (2025)
TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings
by: Yan, Dawei, et al.
Published: (2024)
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
by: Zhang, Yi-Fan, et al.
Published: (2024)
VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs
by: Li, Can, et al.
Published: (2025)