| Main Authors: | Zhang, Xingjian, Weng, Xi, Yue, Yihao, Fan, Zhaoxin, Wu, Wenjun, Huang, Lei |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2501.15513 |
Similar Items
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
by: Zhang, Xingjian, et al.
Published: (2025)
EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use
by: Wen, Siwei, et al.
Published: (2026)
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
by: Shu, Fangxun, et al.
Published: (2024)
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs
by: Zhang, Zicheng, et al.
Published: (2024)
3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer
by: Deng, Jiajun, et al.
Published: (2025)
LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding
by: Zhou, Hanyu, et al.
Published: (2025)
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
by: Zhou, Baichuan, et al.
Published: (2024)
TinyLLaVA Factory: A Modularized Codebase for Small-scale Large Multimodal Models
by: Jia, Junlong, et al.
Published: (2024)
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
by: Yuan, Haobo, et al.
Published: (2025)
TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations
by: Gao, Mingze, et al.
Published: (2024)
LLaVA-Video: Video Instruction Tuning With Synthetic Data
by: Zhang, Yuanhan, et al.
Published: (2024)
LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs
by: Zhou, Hanyu, et al.
Published: (2025)
Text-Conditioned Resampler For Long Form Video Understanding
by: Korbar, Bruno, et al.
Published: (2023)
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
by: Xu, Lin, et al.
Published: (2024)
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
by: Zhu, Chenming, et al.
Published: (2024)
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
by: Sun, Boyuan, et al.
Published: (2025)
LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs
by: Shen, Leqi, et al.
Published: (2025)
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
by: Zhang, Boqiang, et al.
Published: (2025)
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
by: Xu, Mingze, et al.
Published: (2025)
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
by: Lin, Bin, et al.
Published: (2023)
LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval
by: Lu, Weiheng, et al.
Published: (2024)
LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning
by: Li, Jiajie, et al.
Published: (2024)
TennisExpert: Towards Expert-Level Analytical Sports Video Understanding
by: Liu, Zhaoyu, et al.
Published: (2026)
CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario
by: Duan, Zhizhao, et al.
Published: (2024)
IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs
by: Yamao, Sosuke, et al.
Published: (2024)
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
by: Fan, Yue, et al.
Published: (2024)
How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs
by: Khattak, Muhammad Uzair, et al.
Published: (2024)
VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges
by: Wang, Yuxuan, et al.
Published: (2024)
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs
by: Bharadwaj, Rohit, et al.
Published: (2024)
ViLLa: Video Reasoning Segmentation with Large Language Model
by: Zheng, Rongkun, et al.
Published: (2024)
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
by: Zhao, Xiangyu, et al.
Published: (2024)
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
by: Fan, Yue, et al.
Published: (2024)
End-to-End Training for Autoregressive Video Diffusion via Self-Resampling
by: Guo, Yuwei, et al.
Published: (2025)
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
by: Xu, Mingze, et al.
Published: (2024)
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
by: Cheng, Zesen, et al.
Published: (2024)
LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
by: Lou, Haoran, et al.
Published: (2025)
TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs
by: Wang, Juntong, et al.
Published: (2025)
LLaVA-Critic: Learning to Evaluate Multimodal Models
by: Xiong, Tianyi, et al.
Published: (2024)
Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models
by: Jin, Juseong, et al.
Published: (2024)
Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner
by: Shang, Yuzhang, et al.
Published: (2024)