| Main Authors: | Wang, Xiyao, Li, Chunyuan, Yang, Jianwei, Zhang, Kai, Liu, Bo, Xiong, Tianyi, Huang, Furong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.00676 |
Similar Items
LLaVA-Critic: Learning to Evaluate Multimodal Models
by: Xiong, Tianyi, et al.
Published: (2024)
LLaVA-Video: Video Instruction Tuning With Synthetic Data
by: Zhang, Yuanhan, et al.
Published: (2024)
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
by: An, Xiang, et al.
Published: (2025)
MC-LLaVA: Multi-Concept Personalized Vision-Language Model
by: An, Ruichuan, et al.
Published: (2024)
LLaVA-OneVision: Easy Visual Task Transfer
by: Li, Bo, et al.
Published: (2024)
TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings
by: Yan, Dawei, et al.
Published: (2024)
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
by: Shu, Fangxun, et al.
Published: (2024)
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
by: Xu, Mingze, et al.
Published: (2024)
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
by: Li, Feng, et al.
Published: (2024)
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
by: Zhao, Xiangyu, et al.
Published: (2024)
MC-LLaVA: Multi-Concept Personalized Vision-Language Model
by: An, Ruichuan, et al.
Published: (2025)
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
by: Lin, Bin, et al.
Published: (2024)
LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
by: An, Xiang, et al.
Published: (2026)
Cosmos-LLaVA: Chatting with the Visual = Cosmos-LLaVA: Görselle Sohbet Etmek
by: Zeer, Ahmed, et al.
Published: (2024)
LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models
by: Zheng, Pengcheng, et al.
Published: (2026)
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
by: Xu, Jinjin, et al.
Published: (2023)
Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models
by: Cao, Meng, et al.
Published: (2024)
TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations
by: Gao, Mingze, et al.
Published: (2024)
LLaVA-KD: A Framework of Distilling Multimodal Large Language Models
by: Cai, Yuxuan, et al.
Published: (2024)
LLaVA-c: Continual Improved Visual Instruction Tuning
by: Liu, Wenzhuo, et al.
Published: (2025)
Enhance Image-to-Image Generation with LLaVA-generated Prompts
by: Ding, Zhicheng, et al.
Published: (2024)
Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding
by: Liang, Yongyuan, et al.
Published: (2025)
When LLaVA Meets Objects: Token Composition for Vision-Language-Models
by: Jahagirdar, Soumya, et al.
Published: (2026)
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
by: Zhang, Yi-Fan, et al.
Published: (2024)
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
by: Ye, Xubing, et al.
Published: (2024)
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
by: Xu, Guowei, et al.
Published: (2024)
LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models
by: Zhang, Ruiyi, et al.
Published: (2024)
LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
by: Zhu, Yichen, et al.
Published: (2024)
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
by: Li, Hongyu, et al.
Published: (2025)
R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest
by: Chen, Xupeng, et al.
Published: (2024)
LLaVA-SLT: Visual Language Tuning for Sign Language Translation
by: Liang, Han, et al.
Published: (2024)
LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model
by: Sun, Tao, et al.
Published: (2025)
HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models
by: Huang, Runhui, et al.
Published: (2024)
Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models
by: Zamini, Mohamad, et al.
Published: (2025)
Why do LLaVA Vision-Language Models Reply to Images in English?
by: Hinck, Musashi, et al.
Published: (2024)
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
by: Yuan, Haobo, et al.
Published: (2025)
WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image
by: Liang, Yuci, et al.
Published: (2024)
LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
by: Lou, Haoran, et al.
Published: (2025)
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
by: Lin, Bin, et al.
Published: (2023)
LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs
by: Shen, Leqi, et al.
Published: (2025)