:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Xiyao, Li, Chunyuan, Yang, Jianwei, Zhang, Kai, Liu, Bo, Xiong, Tianyi, Huang, Furong
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2509.00676

Similar Items

LLaVA-Critic: Learning to Evaluate Multimodal Models
by: Xiong, Tianyi, et al.
Published: (2024)

LLaVA-Video: Video Instruction Tuning With Synthetic Data
by: Zhang, Yuanhan, et al.
Published: (2024)

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
by: An, Xiang, et al.
Published: (2025)

MC-LLaVA: Multi-Concept Personalized Vision-Language Model
by: An, Ruichuan, et al.
Published: (2024)

LLaVA-OneVision: Easy Visual Task Transfer
by: Li, Bo, et al.
Published: (2024)

TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings
by: Yan, Dawei, et al.
Published: (2024)

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
by: Shu, Fangxun, et al.
Published: (2024)

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
by: Xu, Mingze, et al.
Published: (2024)

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
by: Li, Feng, et al.
Published: (2024)

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
by: Zhao, Xiangyu, et al.
Published: (2024)

MC-LLaVA: Multi-Concept Personalized Vision-Language Model
by: An, Ruichuan, et al.
Published: (2025)

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
by: Lin, Bin, et al.
Published: (2024)

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
by: An, Xiang, et al.
Published: (2026)

Cosmos-LLaVA: Chatting with the Visual (Cosmos-LLaVA: Görselle Sohbet Etmek)
by: Zeer, Ahmed, et al.
Published: (2024)

LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models
by: Zheng, Pengcheng, et al.
Published: (2026)

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
by: Xu, Jinjin, et al.
Published: (2023)

Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models
by: Cao, Meng, et al.
Published: (2024)

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations
by: Gao, Mingze, et al.
Published: (2024)

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models
by: Cai, Yuxuan, et al.
Published: (2024)

LLaVA-c: Continual Improved Visual Instruction Tuning
by: Liu, Wenzhuo, et al.
Published: (2025)

Enhance Image-to-Image Generation with LLaVA-generated Prompts
by: Ding, Zhicheng, et al.
Published: (2024)

Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding
by: Liang, Yongyuan, et al.
Published: (2025)

When LLaVA Meets Objects: Token Composition for Vision-Language-Models
by: Jahagirdar, Soumya, et al.
Published: (2026)

Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
by: Zhang, Yi-Fan, et al.
Published: (2024)

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
by: Ye, Xubing, et al.
Published: (2024)

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
by: Xu, Guowei, et al.
Published: (2024)

LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models
by: Zhang, Ruiyi, et al.
Published: (2024)

LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
by: Zhu, Yichen, et al.
Published: (2024)

LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
by: Li, Hongyu, et al.
Published: (2025)

R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest
by: Chen, Xupeng, et al.
Published: (2024)

LLaVA-SLT: Visual Language Tuning for Sign Language Translation
by: Liang, Han, et al.
Published: (2024)

LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model
by: Sun, Tao, et al.
Published: (2025)

HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models
by: Huang, Runhui, et al.
Published: (2024)

Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models
by: Zamini, Mohamad, et al.
Published: (2025)

Why do LLaVA Vision-Language Models Reply to Images in English?
by: Hinck, Musashi, et al.
Published: (2024)

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
by: Yuan, Haobo, et al.
Published: (2025)

WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image
by: Liang, Yuci, et al.
Published: (2024)

LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
by: Lou, Haoran, et al.
Published: (2025)

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
by: Lin, Bin, et al.
Published: (2023)

LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs
by: Shen, Leqi, et al.
Published: (2025)