:: Library Catalog

Bibliographic Details
Main Authors:	Zhu, Yuke; Xie, Chi; Liang, Shuang; Zheng, Bo; Guo, Sheng
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2411.14228

Similar Items

LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
by: Sun, Shichu, et al.
Published: (2025)

LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information
by: Wang, Ke, et al.
Published: (2024)

LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models
by: Zheng, Pengcheng, et al.
Published: (2026)

CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario
by: Duan, Zhizhao, et al.
Published: (2024)

LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
by: Lou, Haoran, et al.
Published: (2025)

LLaVA-c: Continual Improved Visual Instruction Tuning
by: Liu, Wenzhuo, et al.
Published: (2025)

ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
by: Ge, Chunjiang, et al.
Published: (2024)

Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models
by: Zamini, Mohamad, et al.
Published: (2025)

TokenSeg: Efficient 3D Medical Image Segmentation via Hierarchical Visual Token Compression
by: Zeng, Sen, et al.
Published: (2026)

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
by: Tian, Kaibin, et al.
Published: (2024)

LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
by: Sun, Boyuan, et al.
Published: (2025)

LLaVA-SLT: Visual Language Tuning for Sign Language Translation
by: Liang, Han, et al.
Published: (2024)

LLaVA-OneVision: Easy Visual Task Transfer
by: Li, Bo, et al.
Published: (2024)

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
by: Fang, Kechen, et al.
Published: (2026)

Efficient Token Pruning for LLaDA-V
by: Wan, Zhewen, et al.
Published: (2026)

Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction
by: Guo, Ziyao, et al.
Published: (2025)

When LLaVA Meets Objects: Token Composition for Vision-Language-Models
by: Jahagirdar, Soumya, et al.
Published: (2026)

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
by: Ye, Xubing, et al.
Published: (2024)

EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models
by: Liang, Yinan, et al.
Published: (2025)

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
by: Lin, Bin, et al.
Published: (2023)

Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning
by: Tong, Enwei, et al.
Published: (2026)

PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training
by: Chen, Cong, et al.
Published: (2025)

iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
by: Hu, Lianyu, et al.
Published: (2024)

VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
by: Zhu, Jiaying, et al.
Published: (2025)

TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
by: Qu, Tingyu, et al.
Published: (2024)

HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition
by: Wu, Qian, et al.
Published: (2024)

VASparse: Towards Efficient Visual Hallucination Mitigation via Visual-Aware Token Sparsification
by: Zhuang, Xianwei, et al.
Published: (2025)

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
by: Shang, Yuzhang, et al.
Published: (2024)

Cosmos-LLaVA: Chatting with the Visual Cosmos-LLaVA: Görselle Sohbet Etmek
by: Zeer, Ahmed, et al.
Published: (2024)

GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing
by: Elgendy, Hosam, et al.
Published: (2024)

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
by: Zhao, Xiangyu, et al.
Published: (2024)

LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
by: Zhu, Yichen, et al.
Published: (2024)

DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding
by: Zhang, Hongzhi, et al.
Published: (2025)

LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
by: Zhang, Yipeng, et al.
Published: (2024)

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
by: Shu, Fangxun, et al.
Published: (2024)

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
by: Zhang, Shaolei, et al.
Published: (2025)

An Efficient Token Compression Framework for Visual Object Tracking
by: Wu, Weijing, et al.
Published: (2026)

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
by: Zhu, Chenming, et al.
Published: (2024)

AVG-LLaVA: An Efficient Large Multimodal Model with Adaptive Visual Granularity
by: Lan, Zhibin, et al.
Published: (2024)

FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants
by: Bhosale, Mahesh, et al.
Published: (2026)