| Main Authors: | Zhu, Yuke, Xie, Chi, Liang, Shuang, Zheng, Bo, Guo, Sheng |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2411.14228 |
Similar Items
LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
by: Sun, Shichu, et al.
Published: (2025)
LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information
by: Wang, Ke, et al.
Published: (2024)
LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models
by: Zheng, Pengcheng, et al.
Published: (2026)
CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario
by: Duan, Zhizhao, et al.
Published: (2024)
LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
by: Lou, Haoran, et al.
Published: (2025)
LLaVA-c: Continual Improved Visual Instruction Tuning
by: Liu, Wenzhuo, et al.
Published: (2025)
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
by: Ge, Chunjiang, et al.
Published: (2024)
Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models
by: Zamini, Mohamad, et al.
Published: (2025)
TokenSeg: Efficient 3D Medical Image Segmentation via Hierarchical Visual Token Compression
by: Zeng, Sen, et al.
Published: (2026)
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
by: Tian, Kaibin, et al.
Published: (2024)
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
by: Sun, Boyuan, et al.
Published: (2025)
LLaVA-SLT: Visual Language Tuning for Sign Language Translation
by: Liang, Han, et al.
Published: (2024)
LLaVA-OneVision: Easy Visual Task Transfer
by: Li, Bo, et al.
Published: (2024)
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
by: Fang, Kechen, et al.
Published: (2026)
Efficient Token Pruning for LLaDA-V
by: Wan, Zhewen, et al.
Published: (2026)
Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction
by: Guo, Ziyao, et al.
Published: (2025)
When LLaVA Meets Objects: Token Composition for Vision-Language-Models
by: Jahagirdar, Soumya, et al.
Published: (2026)
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
by: Ye, Xubing, et al.
Published: (2024)
EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models
by: Liang, Yinan, et al.
Published: (2025)
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
by: Lin, Bin, et al.
Published: (2023)
Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning
by: Tong, Enwei, et al.
Published: (2026)
PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training
by: Chen, Cong, et al.
Published: (2025)
iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
by: Hu, Lianyu, et al.
Published: (2024)
VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
by: Zhu, Jiaying, et al.
Published: (2025)
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
by: Qu, Tingyu, et al.
Published: (2024)
HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition
by: Wu, Qian, et al.
Published: (2024)
VASparse: Towards Efficient Visual Hallucination Mitigation via Visual-Aware Token Sparsification
by: Zhuang, Xianwei, et al.
Published: (2025)
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
by: Shang, Yuzhang, et al.
Published: (2024)
Cosmos-LLaVA: Chatting with the Visual Cosmos-LLaVA: Görselle Sohbet Etmek
by: Zeer, Ahmed, et al.
Published: (2024)
GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing
by: Elgendy, Hosam, et al.
Published: (2024)
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
by: Zhao, Xiangyu, et al.
Published: (2024)
LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
by: Zhu, Yichen, et al.
Published: (2024)
DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding
by: Zhang, Hongzhi, et al.
Published: (2025)
LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
by: Zhang, Yipeng, et al.
Published: (2024)
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
by: Shu, Fangxun, et al.
Published: (2024)
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
by: Zhang, Shaolei, et al.
Published: (2025)
An Efficient Token Compression Framework for Visual Object Tracking
by: Wu, Weijing, et al.
Published: (2026)
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
by: Zhu, Chenming, et al.
Published: (2024)
AVG-LLaVA: An Efficient Large Multimodal Model with Adaptive Visual Granularity
by: Lan, Zhibin, et al.
Published: (2024)
FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants
by: Bhosale, Mahesh, et al.
Published: (2026)