Library Catalog

Bibliographic Details
Main Authors:	Ge, Chunjiang; Cheng, Sijie; Wang, Ziming; Yuan, Jiale; Gao, Yuan; Song, Jun; Song, Shiji; Huang, Gao; Zheng, Bo
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2405.15738

Similar Items

EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training
by: Wang, Yulin, et al.
Published: (2024)

Demystify Mamba in Vision: A Linear Attention Perspective
by: Han, Dongchen, et al.
Published: (2024)

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
by: Xu, Ruyi, et al.
Published: (2024)

Cross-Modal Adapter for Vision-Language Retrieval
by: Jiang, Haojun, et al.
Published: (2022)

GSVA: Generalized Segmentation via Multimodal Large Language Models
by: Xia, Zhuofan, et al.
Published: (2023)

Probabilistic Contrastive Learning for Long-Tailed Visual Recognition
by: Du, Chaoqun, et al.
Published: (2024)

BAPE: Learning an Explicit Bayes Classifier for Long-tailed Visual Recognition
by: Du, Chaoqun, et al.
Published: (2025)

HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model
by: Guo, Haiyang, et al.
Published: (2025)

AVG-LLaVA: An Efficient Large Multimodal Model with Adaptive Visual Granularity
by: Lan, Zhibin, et al.
Published: (2024)

LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
by: Zhang, Yipeng, et al.
Published: (2024)

Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment
by: Guo, Jiayi, et al.
Published: (2024)

LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
by: Cocchi, Federico, et al.
Published: (2025)

LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
by: Li, Hongyu, et al.
Published: (2025)

VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance
by: Wang, Teng, et al.
Published: (2025)

PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training
by: Chen, Cong, et al.
Published: (2025)

WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image
by: Liang, Yuci, et al.
Published: (2024)

LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models
by: Zheng, Pengcheng, et al.
Published: (2026)

VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks
by: Chu, Xiangxiang, et al.
Published: (2024)

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
by: Xu, Guowei, et al.
Published: (2024)

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
by: An, Xiang, et al.
Published: (2026)

SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition
by: Cheng, Zebang, et al.
Published: (2024)

iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
by: Hu, Lianyu, et al.
Published: (2024)

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
by: Cai, Mu, et al.
Published: (2023)

HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
by: Zhang, Wenqiao, et al.
Published: (2024)

LLaVA-Critic: Learning to Evaluate Multimodal Models
by: Xiong, Tianyi, et al.
Published: (2024)

TinyLLaVA: A Framework of Small-scale Large Multimodal Models
by: Zhou, Baichuan, et al.
Published: (2024)

Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs
by: Caffagni, Davide, et al.
Published: (2024)

Targeted Visualization of the Backbone of Encoder LLMs
by: Roberts, Isaac, et al.
Published: (2024)

Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models
by: Liu, Xuyang, et al.
Published: (2025)

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
by: Yue, Yang, et al.
Published: (2024)

LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating
by: Deng, Chao, et al.
Published: (2024)

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
by: Lin, Bin, et al.
Published: (2024)

LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
by: Sun, Shichu, et al.
Published: (2025)

LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?
by: Li, Bangyan, et al.
Published: (2025)

A Reinforcement-Learning-Based Multiple-Column Selection Strategy for Column Generation
by: Yuan, Haofeng, et al.
Published: (2023)

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
by: Ding, Xiaohan, et al.
Published: (2023)

Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models
by: Malik, Hashmat Shadab, et al.
Published: (2025)

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
by: Yuan, Zhengqing, et al.
Published: (2023)

Advancing Generalization in PINNs through Latent-Space Representations
by: Wang, Honghui, et al.
Published: (2024)

Meta-Semi: A Meta-learning Approach for Semi-supervised Learning
by: Wang, Yulin, et al.
Published: (2020)