| Main Authors: | Ge, Chunjiang, Cheng, Sijie, Wang, Ziming, Yuan, Jiale, Gao, Yuan, Song, Jun, Song, Shiji, Huang, Gao, Zheng, Bo |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2405.15738 |
Similar Items
EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training
by: Wang, Yulin, et al.
Published: (2024)
Demystify Mamba in Vision: A Linear Attention Perspective
by: Han, Dongchen, et al.
Published: (2024)
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
by: Xu, Ruyi, et al.
Published: (2024)
Cross-Modal Adapter for Vision-Language Retrieval
by: Jiang, Haojun, et al.
Published: (2022)
GSVA: Generalized Segmentation via Multimodal Large Language Models
by: Xia, Zhuofan, et al.
Published: (2023)
Probabilistic Contrastive Learning for Long-Tailed Visual Recognition
by: Du, Chaoqun, et al.
Published: (2024)
BAPE: Learning an Explicit Bayes Classifier for Long-tailed Visual Recognition
by: Du, Chaoqun, et al.
Published: (2025)
HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model
by: Guo, Haiyang, et al.
Published: (2025)
AVG-LLaVA: An Efficient Large Multimodal Model with Adaptive Visual Granularity
by: Lan, Zhibin, et al.
Published: (2024)
LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
by: Zhang, Yipeng, et al.
Published: (2024)
Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment
by: Guo, Jiayi, et al.
Published: (2024)
LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
by: Cocchi, Federico, et al.
Published: (2025)
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
by: Li, Hongyu, et al.
Published: (2025)
VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance
by: Wang, Teng, et al.
Published: (2025)
PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training
by: Chen, Cong, et al.
Published: (2025)
WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image
by: Liang, Yuci, et al.
Published: (2024)
LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models
by: Zheng, Pengcheng, et al.
Published: (2026)
VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks
by: Chu, Xiangxiang, et al.
Published: (2024)
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
by: Xu, Guowei, et al.
Published: (2024)
LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
by: An, Xiang, et al.
Published: (2026)
SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition
by: Cheng, Zebang, et al.
Published: (2024)
iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
by: Hu, Lianyu, et al.
Published: (2024)
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
by: Cai, Mu, et al.
Published: (2023)
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
by: Zhang, Wenqiao, et al.
Published: (2024)
LLaVA-Critic: Learning to Evaluate Multimodal Models
by: Xiong, Tianyi, et al.
Published: (2024)
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
by: Zhou, Baichuan, et al.
Published: (2024)
Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs
by: Caffagni, Davide, et al.
Published: (2024)
Targeted Visualization of the Backbone of Encoder LLMs
by: Roberts, Isaac, et al.
Published: (2024)
Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models
by: Liu, Xuyang, et al.
Published: (2025)
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
by: Yue, Yang, et al.
Published: (2024)
LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating
by: Deng, Chao, et al.
Published: (2024)
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
by: Lin, Bin, et al.
Published: (2024)
LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
by: Sun, Shichu, et al.
Published: (2025)
LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?
by: Li, Bangyan, et al.
Published: (2025)
A Reinforcement-Learning-Based Multiple-Column Selection Strategy for Column Generation
by: Yuan, Haofeng, et al.
Published: (2023)
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
by: Ding, Xiaohan, et al.
Published: (2023)
Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models
by: Malik, Hashmat Shadab, et al.
Published: (2025)
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
by: Yuan, Zhengqing, et al.
Published: (2023)
Advancing Generalization in PINNs through Latent-Space Representations
by: Wang, Honghui, et al.
Published: (2024)
Meta-Semi: A Meta-learning Approach for Semi-supervised Learning
by: Wang, Yulin, et al.
Published: (2020)