| Main Authors: | Li, Bozhou, Liang, Hao, Meng, Zimo, Zhang, Wentao |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2408.00620 |
Similar Items
ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models
by: Li, Bozhou, et al.
Published: (2025)
SynthVLM: Towards High-Quality and Efficient Synthesis of Image-Caption Datasets for Vision-Language Models
by: Liu, Zheng, et al.
Published: (2024)
Bigger is not Always Better: Scaling Properties of Latent Diffusion Models
by: Mei, Kangfu, et al.
Published: (2024)
Is Bigger Always Better? Efficiency Analysis in Resource-Constrained Small Object Detection
by: Mbobda-Kuate, Kwame, et al.
Published: (2026)
Beyond the Vision Encoder: Identifying and Mitigating Spatial Bias in Large Vision-Language Models
by: Zhu, Yingjie, et al.
Published: (2025)
VEGAS: Mitigating Hallucinations in Large Vision-Language Models via Vision-Encoder Attention Guided Adaptive Steering
by: Wang, Zihu, et al.
Published: (2025)
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
by: Gao, Peng, et al.
Published: (2021)
A Survey of Multimodal Large Language Model from A Data-centric Perspective
by: Bai, Tianyi, et al.
Published: (2024)
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
by: Sun, Weigao, et al.
Published: (2025)
MathScape: Benchmarking Multimodal Large Language Models in Real-World Mathematical Contexts
by: Liang, Hao, et al.
Published: (2024)
EVQAScore: A Fine-grained Metric for Video Question Answering Data Quality Evaluation
by: Liang, Hao, et al.
Published: (2024)
Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
by: Li, Yue, et al.
Published: (2025)
Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models
by: Wang, Weihang, et al.
Published: (2025)
GeoDANO: Geometric VLM with Domain Agnostic Vision Encoder
by: Cho, Seunghyuk, et al.
Published: (2025)
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment
by: Li, Lei, et al.
Published: (2024)
Vision Language Models Are Not (Yet) Spelling Correctors
by: Liang, Junhong, et al.
Published: (2025)
An Examination of the Compositionality of Large Generative Vision-Language Models
by: Ma, Teli, et al.
Published: (2023)
Layer-wise Alignment: Examining Safety Alignment Across Image Encoder Layers in Vision Language Models
by: Bachu, Saketh, et al.
Published: (2024)
BiggerGait: Unlocking Gait Recognition with Layer-wise Representations from Large Vision Models
by: Ye, Dingqiang, et al.
Published: (2025)
Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge
by: Liang, Hao, et al.
Published: (2025)
Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs
by: Azadani, Mozhgan Nasr, et al.
Published: (2025)
NPHardEval4V: Dynamic Evaluation of Large Vision-Language Models with Effects of Vision
by: Li, Xiang, et al.
Published: (2024)
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
by: Miranda, Imanol, et al.
Published: (2026)
MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems
by: Zhu, Zifeng, et al.
Published: (2024)
VisionZip: Longer is Better but Not Necessary in Vision Language Models
by: Yang, Senqiao, et al.
Published: (2024)
KeyVideoLLM: Towards Large-scale Video Keyframe Selection
by: Liang, Hao, et al.
Published: (2024)
Mitigating Hallucinations in Large Vision-Language Models by Self-Injecting Hallucinations
by: Lu, Yifan, et al.
Published: (2025)
LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
by: Dai, Yifan, et al.
Published: (2026)
Can We Predict Performance of Large Models across Vision-Language Tasks?
by: Zhao, Qinyu, et al.
Published: (2024)
Visual In-Context Learning for Large Vision-Language Models
by: Zhou, Yucheng, et al.
Published: (2024)
Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models
by: Jiang, Lei, et al.
Published: (2025)
LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving
by: Song, Nan, et al.
Published: (2025)
Can Large Vision-Language Models Understand Multimodal Sarcasm?
by: Wang, Xinyu, et al.
Published: (2025)
A Unified Hallucination Mitigation Framework for Large Vision-Language Models
by: Chang, Yue, et al.
Published: (2024)
Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models
by: Panos, Aristeidis, et al.
Published: (2024)
HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models
by: Wang, Xiao, et al.
Published: (2025)
Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance
by: Zhao, Haozhe, et al.
Published: (2024)
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
by: Zhao, Qinyu, et al.
Published: (2024)
Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models
by: Liang, Qiao, et al.
Published: (2025)
VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
by: Zhang, Ce, et al.
Published: (2025)