:: Library Catalog

Bibliographic Details
Main Authors:	Li, Bozhou; Liang, Hao; Meng, Zimo; Zhang, Wentao
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2408.00620

Similar Items

ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models
by: Li, Bozhou, et al.
Published: (2025)

SynthVLM: Towards High-Quality and Efficient Synthesis of Image-Caption Datasets for Vision-Language Models
by: Liu, Zheng, et al.
Published: (2024)

Bigger is not Always Better: Scaling Properties of Latent Diffusion Models
by: Mei, Kangfu, et al.
Published: (2024)

Is Bigger Always Better? Efficiency Analysis in Resource-Constrained Small Object Detection
by: Mbobda-Kuate, Kwame, et al.
Published: (2026)

Beyond the Vision Encoder: Identifying and Mitigating Spatial Bias in Large Vision-Language Models
by: Zhu, Yingjie, et al.
Published: (2025)

VEGAS: Mitigating Hallucinations in Large Vision-Language Models via Vision-Encoder Attention Guided Adaptive Steering
by: Wang, Zihu, et al.
Published: (2025)

CLIP-Adapter: Better Vision-Language Models with Feature Adapters
by: Gao, Peng, et al.
Published: (2021)

A Survey of Multimodal Large Language Model from A Data-centric Perspective
by: Bai, Tianyi, et al.
Published: (2024)

Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
by: Sun, Weigao, et al.
Published: (2025)

MathScape: Benchmarking Multimodal Large Language Models in Real-World Mathematical Contexts
by: Liang, Hao, et al.
Published: (2024)

EVQAScore: A Fine-grained Metric for Video Question Answering Data Quality Evaluation
by: Liang, Hao, et al.
Published: (2024)

Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
by: Li, Yue, et al.
Published: (2025)

Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models
by: Wang, Weihang, et al.
Published: (2025)

GeoDANO: Geometric VLM with Domain Agnostic Vision Encoder
by: Cho, Seunghyuk, et al.
Published: (2025)

VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment
by: Li, Lei, et al.
Published: (2024)

Vision Language Models Are Not (Yet) Spelling Correctors
by: Liang, Junhong, et al.
Published: (2025)

An Examination of the Compositionality of Large Generative Vision-Language Models
by: Ma, Teli, et al.
Published: (2023)

Layer-wise Alignment: Examining Safety Alignment Across Image Encoder Layers in Vision Language Models
by: Bachu, Saketh, et al.
Published: (2024)

BiggerGait: Unlocking Gait Recognition with Layer-wise Representations from Large Vision Models
by: Ye, Dingqiang, et al.
Published: (2025)

Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge
by: Liang, Hao, et al.
Published: (2025)

Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs
by: Azadani, Mozhgan Nasr, et al.
Published: (2025)

NPHardEval4V: Dynamic Evaluation of Large Vision-Language Models with Effects of Vision
by: Li, Xiang, et al.
Published: (2024)

Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
by: Miranda, Imanol, et al.
Published: (2026)

MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems
by: Zhu, Zifeng, et al.
Published: (2024)

VisionZip: Longer is Better but Not Necessary in Vision Language Models
by: Yang, Senqiao, et al.
Published: (2024)

KeyVideoLLM: Towards Large-scale Video Keyframe Selection
by: Liang, Hao, et al.
Published: (2024)

Mitigating Hallucinations in Large Vision-Language Models by Self-Injecting Hallucinations
by: Lu, Yifan, et al.
Published: (2025)

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
by: Dai, Yifan, et al.
Published: (2026)

Can We Predict Performance of Large Models across Vision-Language Tasks?
by: Zhao, Qinyu, et al.
Published: (2024)

Visual In-Context Learning for Large Vision-Language Models
by: Zhou, Yucheng, et al.
Published: (2024)

Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models
by: Jiang, Lei, et al.
Published: (2025)

LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving
by: Song, Nan, et al.
Published: (2025)

Can Large Vision-Language Models Understand Multimodal Sarcasm?
by: Wang, Xinyu, et al.
Published: (2025)

A Unified Hallucination Mitigation Framework for Large Vision-Language Models
by: Chang, Yue, et al.
Published: (2024)

Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models
by: Panos, Aristeidis, et al.
Published: (2024)

HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models
by: Wang, Xiao, et al.
Published: (2025)

Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance
by: Zhao, Haozhe, et al.
Published: (2024)

The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
by: Zhao, Qinyu, et al.
Published: (2024)

Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models
by: Liang, Qiao, et al.
Published: (2025)

VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
by: Zhang, Ce, et al.
Published: (2025)