:: Library Catalog

Bibliographic Details
Main Author:	Lan, HaoTian
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2506.05087

Similar Items

Parking, Perception, and Retail: Street-Level Determinants of Community Vitality in Harbin
by: Lan, HaoTian
Published: (2025)

A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling
by: Buettner, Kyle, et al.
Published: (2025)

Urban Safety Perception Assessments via Integrating Multimodal Large Language Models with Street View Images
by: Zhang, Jiaxin, et al.
Published: (2024)

Multimodal Arabic Captioning with Interpretable Visual Concept Integration
by: Elchafei, Passant, et al.
Published: (2025)

Multimodal Integration of Human-Like Attention in Visual Question Answering
by: Sood, Ekta, et al.
Published: (2021)

Interleaved Latent Visual Reasoning with Selective Perceptual Modeling
by: Dong, Shuai, et al.
Published: (2025)

Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts
by: Nooralahzadeh, Farhad, et al.
Published: (2026)

Optimizing Multimodal Language Models through Attention-based Interpretability
by: Sergeev, Alexander, et al.
Published: (2025)

Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning
by: Zhao, Zhixian, et al.
Published: (2026)

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning
by: Ma, Chuang, et al.
Published: (2026)

CHART-6: Human-Centered Evaluation of Data Visualization Understanding in Vision-Language Models
by: Verma, Arnav, et al.
Published: (2025)

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
by: Hu, Yushi, et al.
Published: (2024)

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
by: Guan, Tianrui, et al.
Published: (2023)

MMRQA: Signal-Enhanced Multimodal Large Language Models for MRI Quality Assessment
by: Jia, Fankai, et al.
Published: (2025)

Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration
by: Park, ChaeHun, et al.
Published: (2024)

MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model
by: Jiang, Chaoya, et al.
Published: (2024)

HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
by: Zhang, Wenqiao, et al.
Published: (2024)

BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models
by: Arnould, Ludovic, et al.
Published: (2025)

CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models
by: Luo, Fuwen, et al.
Published: (2024)

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
by: Wang, Zhaokai, et al.
Published: (2025)

OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models
by: Dong, Xuanzhao, et al.
Published: (2026)

MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems
by: Li, Kaixin, et al.
Published: (2024)

Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment
by: Li, Yunxin, et al.
Published: (2024)

Diagnosing Urban Street Vitality via a Visual-Semantic and Spatiotemporal Framework for Street-Level Economics
by: Zhuo, Xinxin, et al.
Published: (2026)

AVG-LLaVA: An Efficient Large Multimodal Model with Adaptive Visual Granularity
by: Lan, Zhibin, et al.
Published: (2024)

ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
by: Yuan, Qianhao, et al.
Published: (2025)

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
by: Li, Yifan, et al.
Published: (2024)

GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices
by: Lu, Xudong, et al.
Published: (2025)

VICCA: Visual Interpretation and Comprehension of Chest X-ray Anomalies in Generated Report Without Human Feedback
by: Picha, Sayeh Gholipour, et al.
Published: (2025)

Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
by: Luo, Gen, et al.
Published: (2025)

Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
by: Wang, Weizhi, et al.
Published: (2024)

DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model
by: Yang, Hao, et al.
Published: (2026)

DamageArbiter: A CLIP-Enhanced Multimodal Arbitration Framework for Hurricane Damage Assessment from Street-View Imagery
by: Yang, Yifan, et al.
Published: (2026)

EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models
by: Tan, Zhiyu, et al.
Published: (2024)

From Street View to Visual Network: Mapping the Visibility of Urban Landmarks with Vision-Language Models
by: Fan, Zicheng, et al.
Published: (2025)

Graph-Driven Multimodal Feature Learning Framework for Apparent Personality Assessment
by: Wang, Kangsheng, et al.
Published: (2025)

EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large Language Models
by: Xing, Shangyu, et al.
Published: (2024)

Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
by: Li, Chenxu, et al.
Published: (2025)

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
by: Luo, Gen, et al.
Published: (2024)

Using Multimodal Deep Neural Networks to Disentangle Language from Visual Aesthetics
by: Conwell, Colin, et al.
Published: (2024)