| Main Author: | Lan, HaoTian |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.05087 |
Similar Items
Parking, Perception, and Retail: Street-Level Determinants of Community Vitality in Harbin
by: Lan, HaoTian
Published: (2025)
A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling
by: Buettner, Kyle, et al.
Published: (2025)
Urban Safety Perception Assessments via Integrating Multimodal Large Language Models with Street View Images
by: Zhang, Jiaxin, et al.
Published: (2024)
Multimodal Arabic Captioning with Interpretable Visual Concept Integration
by: Elchafei, Passant, et al.
Published: (2025)
Multimodal Integration of Human-Like Attention in Visual Question Answering
by: Sood, Ekta, et al.
Published: (2021)
Interleaved Latent Visual Reasoning with Selective Perceptual Modeling
by: Dong, Shuai, et al.
Published: (2025)
Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts
by: Nooralahzadeh, Farhad, et al.
Published: (2026)
Optimizing Multimodal Language Models through Attention-based Interpretability
by: Sergeev, Alexander, et al.
Published: (2025)
Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning
by: Zhao, Zhixian, et al.
Published: (2026)
Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning
by: Ma, Chuang, et al.
Published: (2026)
CHART-6: Human-Centered Evaluation of Data Visualization Understanding in Vision-Language Models
by: Verma, Arnav, et al.
Published: (2025)
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
by: Hu, Yushi, et al.
Published: (2024)
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
by: Guan, Tianrui, et al.
Published: (2023)
MMRQA: Signal-Enhanced Multimodal Large Language Models for MRI Quality Assessment
by: Jia, Fankai, et al.
Published: (2025)
Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration
by: Park, ChaeHun, et al.
Published: (2024)
MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model
by: Jiang, Chaoya, et al.
Published: (2024)
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
by: Zhang, Wenqiao, et al.
Published: (2024)
BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models
by: Arnould, Ludovic, et al.
Published: (2025)
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models
by: Luo, Fuwen, et al.
Published: (2024)
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
by: Wang, Zhaokai, et al.
Published: (2025)
OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models
by: Dong, Xuanzhao, et al.
Published: (2026)
MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems
by: Li, Kaixin, et al.
Published: (2024)
Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment
by: Li, Yunxin, et al.
Published: (2024)
Diagnosing Urban Street Vitality via a Visual-Semantic and Spatiotemporal Framework for Street-Level Economics
by: Zhuo, Xinxin, et al.
Published: (2026)
AVG-LLaVA: An Efficient Large Multimodal Model with Adaptive Visual Granularity
by: Lan, Zhibin, et al.
Published: (2024)
ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
by: Yuan, Qianhao, et al.
Published: (2025)
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
by: Li, Yifan, et al.
Published: (2024)
GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices
by: Lu, Xudong, et al.
Published: (2025)
VICCA: Visual Interpretation and Comprehension of Chest X-ray Anomalies in Generated Report Without Human Feedback
by: Picha, Sayeh Gholipour, et al.
Published: (2025)
Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
by: Luo, Gen, et al.
Published: (2025)
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
by: Wang, Weizhi, et al.
Published: (2024)
DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model
by: Yang, Hao, et al.
Published: (2026)
DamageArbiter: A CLIP-Enhanced Multimodal Arbitration Framework for Hurricane Damage Assessment from Street-View Imagery
by: Yang, Yifan, et al.
Published: (2026)
EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models
by: Tan, Zhiyu, et al.
Published: (2024)
From Street View to Visual Network: Mapping the Visibility of Urban Landmarks with Vision-Language Models
by: Fan, Zicheng, et al.
Published: (2025)
Graph-Driven Multimodal Feature Learning Framework for Apparent Personality Assessment
by: Wang, Kangsheng, et al.
Published: (2025)
EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large Language Models
by: Xing, Shangyu, et al.
Published: (2024)
Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
by: Li, Chenxu, et al.
Published: (2025)
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
by: Luo, Gen, et al.
Published: (2024)
Using Multimodal Deep Neural Networks to Disentangle Language from Visual Aesthetics
by: Conwell, Colin, et al.
Published: (2024)