:: Library Catalog

Bibliographic Details
Main Authors:	Wu, Yiqi; Hu, Xiaodan; Fu, Ziming; Zhou, Siling; Li, Jiangong
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2406.09781

Similar Items

MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation
by: Xu, Lijian, et al.
Published: (2024)

Do large language vision models understand 3D shapes?
by: Eppel, Sagi
Published: (2024)

Evaluating point-light biological motion in multimodal large language models
by: Kadambi, Akila, et al.
Published: (2025)

In-context learning enables multimodal large language models to classify cancer pathology images
by: Ferber, Dyke, et al.
Published: (2024)

LLaVAction: evaluating and training multi-modal large language models for action understanding
by: Qi, Haozhe, et al.
Published: (2025)

Chain-of-Caption: Training-free improvement of multimodal large language model on referring expression comprehension
by: Pang, Yik Lung, et al.
Published: (2026)

A benchmark multimodal oro-dental dataset for large vision-language models
by: Lv, Haoxin, et al.
Published: (2025)

On the robustness of multimodal language model towards distractions
by: Liu, Ming, et al.
Published: (2025)

Visual concept ranking uncovers medical shortcuts used by large multimodal models
by: Janizek, Joseph D., et al.
Published: (2026)

Beyond the Hype: A dispassionate look at vision-language models in medical scenario
by: Nan, Yang, et al.
Published: (2024)

Visual representations in the human brain are aligned with large language models
by: Doerig, Adrien, et al.
Published: (2022)

MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output
by: Chen, Yanyuan, et al.
Published: (2025)

Human-like object concept representations emerge naturally in multimodal large language models
by: Du, Changde, et al.
Published: (2024)

When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis
by: Zhang, Ruixuan, et al.
Published: (2025)

SalsaAgent: A multimodal embodied language model for interactive dance generation
by: Yazdian, Payam Jome, et al.
Published: (2026)

GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
by: Wu, Wenhao, et al.
Published: (2023)

Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability
by: Li, Ning, et al.
Published: (2025)

EHWGesture -- A dataset for multimodal understanding of clinical gestures
by: Amprimo, Gianluca, et al.
Published: (2025)

Visual hallucination detection in large vision-language models via evidential conflict
by: Huang, Tao, et al.
Published: (2025)

Visual Interestingness Decoded: How GPT-4o Mirrors Human Interests
by: Abdullahu, Fitim, et al.
Published: (2025)

Building and better understanding vision-language models: insights and future directions
by: Laurençon, Hugo, et al.
Published: (2024)

An Empirical Study of GPT-4o Image Generation Capabilities
by: Chen, Sixiang, et al.
Published: (2025)

ViSTa Dataset: Do vision-language models understand sequential tasks?
by: Wybitul, Evžen, et al.
Published: (2024)

ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval
by: Fanelli, Nicola, et al.
Published: (2025)

Attacks on multimodal models
by: Iablochnikov, Viacheslav, et al.
Published: (2024)

GPT as Psychologist? Preliminary Evaluations for GPT-4V on Visual Affective Computing
by: Lu, Hao, et al.
Published: (2024)

What do vision-language models see in the context? Investigating multimodal in-context learning
by: Santos, Gabriel O. dos, et al.
Published: (2025)

Assessing the alignment between infants' visual and linguistic experience using multimodal language models
by: Tan, Alvin Wei Ming, et al.
Published: (2025)

Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
by: Padlewski, Piotr, et al.
Published: (2024)

A Cross-Hierarchical Difference Feature Fusion Network Based on Multiscale Encoder-Decoder for Hyperspectral Change Detection
by: Sheng, Mingshuai, et al.
Published: (2025)

Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine
by: Jin, Qiao, et al.
Published: (2024)

Explaining latent representations of generative models with large multimodal models
by: Zhu, Mengdan, et al.
Published: (2024)

ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter
by: Yuan, Zhengqing, et al.
Published: (2023)

Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models
by: Guo, Haonan, et al.
Published: (2024)

IQAGPT: Image Quality Assessment with Vision-language and ChatGPT Models
by: Chen, Zhihao, et al.
Published: (2023)

Automatic benchmarking of large multimodal models via iterative experiment programming
by: Conti, Alessandro, et al.
Published: (2024)

Physics-Inspired Modeling and Content Adaptive Routing in an Infrared Gas Leak Detection Network
by: Li, Dongsheng, et al.
Published: (2025)

Teaching large language models to reason like expert diagnosticians
by: Buckley, Thomas A., et al.
Published: (2025)

GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
by: Yan, Zhiyuan, et al.
Published: (2025)

P4Q: Learning to Prompt for Quantization in Visual-language Models
by: Sun, Huixin, et al.
Published: (2024)