| Main Authors: | Wu, Yiqi, Hu, Xiaodan, Fu, Ziming, Zhou, Siling, Li, Jiangong |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2406.09781 |
Similar Items
MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation
by: Xu, Lijian, et al.
Published: (2024)
Do large language vision models understand 3D shapes?
by: Eppel, Sagi
Published: (2024)
Evaluating point-light biological motion in multimodal large language models
by: Kadambi, Akila, et al.
Published: (2025)
In-context learning enables multimodal large language models to classify cancer pathology images
by: Ferber, Dyke, et al.
Published: (2024)
LLaVAction: evaluating and training multi-modal large language models for action understanding
by: Qi, Haozhe, et al.
Published: (2025)
Chain-of-Caption: Training-free improvement of multimodal large language model on referring expression comprehension
by: Pang, Yik Lung, et al.
Published: (2026)
A benchmark multimodal oro-dental dataset for large vision-language models
by: Lv, Haoxin, et al.
Published: (2025)
On the robustness of multimodal language model towards distractions
by: Liu, Ming, et al.
Published: (2025)
Visual concept ranking uncovers medical shortcuts used by large multimodal models
by: Janizek, Joseph D., et al.
Published: (2026)
Beyond the Hype: A dispassionate look at vision-language models in medical scenario
by: Nan, Yang, et al.
Published: (2024)
Visual representations in the human brain are aligned with large language models
by: Doerig, Adrien, et al.
Published: (2022)
MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output
by: Chen, Yanyuan, et al.
Published: (2025)
Human-like object concept representations emerge naturally in multimodal large language models
by: Du, Changde, et al.
Published: (2024)
When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis
by: Zhang, Ruixuan, et al.
Published: (2025)
SalsaAgent: A multimodal embodied language model for interactive dance generation
by: Yazdian, Payam Jome, et al.
Published: (2026)
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
by: Wu, Wenhao, et al.
Published: (2023)
Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability
by: Li, Ning, et al.
Published: (2025)
EHWGesture -- A dataset for multimodal understanding of clinical gestures
by: Amprimo, Gianluca, et al.
Published: (2025)
Visual hallucination detection in large vision-language models via evidential conflict
by: Huang, Tao, et al.
Published: (2025)
Visual Interestingness Decoded: How GPT-4o Mirrors Human Interests
by: Abdullahu, Fitim, et al.
Published: (2025)
Building and better understanding vision-language models: insights and future directions
by: Laurençon, Hugo, et al.
Published: (2024)
An Empirical Study of GPT-4o Image Generation Capabilities
by: Chen, Sixiang, et al.
Published: (2025)
ViSTa Dataset: Do vision-language models understand sequential tasks?
by: Wybitul, Evžen, et al.
Published: (2024)
ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval
by: Fanelli, Nicola, et al.
Published: (2025)
Attacks on multimodal models
by: Iablochnikov, Viacheslav, et al.
Published: (2024)
GPT as Psychologist? Preliminary Evaluations for GPT-4V on Visual Affective Computing
by: Lu, Hao, et al.
Published: (2024)
What do vision-language models see in the context? Investigating multimodal in-context learning
by: Santos, Gabriel O. dos, et al.
Published: (2025)
Assessing the alignment between infants' visual and linguistic experience using multimodal language models
by: Tan, Alvin Wei Ming, et al.
Published: (2025)
Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
by: Padlewski, Piotr, et al.
Published: (2024)
A Cross-Hierarchical Difference Feature Fusion Network Based on Multiscale Encoder-Decoder for Hyperspectral Change Detection
by: Sheng, Mingshuai, et al.
Published: (2025)
Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine
by: Jin, Qiao, et al.
Published: (2024)
Explaining latent representations of generative models with large multimodal models
by: Zhu, Mengdan, et al.
Published: (2024)
ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter
by: Yuan, Zhengqing, et al.
Published: (2023)
Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models
by: Guo, Haonan, et al.
Published: (2024)
IQAGPT: Image Quality Assessment with Vision-language and ChatGPT Models
by: Chen, Zhihao, et al.
Published: (2023)
Automatic benchmarking of large multimodal models via iterative experiment programming
by: Conti, Alessandro, et al.
Published: (2024)
Physics-Inspired Modeling and Content Adaptive Routing in an Infrared Gas Leak Detection Network
by: Li, Dongsheng, et al.
Published: (2025)
Teaching large language models to reason like expert diagnosticians
by: Buckley, Thomas A., et al.
Published: (2025)
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
by: Yan, Zhiyuan, et al.
Published: (2025)
P4Q: Learning to Prompt for Quantization in Visual-language Models
by: Sun, Huixin, et al.
Published: (2024)