| Main Authors: | Kim, Yunsoo, Ong, Michal W. S., Shavick, Alex, Wu, Honghan, Levine, Adam P. |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.16326 |
Similar Items
Hallucination Benchmark in Medical Visual Question Answering
by: Wu, Jinge, et al.
Published: (2024)
Enhancing Human-Computer Interaction in Chest X-ray Analysis using Vision and Language Model with Eye Gaze Patterns
by: Kim, Yunsoo, et al.
Published: (2024)
Look & Mark: Leveraging Radiologist Eye Fixations and Bounding boxes in Multimodal Large Language Models for Chest X-ray Report Generation
by: Kim, Yunsoo, et al.
Published: (2025)
Exploring Multimodal Large Language Models for Radiology Report Error-checking
by: Wu, Jinge, et al.
Published: (2023)
RadEyeVideo: Enhancing general-domain Large Vision Language Model for chest X-ray analysis with video representations of eye gaze
by: Kim, Yunsoo, et al.
Published: (2025)
SLaVA-CXR: Small Language and Vision Assistant for Chest X-ray Report Automation
by: Wu, Jinge, et al.
Published: (2024)
IHC-LLMiner: Automated extraction of tumour immunohistochemical profiles from PubMed abstracts using large language models
by: Kim, Yunsoo, et al.
Published: (2025)
LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments
by: Jiang, Zhaoyang, et al.
Published: (2026)
Vision-centric Token Compression in Large Language Model
by: Xing, Ling, et al.
Published: (2025)
Adversarial Attacks on VQA-NLE: Exposing and Alleviating Inconsistencies in Visual Question Answering Explanations
by: Yeh, Yahsin, et al.
Published: (2025)
BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain
by: Kim, Yunsoo, et al.
Published: (2025)
ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos
by: Phukan, Arpan, et al.
Published: (2024)
ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension
by: Hu, Yizhi, et al.
Published: (2025)
Text-centric Alignment for Multi-Modality Learning
by: Tsai, Yun-Da, et al.
Published: (2024)
E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction
by: Kim, Yunsoo, et al.
Published: (2026)
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
by: Nayak, Shravan, et al.
Published: (2025)
Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning
by: Tu, Yunbin, et al.
Published: (2024)
MedExQA: Medical Question Answering Benchmark with Multiple Explanations
by: Kim, Yunsoo, et al.
Published: (2024)
PathAlign: A vision-language model for whole slide images in histopathology
by: Ahmed, Faruk, et al.
Published: (2024)
A Survey of Multimodal Large Language Model from A Data-centric Perspective
by: Bai, Tianyi, et al.
Published: (2024)
Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation
by: Chaves, Juan Manuel Zambrano, et al.
Published: (2024)
Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
by: Yamabe, Shojiro, et al.
Published: (2025)
LLMs Behind the Scenes: Enabling Narrative Scene Illustration
by: Roemmele, Melissa, et al.
Published: (2025)
Visual Program Distillation with Template-Based Augmentation
by: Shlapentokh-Rothman, Michal, et al.
Published: (2024)
Scaling medical imaging report generation with multimodal reinforcement learning
by: Liu, Qianchu, et al.
Published: (2026)
Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
by: Padlewski, Piotr, et al.
Published: (2024)
GlyphPattern: An Abstract Pattern Recognition Benchmark for Vision-Language Models
by: Wu, Zixuan, et al.
Published: (2024)
Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals
by: Wu, Te-Lin, et al.
Published: (2021)
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
by: Shlapentokh-Rothman, Michal, et al.
Published: (2026)
Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation
by: Rädsch, Tim, et al.
Published: (2025)
VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding
by: Waheed, Abdul, et al.
Published: (2025)
Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking
by: Beňová, Ivana, et al.
Published: (2024)
Code2Video: A Code-centric Paradigm for Educational Video Generation
by: Chen, Yanzhe, et al.
Published: (2025)
CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks
by: Suharitdamrong, Wish, et al.
Published: (2026)
See It All: Contextualized Late Aggregation for 3D Dense Captioning
by: Kim, Minjung, et al.
Published: (2024)
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
by: Reka Team, et al.
Published: (2024)
Arctic-TILT. Business Document Understanding at Sub-Billion Scale
by: Borchmann, Łukasz, et al.
Published: (2024)
ESREAL: Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models
by: Kim, Minchan, et al.
Published: (2024)
Video sentence grounding with temporally global textual knowledge
by: Chen, Cai, et al.
Published: (2024)
v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning
by: Chung, Jiwan, et al.
Published: (2025)