Library Catalog

Bibliographic Details
Main Authors:	Kim, Yunsoo; Ong, Michal W. S.; Shavick, Alex; Wu, Honghan; Levine, Adam P.
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2509.16326

Similar Items

Hallucination Benchmark in Medical Visual Question Answering
by: Wu, Jinge, et al.
Published: (2024)

Enhancing Human-Computer Interaction in Chest X-ray Analysis using Vision and Language Model with Eye Gaze Patterns
by: Kim, Yunsoo, et al.
Published: (2024)

Look & Mark: Leveraging Radiologist Eye Fixations and Bounding boxes in Multimodal Large Language Models for Chest X-ray Report Generation
by: Kim, Yunsoo, et al.
Published: (2025)

Exploring Multimodal Large Language Models for Radiology Report Error-checking
by: Wu, Jinge, et al.
Published: (2023)

RadEyeVideo: Enhancing general-domain Large Vision Language Model for chest X-ray analysis with video representations of eye gaze
by: Kim, Yunsoo, et al.
Published: (2025)

SLaVA-CXR: Small Language and Vision Assistant for Chest X-ray Report Automation
by: Wu, Jinge, et al.
Published: (2024)

IHC-LLMiner: Automated extraction of tumour immunohistochemical profiles from PubMed abstracts using large language models
by: Kim, Yunsoo, et al.
Published: (2025)

LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments
by: Jiang, Zhaoyang, et al.
Published: (2026)

Vision-centric Token Compression in Large Language Model
by: Xing, Ling, et al.
Published: (2025)

Adversarial Attacks on VQA-NLE: Exposing and Alleviating Inconsistencies in Visual Question Answering Explanations
by: Yeh, Yahsin, et al.
Published: (2025)

BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain
by: Kim, Yunsoo, et al.
Published: (2025)

ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos
by: Phukan, Arpan, et al.
Published: (2024)

ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension
by: Hu, Yizhi, et al.
Published: (2025)

Text-centric Alignment for Multi-Modality Learning
by: Tsai, Yun-Da, et al.
Published: (2024)

E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction
by: Kim, Yunsoo, et al.
Published: (2026)

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
by: Nayak, Shravan, et al.
Published: (2025)

Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning
by: Tu, Yunbin, et al.
Published: (2024)

MedExQA: Medical Question Answering Benchmark with Multiple Explanations
by: Kim, Yunsoo, et al.
Published: (2024)

PathAlign: A vision-language model for whole slide images in histopathology
by: Ahmed, Faruk, et al.
Published: (2024)

A Survey of Multimodal Large Language Model from A Data-centric Perspective
by: Bai, Tianyi, et al.
Published: (2024)

Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation
by: Chaves, Juan Manuel Zambrano, et al.
Published: (2024)

Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
by: Yamabe, Shojiro, et al.
Published: (2025)

LLMs Behind the Scenes: Enabling Narrative Scene Illustration
by: Roemmele, Melissa, et al.
Published: (2025)

Visual Program Distillation with Template-Based Augmentation
by: Shlapentokh-Rothman, Michal, et al.
Published: (2024)

Scaling medical imaging report generation with multimodal reinforcement learning
by: Liu, Qianchu, et al.
Published: (2026)

Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
by: Padlewski, Piotr, et al.
Published: (2024)

GlyphPattern: An Abstract Pattern Recognition Benchmark for Vision-Language Models
by: Wu, Zixuan, et al.
Published: (2024)

Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals
by: Wu, Te-Lin, et al.
Published: (2021)

Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
by: Shlapentokh-Rothman, Michal, et al.
Published: (2026)

Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation
by: Rädsch, Tim, et al.
Published: (2025)

VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding
by: Waheed, Abdul, et al.
Published: (2025)

Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking
by: Beňová, Ivana, et al.
Published: (2024)

Code2Video: A Code-centric Paradigm for Educational Video Generation
by: Chen, Yanzhe, et al.
Published: (2025)

CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks
by: Suharitdamrong, Wish, et al.
Published: (2026)

See It All: Contextualized Late Aggregation for 3D Dense Captioning
by: Kim, Minjung, et al.
Published: (2024)

Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
by: Reka Team, et al.
Published: (2024)

Arctic-TILT. Business Document Understanding at Sub-Billion Scale
by: Borchmann, Łukasz, et al.
Published: (2024)

ESREAL: Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models
by: Kim, Minchan, et al.
Published: (2024)

Video sentence grounding with temporally global textual knowledge
by: Chen, Cai, et al.
Published: (2024)

v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning
by: Chung, Jiwan, et al.
Published: (2025)