:: Library Catalog

Bibliographic Details
Main Authors:	Burapacheep, Jirayu; Gaur, Ishan; Bhatia, Agam; Thrush, Tristan
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2402.04492

Similar Items

Nearest Neighbor Normalization Improves Multimodal Retrieval
by: Chowdhury, Neil, et al.
Published: (2024)

ColorFoil: Investigating Color Blindness in Large Vision and Language Models
by: Samin, Ahnaf Mozib, et al.
Published: (2024)

ColorConceptBench: A Benchmark for Probabilistic Color-Concept Understanding in Text-to-Image Models
by: Ruan, Chenxi, et al.
Published: (2026)

ColorBlindnessEval: Can Vision-Language Models Pass Color Blindness Tests?
by: Ling, Zijian, et al.
Published: (2025)

ARGS: Alignment as Reward-Guided Search
by: Khanov, Maxim, et al.
Published: (2024)

ERIT Lightweight Multimodal Dataset for Elderly Emotion Recognition and Multimodal Fusion Evaluation
by: Frieske, Rita, et al.
Published: (2024)

VideoConviction: A Multimodal Benchmark for Human Conviction and Stock Market Recommendations
by: Galarnyk, Michael, et al.
Published: (2025)

Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations
by: Moll, Johannes, et al.
Published: (2025)

Leveraging Vision-Language Pre-training for Human Activity Recognition in Still Images
by: Mahanta, Cristina, et al.
Published: (2025)

The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models
by: Paruchuri, Akshay, et al.
Published: (2026)

Hummus: A Dataset of Humorous Multimodal Metaphor Use
by: Tong, Xiaoyu, et al.
Published: (2025)

Qalam: A Multimodal LLM for Arabic Optical Character and Handwriting Recognition
by: Bhatia, Gagan, et al.
Published: (2024)

FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues
by: Li, Shuang, et al.
Published: (2024)

Beyond Words: Multimodal LLM Knows When to Speak
by: Liao, Zikai, et al.
Published: (2025)

What Color Scheme is More Effective in Assisting Readers to Locate Information in a Color-Coded Article?
by: Ng, Ho Yin, et al.
Published: (2024)

ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
by: Liang, Yijun, et al.
Published: (2025)

A Grounded Typology of Word Classes
by: Haley, Coleman, et al.
Published: (2024)

A Thousand Words or An Image: Studying the Influence of Persona Modality in Multimodal LLMs
by: Broomfield, Julius, et al.
Published: (2025)

Multimodal Human-AI Synergy for Medical Imaging Quality Control: A Hybrid Intelligence Framework with Adaptive Dataset Curation and Closed-Loop Evaluation
by: Qin, Zhi, et al.
Published: (2025)

Decoding Emotions in Abstract Art: Cognitive Plausibility of CLIP in Recognizing Color-Emotion Associations
by: Widhoelzl, Hanna-Sophia, et al.
Published: (2024)

Towards Patronizing and Condescending Language in Chinese Videos: A Multimodal Dataset and Detector
by: Wang, Hongbo, et al.
Published: (2024)

Diagnosing Vision Language Models' Perception by Leveraging Human Methods for Color Vision Deficiencies
by: Hayashi, Kazuki, et al.
Published: (2025)

ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly
by: Hasegawa, Kimihiro, et al.
Published: (2025)

WordVIS: A Color Worth A Thousand Words
by: Khan, Umar, et al.
Published: (2024)

Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
by: Yun, Sukmin, et al.
Published: (2024)

Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
by: Li, Lei, et al.
Published: (2024)

PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering
by: Ding, Yihao, et al.
Published: (2024)

VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual Question Answering
by: Sood, Ekta, et al.
Published: (2021)

From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models
by: Bhatia, Mehar, et al.
Published: (2024)

CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation
by: Wang, Yuxuan, et al.
Published: (2024)

LLaVA-Critic: Learning to Evaluate Multimodal Models
by: Xiong, Tianyi, et al.
Published: (2024)

EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents
by: Cheng, Zhili, et al.
Published: (2025)

Control Color: Multimodal Diffusion-based Interactive Image Colorization
by: Liang, Zhexin, et al.
Published: (2024)

Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation
by: Wang, Xintong, et al.
Published: (2025)

From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs
by: Toschi, Federico, et al.
Published: (2026)

Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
by: Pennec, Galann, et al.
Published: (2025)

Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?
by: Song, Yingjin, et al.
Published: (2025)

ImageInWords: Unlocking Hyper-Detailed Image Descriptions
by: Garg, Roopal, et al.
Published: (2024)

Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
by: Xiang, Sike, et al.
Published: (2026)

VisText-Mosquito: A Unified Multimodal Dataset for Visual Detection, Segmentation, and Textual Explanation on Mosquito Breeding Sites
by: Islam, Md. Adnanul, et al.
Published: (2025)