Bibliographic Details
Main Authors: Kapur, Rhea; Hawkins, Robert; Kreiss, Elisa
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2601.04609
_version_ 1866911605265006592
author Kapur, Rhea
Hawkins, Robert
Kreiss, Elisa
author_facet Kapur, Rhea
Hawkins, Robert
Kreiss, Elisa
contents Vision-language models (VLMs) are increasingly used to make visual content accessible via text-based descriptions. In current systems, however, description specificity is often conflated with description length. We argue that these two concepts must be disentangled: descriptions can be concise yet dense with information, or lengthy yet vacuous. We define specificity relative to a contrast set, where a description is more specific to the extent that it picks out the target image better than other possible images. We construct a dataset that controls for length while varying information content, and validate that people reliably prefer more specific descriptions regardless of length. We find that controlling for length alone cannot account for differences in specificity: how the length budget is allocated makes a difference. These results support evaluation approaches that directly prioritize specificity over verbosity.
format Preprint
id arxiv_https___arxiv_org_abs_2601_04609
institution arXiv
publishDate 2026
record_format arxiv
title When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation
topic Computation and Language
url https://arxiv.org/abs/2601.04609