Staff View: When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation :: Library Catalog

Bibliographic Details
Main Authors:	Kapur, Rhea; Hawkins, Robert; Kreiss, Elisa
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2601.04609

_version_	1866911605265006592
author	Kapur, Rhea; Hawkins, Robert; Kreiss, Elisa
author_facet	Kapur, Rhea; Hawkins, Robert; Kreiss, Elisa
contents	Vision-language models (VLMs) are increasingly used to make visual content accessible via text-based descriptions. In current systems, however, the specificity of descriptions is often conflated with their length. We argue that these two concepts must be disentangled: descriptions can be concise yet dense with information, or lengthy yet vacuous. We define specificity relative to a contrast set, where a description is more specific to the extent that it picks out the target image better than other possible images. We construct a dataset that controls for length while varying information content, and validate that people reliably prefer more specific descriptions regardless of length. We find that controlling for length alone cannot account for differences in specificity: how the length budget is allocated makes a difference. These results support evaluation approaches that directly prioritize specificity over verbosity.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_04609
institution	arXiv
publishDate	2026
record_format	arxiv
title	When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation
topic	Computation and Language
url	https://arxiv.org/abs/2601.04609

Similar Items