Staff View: Visual Affect Analysis: Predicting Emotions of Image Viewers with Vision-Language Models :: Library Catalog


Bibliographic Details
Main Authors:	Nowicki, Filip; Marciniak, Hubert; Łączkowski, Jakub; Jassem, Krzysztof; Górecki, Tomasz; Balakrishnan, Vimala; Ong, Desmond C.; Behnke, Maciej
Format:	Preprint
Published:	2026
Subjects:	Human-Computer Interaction; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2602.00123

_version_	1866910006590308352
author	Nowicki, Filip; Marciniak, Hubert; Łączkowski, Jakub; Jassem, Krzysztof; Górecki, Tomasz; Balakrishnan, Vimala; Ong, Desmond C.; Behnke, Maciej
author_facet	Nowicki, Filip; Marciniak, Hubert; Łączkowski, Jakub; Jassem, Krzysztof; Górecki, Tomasz; Balakrishnan, Vimala; Ong, Desmond C.; Behnke, Maciej
contents	Vision-language models (VLMs) show promise as tools for inferring affect from visual stimuli at scale, but it is not yet clear how closely their outputs align with human affective ratings. We benchmarked nine VLMs, ranging from state-of-the-art proprietary models to open-source models, on three psychometrically validated affective image datasets: the International Affective Picture System, the Nencki Affective Picture System, and the Library of AI-Generated Affective Images (LAI-GAI). The models performed two tasks in the zero-shot setting: (i) top-emotion classification (selecting the strongest discrete emotion elicited by an image) and (ii) continuous prediction of human ratings on 1-7/9 Likert scales for discrete emotion categories and affective dimensions. We also evaluated the impact of rater-conditioned prompting on the LAI-GAI dataset using de-identified participant metadata. The results show good performance in discrete emotion classification, with accuracies typically ranging from 60% to 80% on six-emotion labels and from 60% to 75% on a more challenging 12-category task. Predictions of anger and surprise had the lowest accuracy in all datasets. For continuous rating prediction, models showed moderate to strong alignment with humans (r > 0.75) but also exhibited consistent biases, notably weaker performance on arousal and a tendency to overestimate response strength. Rater-conditioned prompting resulted in only small, inconsistent changes in predictions. Overall, VLMs capture broad affective trends but lack the nuance found in validated psychological ratings, highlighting their potential and current limitations for affective computing and mental health-related applications.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_00123
institution	arXiv
publishDate	2026
record_format	arxiv
title	Visual Affect Analysis: Predicting Emotions of Image Viewers with Vision-Language Models
topic	Human-Computer Interaction; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2602.00123
