Bibliographic Details
Main Authors: Ortega, Jorge Chang; Lan, Bastien Le; Serre, Thomas; Boutin, Victor
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access: https://arxiv.org/abs/2605.23819
_version_ 1866913156994957312
author Ortega, Jorge Chang
Lan, Bastien Le
Serre, Thomas
Boutin, Victor
contents A central question in computational vision is whether human-like visual representations are better explained by discriminative or generative learning. Existing comparisons, however, often confound the learning objective with architecture, scale, and training data, leaving open whether the objective itself drives alignment. We address this confound using Joint Energy-Based Models (JEMs), which interpolate continuously between discriminative and generative training within a fixed architecture. By varying a single mixing coefficient, we isolate the effect of the learning objective and evaluate the resulting models across six human-alignment benchmarks spanning perceptual similarity, gloss perception, human response uncertainty, robustness, shape-texture cue conflict, and diagnostic feature attribution. Across this diverse suite, human alignment is consistently maximized at intermediate points of the generative-discriminative continuum, rather than at either endpoint. Hybrid JEMs combine the categorical structure induced by discriminative learning with the sensitivity to input structure induced by generative learning, yielding more human-like behavior across multiple levels of vision. These results suggest that the generative-discriminative dichotomy is the wrong axis for understanding human-aligned vision: alignment emerges not from choosing one objective over the other, but from balancing both.
format Preprint
id arxiv_https___arxiv_org_abs_2605_23819
institution arXiv
publishDate 2026
record_format arxiv
title Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2605.23819