Staff View: On the Brittleness of LLMs: A Journey around Set Membership :: Library Catalog

Bibliographic Details
Main Authors:	Hergert, Lea; Berend, Gábor; Szegedy, Mario; Turan, Gyorgy; Jelasity, Márk
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2511.12728

_version_	1866915621494587392
author	Hergert, Lea; Berend, Gábor; Szegedy, Mario; Turan, Gyorgy; Jelasity, Márk
author_facet	Hergert, Lea; Berend, Gábor; Szegedy, Mario; Turan, Gyorgy; Jelasity, Márk
contents	Large language models (LLMs) achieve superhuman performance on complex reasoning tasks, yet often fail on much simpler problems, raising concerns about their reliability and interpretability. We investigate this paradox through a focused study with two key design features: simplicity, to expose basic failure modes, and scale, to enable comprehensive controlled experiments. We focus on set membership queries -- among the most fundamental forms of reasoning -- using tasks like "Is apple an element of the set {pear, plum, apple, raspberry}?". We conduct a systematic empirical evaluation across prompt phrasing, semantic structure, element ordering, and model choice. Our large-scale analysis reveals that LLM performance on this elementary task is consistently brittle and unpredictable across all dimensions, suggesting that the models' "understanding" of the set concept is fragmented and convoluted at best. Our work demonstrates that the large-scale experiments enabled by the simplicity of the problem allow us to map and analyze the failure modes comprehensively, making this approach a valuable methodology for LLM evaluation in general.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_12728
institution	arXiv
publishDate	2025
record_format	arxiv
title	On the Brittleness of LLMs: A Journey around Set Membership
topic	Computation and Language
url	https://arxiv.org/abs/2511.12728
