Bibliographic Details
Main Authors: Hergert, Lea, Berend, Gábor, Szegedy, Mario, Turan, Gyorgy, Jelasity, Márk
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2511.12728
_version_ 1866915621494587392
author Hergert, Lea
Berend, Gábor
Szegedy, Mario
Turan, Gyorgy
Jelasity, Márk
contents Large language models (LLMs) achieve superhuman performance on complex reasoning tasks, yet often fail on much simpler problems, raising concerns about their reliability and interpretability. We investigate this paradox through a focused study with two key design features: simplicity, to expose basic failure modes, and scale, to enable comprehensive controlled experiments. We focus on set membership queries, among the most fundamental forms of reasoning, using tasks like "Is apple an element of the set {pear, plum, apple, raspberry}?". We conduct a systematic empirical evaluation across prompt phrasing, semantic structure, element ordering, and model choice. Our large-scale analysis reveals that LLM performance on this elementary task is consistently brittle and unpredictable across all dimensions, suggesting that the models' "understanding" of the set concept is fragmented and convoluted at best. Our work demonstrates that the large-scale experiments enabled by the simplicity of the problem allow us to map and analyze the failure modes comprehensively, making this approach a valuable methodology for LLM evaluation in general.
format Preprint
id arxiv_https___arxiv_org_abs_2511_12728
institution arXiv
publishDate 2025
record_format arxiv
title On the Brittleness of LLMs: A Journey around Set Membership
topic Computation and Language
url https://arxiv.org/abs/2511.12728
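
The abstract describes varying prompt phrasing and element ordering for set membership queries. The following is a minimal sketch of how such prompt variants could be enumerated, not the authors' actual setup: the two templates, the `TEMPLATES` list, and the `membership_prompts` helper are hypothetical and chosen only for illustration; the record does not give the paper's real phrasings or models.

```python
from itertools import permutations

# Hypothetical phrasings; the paper's actual templates are not listed in this record.
TEMPLATES = [
    "Is {item} an element of the set {{{members}}}?",
    "Does the set {{{members}}} contain {item}?",
]


def membership_prompts(item, members, templates=TEMPLATES):
    """Yield (prompt, expected_answer) for every template and every element ordering."""
    expected = item in members  # ground truth does not depend on phrasing or order
    for template in templates:
        for ordering in permutations(members):
            prompt = template.format(item=item, members=", ".join(ordering))
            yield prompt, expected


# Example from the abstract: 2 templates x 4! orderings = 48 prompt variants.
prompts = list(membership_prompts("apple", ["pear", "plum", "apple", "raspberry"]))
negative = list(membership_prompts("kiwi", ["pear", "plum"]))
```

Because the expected answer is identical for every variant of the same query, any disagreement among a model's answers across these variants would indicate the kind of brittleness the abstract reports.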