Staff View: Don't Judge a Book by its Cover: Testing LLMs' Robustness Under Logical Obfuscation :: Library Catalog

Bibliographic Details
Main Authors:	Borah, Abhilekh; Ghosh, Shubhra; Joshi, Kedar; Guru, Aditya Kumar; Ghosh, Kripabandhu
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2602.01132

_version_	1866917239481958400
author	Borah, Abhilekh; Ghosh, Shubhra; Joshi, Kedar; Guru, Aditya Kumar; Ghosh, Kripabandhu
author_facet	Borah, Abhilekh; Ghosh, Shubhra; Joshi, Kedar; Guru, Aditya Kumar; Ghosh, Kripabandhu
contents	Tasks such as solving arithmetic equations, evaluating truth tables, and completing syllogisms are handled well by large language models (LLMs) in their standard form, but the models often fail when the same problems are posed in logically equivalent yet obfuscated formats. To study this vulnerability, we introduce Logifus, a structure-preserving logical obfuscation framework, and use it to build LogiQAte, a first-of-its-kind diagnostic benchmark with 1,108 questions across four reasoning tasks: (i) Obfus FOL (first-order logic entailment under equivalence-preserving rewrites), (ii) Obfus Blood Relation (family-graph entailment under indirect relational chains), (iii) Obfus Number Series (pattern induction under symbolic substitutions), and (iv) Obfus Direction Sense (navigation reasoning under altered directions and reference frames). Evaluating six state-of-the-art models across all four tasks, we find that obfuscation severely degrades zero-shot performance, which drops on average by 47% for GPT-4o, 27% for GPT-5, and 22% for the reasoning model o4-mini. Our findings reveal that current LLMs parse questions without deep understanding, highlighting the urgency of building models that genuinely comprehend and preserve meaning beyond surface form.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_01132
institution	arXiv
publishDate	2026
record_format	arxiv
title	Don't Judge a Book by its Cover: Testing LLMs' Robustness Under Logical Obfuscation
topic	Computation and Language
url	https://arxiv.org/abs/2602.01132
