Staff View: Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity :: Library Catalog

Bibliographic Details
Main Authors:	Sahoo, Subramanyam; Jain, Vinija; Vats, Saanidhya; Mohapatra, Siddharth; Min, Rui; Chadha, Aman; Chaudhary, Divya
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2512.00552

_version_	1866909934304624640
author	Sahoo, Subramanyam; Jain, Vinija; Vats, Saanidhya; Mohapatra, Siddharth; Min, Rui; Chadha, Aman; Chaudhary, Divya
author_facet	Sahoo, Subramanyam; Jain, Vinija; Vats, Saanidhya; Mohapatra, Siddharth; Min, Rui; Chadha, Aman; Chaudhary, Divya
contents	Current evaluation of mathematical reasoning in language models relies primarily on answer accuracy, potentially masking fundamental failures in logical computation. We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching through four complementary axes: forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness. Through a case study applying this framework to Qwen3-0.6B on the MenatQA dataset, we reveal a striking disconnect between surface performance and reasoning fidelity. While the model achieves reasonable answer accuracy (70%+), it demonstrates poor backward consistency (15%), limited transitivity coverage (32.2%), and brittle sensitivity to perturbations. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics, suggesting that this small model relies heavily on pattern matching rather than genuine logical computation. While our empirical findings are based on a single 600M-parameter model, the diagnostic framework itself is model-agnostic and generalizable. We release our evaluation protocols to enable the research community to assess reasoning fidelity across different model scales and architectures, moving beyond surface-level accuracy toward verifiable mathematical reasoning.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_00552
institution	arXiv
publishDate	2025
record_format	arxiv
title	Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity
topic	Computation and Language
url	https://arxiv.org/abs/2512.00552