Bibliographic Details
Main Authors: Sahoo, Subramanyam; Jain, Vinija; Vats, Saanidhya; Mohapatra, Siddharth; Min, Rui; Chadha, Aman; Chaudhary, Divya
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2512.00552
author Sahoo, Subramanyam
Jain, Vinija
Vats, Saanidhya
Mohapatra, Siddharth
Min, Rui
Chadha, Aman
Chaudhary, Divya
contents Current evaluation of mathematical reasoning in language models relies primarily on answer accuracy, potentially masking fundamental failures in logical computation. We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching through four complementary axes: forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness. Through a case study applying this framework to Qwen3-0.6B on the MenatQA dataset, we reveal a striking disconnect between surface performance and reasoning fidelity. While the model achieves reasonable answer accuracy (70%+), it demonstrates poor backward consistency (15%), limited transitivity coverage (32.2%), and brittle sensitivity to perturbations. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics, suggesting that this small model relies heavily on pattern matching rather than genuine logical computation. While our empirical findings are based on a single 600M-parameter model, the diagnostic framework itself is model-agnostic and generalizable. We release our evaluation protocols to enable the research community to assess reasoning fidelity across different model scales and architectures, moving beyond surface-level accuracy toward verifiable mathematical reasoning.
format Preprint
id arxiv_https___arxiv_org_abs_2512_00552
institution arXiv
publishDate 2025
record_format arxiv
title Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity
topic Computation and Language
url https://arxiv.org/abs/2512.00552