Staff View :: Library Catalog


Bibliographic Details
Main Authors:	Zheng, Xinyue; Lin, Haowei; Cai, Shaofei; Zheng, Zilong; Yang, Yaodong; Liang, Yitao
Format:	Preprint
Published:	2025
Subjects:	Software Engineering
Online Access:	https://arxiv.org/abs/2510.17868

_version_	1866910021848137728
author	Zheng, Xinyue; Lin, Haowei; Cai, Shaofei; Zheng, Zilong; Yang, Yaodong; Liang, Yitao
author_facet	Zheng, Xinyue; Lin, Haowei; Cai, Shaofei; Zheng, Zilong; Yang, Yaodong; Liang, Yitao
contents	Current coding benchmarks often inflate Large Language Model (LLM) capabilities due to static paradigms and data contamination, enabling models to exploit statistical shortcuts rather than genuine reasoning. To address this, we introduce UniCode, a generative evaluation framework that systematically probes LLM limits via: (1) multi-dimensional augmentation transforming seed problems into complex variations to disrupt fixed algorithmic patterns; (2) a highly reliable, automated test generation pipeline for scalable evaluation; and (3) fine-grained metrics for rich error signals. Experiments reveal a 31.2% performance collapse in state-of-the-art models on UniCode, primarily driven by deficiencies in conceptual modeling and scalability reasoning rather than syntactic errors. Furthermore, we uncover a seed-problem regression where models revert to memorized seed logic rather than following new specifications, signaling a reliance on shortcuts over reasoning. This work validates UniCode as a robust framework to expose model fragility and foster reasoning-oriented code intelligence.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_17868
institution	arXiv
publishDate	2025
record_format	arxiv
title	UniCode: Augmenting Evaluation for Code Reasoning
topic	Software Engineering
url	https://arxiv.org/abs/2510.17868
