

Bibliographic Details
Main Authors:	Guan, Batu, Wu, Xiao, Yuan, Yuanyuan, Li, Shaohua
Format:	Preprint
Published:	2025
Subjects:	Software Engineering Computation and Language
Online Access:	https://arxiv.org/abs/2503.06643

Abstract:

In this paper, we tackle a critical challenge in model evaluation: how to keep code benchmarks useful when models may have already seen them during training. We introduce a dynamic benchmarking framework to address this challenge. Given a code understanding or reasoning benchmark, our framework dynamically transforms each input program with various semantics-preserving mutations, producing a benchmark that is syntactically new yet semantically identical. We evaluated ten popular language models on our dynamic benchmarks. The evaluation reveals several surprising findings: (1) all models perform significantly worse than on the original benchmarks, (2) the relative ranking of some models shifts dramatically, and (3) our dynamic benchmarks are resistant to data contamination.
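
To make the idea of a semantics-preserving mutation concrete, here is a minimal Python sketch of one possible mutation: consistent renaming of local variables and parameters. The abstract does not say which mutations the framework applies, so the choice of renaming and every name below (`collect_bound_names`, `rename_variables`, `run`, the `var_` prefix, the sample program) are illustrative assumptions, not the authors' implementation. It relies on `ast.unparse`, which requires Python 3.9 or later.

```python
import ast

def collect_bound_names(tree):
    """Return parameter names and assigned variable names, without duplicates."""
    names = []
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        else:
            continue
        if name not in names:
            names.append(name)
    return names

def rename_variables(source):
    """Hypothetical mutation: rename every bound variable to var_0, var_1, ...

    Assumptions of this sketch: the program has no existing names starting
    with 'var_', and functions are called positionally (keyword arguments
    would still refer to the old parameter names).
    """
    tree = ast.parse(source)
    mapping = {old: f"var_{i}" for i, old in enumerate(collect_bound_names(tree))}

    class Renamer(ast.NodeTransformer):
        def visit_arg(self, node):
            # Function parameters are stored as plain strings on ast.arg nodes.
            node.arg = mapping.get(node.arg, node.arg)
            return node

        def visit_Name(self, node):
            # Builtins and function names are not in the mapping and stay as-is.
            node.id = mapping.get(node.id, node.id)
            return node

    return ast.unparse(Renamer().visit(tree))

def run(source, func_name, *args):
    """Execute a program and call one of its functions (for checking equivalence)."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](*args)

ORIGINAL = '''
def total(values):
    acc = 0
    for item in values:
        acc += item
    return acc
'''

if __name__ == "__main__":
    mutated = rename_variables(ORIGINAL)
    print(mutated)
    # The mutated program looks different but should compute the same result.
    print(run(ORIGINAL, "total", [1, 2, 3]) == run(mutated, "total", [1, 2, 3]))
```

The mutated program shares no variable names with the original, so a model that memorized the original text cannot match it verbatim, while a model that actually reasons about the code should reach the same answer on both versions.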

Similar Items