Staff View: LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests :: Library Catalog

Bibliographic Details
Main Author:	Carranza, Juan Miguel Navarro
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2510.08616

_version_	1866914085442945024
author	Carranza, Juan Miguel Navarro
author_facet	Carranza, Juan Miguel Navarro
contents	Benchmark scores for Large Language Models (LLMs) can be inflated by memorization of test items or near duplicates. We present a simple protocol that probes generalization by re-evaluating models on paraphrased versions of benchmark questions. Using Mistral-7B-Instruct and Qwen2.5-7B-Instruct, we measure the accuracy gap between original and paraphrased items on ARC-Easy and ARC-Challenge. Our pipeline controls decoding, enforces multiple-choice output format, and includes a robust paraphrase-cleaning step to preserve semantics. We find that paraphrasing induces a non-trivial accuracy drop (original vs. paraphrased), consistent with prior concerns about contamination and brittle surface-form shortcuts.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_08616
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests Carranza, Juan Miguel Navarro Computation and Language Benchmark scores for Large Language Models (LLMs) can be inflated by memorization of test items or near duplicates. We present a simple protocol that probes generalization by re-evaluating models on paraphrased versions of benchmark questions. Using Mistral-7B-Instruct and Qwen2.5-7B-Instruct, we measure the accuracy gap between original and paraphrased items on ARC-Easy and ARC-Challenge. Our pipeline controls decoding, enforces multiple-choice output format, and includes a robust paraphrase-cleaning step to preserve semantics. We find that paraphrasing induces a non-trivial accuracy drop (original vs. paraphrased), consistent with prior concerns about contamination and brittle surface-form shortcuts.
title	LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests
topic	Computation and Language
url	https://arxiv.org/abs/2510.08616
