Bibliographic Details
Main Author: Carranza, Juan Miguel Navarro
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2510.08616
author Carranza, Juan Miguel Navarro
author_facet Carranza, Juan Miguel Navarro
contents Benchmark scores for Large Language Models (LLMs) can be inflated by memorization of test items or near duplicates. We present a simple protocol that probes generalization by re-evaluating models on paraphrased versions of benchmark questions. Using Mistral-7B-Instruct and Qwen2.5-7B-Instruct, we measure the accuracy gap between original and paraphrased items on ARC-Easy and ARC-Challenge. Our pipeline controls decoding, enforces a multiple-choice output format, and includes a robust paraphrase-cleaning step to preserve semantics. We find that paraphrasing induces a non-trivial accuracy drop (original vs. paraphrased), consistent with prior concerns about contamination and brittle surface-form shortcuts.
format Preprint
id arxiv_https___arxiv_org_abs_2510_08616
institution arXiv
publishDate 2025
record_format arxiv
title LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests
topic Computation and Language
url https://arxiv.org/abs/2510.08616
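The abstract describes measuring an accuracy gap between original and paraphrased items while enforcing a multiple-choice output format. A minimal sketch of that scoring step follows. It is not the author's pipeline: the function names (`extract_choice`, `accuracy`, `paraphrase_gap`), the A-D label set, and the rule that a reply must begin with the answer letter are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence, Tuple


def extract_choice(reply: str, labels: str = "ABCD") -> Optional[str]:
    """Return the answer letter if the reply starts with one, else None.

    Assumed format rule: optional whitespace, optional parentheses, one
    label, then whitespace, punctuation, or end of string.
    """
    pattern = rf"\s*\(?([{re.escape(labels)}])\)?(?:[\s.:,)]|$)"
    match = re.match(pattern, reply)
    return match.group(1) if match else None


def accuracy(predictions: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Fraction of predictions equal to the gold label (None counts as wrong)."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty item set")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def paraphrase_gap(
    original_replies: Sequence[str],
    paraphrased_replies: Sequence[str],
    gold: Sequence[str],
) -> Tuple[float, float, float]:
    """Return (original accuracy, paraphrased accuracy, gap) for aligned items."""
    orig_acc = accuracy([extract_choice(r) for r in original_replies], gold)
    para_acc = accuracy([extract_choice(r) for r in paraphrased_replies], gold)
    return orig_acc, para_acc, orig_acc - para_acc


if __name__ == "__main__":
    # Toy replies invented for this example; they are not results from the paper.
    gold = ["A", "B", "C", "D"]
    original = ["A", "B.", "(C)", "D"]
    paraphrased = ["A", "C", "(C) because plants need light", "I am not sure"]
    print(paraphrase_gap(original, paraphrased, gold))
```

Replies that do not follow the assumed format are scored as incorrect rather than dropped, so both accuracies are computed over the same item set and the gap compares like with like.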