Affichage MARC: :: Library Catalog

Enregistré dans:

Détails bibliographiques
Auteurs principaux:	Kim, Dongjun, Shim, Gyuho, Chun, Yongchan, Kim, Minhyuk, Park, Chanjun, Lim, Heuiseok
Format:	Preprint
Publié:	2025
Sujets:	Computation and Language Artificial Intelligence
Accès en ligne:	https://arxiv.org/abs/2510.01232
Tags:	Ajouter un tag Pas de tags, Soyez le premier à ajouter un tag!

_version_	1866908572793700352
author	Kim, Dongjun Shim, Gyuho Chun, Yongchan Kim, Minhyuk Park, Chanjun Lim, Heuiseok
author_facet	Kim, Dongjun Shim, Gyuho Chun, Yongchan Kim, Minhyuk Park, Chanjun Lim, Heuiseok
contents	Large Language Models are commonly judged by their scores on standard benchmarks, yet such scores often overstate real capability since they mask the mix of skills a task actually demands. For example, ARC is assumed to test reasoning, while HellaSwag is designed to evaluate commonsense. However, we lack a systematic way to verify if these benchmarks actually measure these labels. We introduce Benchmark Profiling, a diagnostic framework that decomposes benchmark performance into ten cognitively grounded abilities. The method combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each ability contributes to a model's success on a given benchmark. Profiling three instruction-tuned models across ten widely used benchmarks yields four key findings: (i) most benchmarks draw on several abilities rather than one, (ii) datasets with similar labels rely on distinct ability mixtures, (iii) code-generation benchmarks reward broad, multi-skill improvement and thus show only modest gains from narrow domain-specific fine-tuning, and (iv) abilities irrelevant to the task could negatively affect performance. Benchmark Profiling therefore explains why performance gains do not always translate into user-perceived competence and offers a transparent tool for benchmark audit and model interpretability.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_01232
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks Kim, Dongjun Shim, Gyuho Chun, Yongchan Kim, Minhyuk Park, Chanjun Lim, Heuiseok Computation and Language Artificial Intelligence Large Language Models are commonly judged by their scores on standard benchmarks, yet such scores often overstate real capability since they mask the mix of skills a task actually demands. For example, ARC is assumed to test reasoning, while HellaSwag is designed to evaluate commonsense. However, we lack a systematic way to verify if these benchmarks actually measure these labels. We introduce Benchmark Profiling, a diagnostic framework that decomposes benchmark performance into ten cognitively grounded abilities. The method combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each ability contributes to a model's success on a given benchmark. Profiling three instruction-tuned models across ten widely used benchmarks yields four key findings: (i) most benchmarks draw on several abilities rather than one, (ii) datasets with similar labels rely on distinct ability mixtures, (iii) code-generation benchmarks reward broad, multi-skill improvement and thus show only modest gains from narrow domain-specific fine-tuning, and (iv) abilities irrelevant to the task could negatively affect performance. Benchmark Profiling therefore explains why performance gains do not always translate into user-perceived competence and offers a transparent tool for benchmark audit and model interpretability.
title	Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks
topic	Computation and Language Artificial Intelligence
url	https://arxiv.org/abs/2510.01232

Documents similaires