Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Staudinger, Moritz, Kusa, Wojciech, Hanbury, Allan
Format:	Preprint
Published:	2026
Subjects:	Information Retrieval
Online Access:	https://arxiv.org/abs/2604.05766
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866914451918159872
author	Staudinger, Moritz Kusa, Wojciech Hanbury, Allan
author_facet	Staudinger, Moritz Kusa, Wojciech Hanbury, Allan
contents	Benchmark collections have long enabled controlled comparison and cumulative progress in Information Retrieval (IR). However, prior meta-analyses have shown that reported effectiveness gains often fail to accumulate, in part due to the use of weak or outdated baselines. While large language models are increasingly used in retrieval pipelines, their impact on established IR benchmarks has not been systematically analyzed. In this study, we analyze 143 publications reporting results on the TREC Robust04 collection and the TREC Deep Learning 2020 (DL20) passage retrieval benchmark to examine longitudinal trends in retrieval effectiveness and baseline strength. We observe what we term an \emph{LLM effect}: recent systems incorporating LLM components achieve 8.8\% higher nDCG@10 on DL20 compared to the best result from TREC 2020 and approximately 20\% higher on Robust04 since 2023. However, adapting a data contamination detection approach to reranking reveals measurable contamination in both benchmarks. While excluding contaminated topics reduces effectiveness, confidence intervals remain wide, making it difficult to determine whether the LLM effect reflects genuine methodological advances or memorization from pretraining data.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_05766
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	The LLM Effect on IR Benchmarks: A Meta-Analysis of Effectiveness, Baselines, and Contamination Staudinger, Moritz Kusa, Wojciech Hanbury, Allan Information Retrieval Benchmark collections have long enabled controlled comparison and cumulative progress in Information Retrieval (IR). However, prior meta-analyses have shown that reported effectiveness gains often fail to accumulate, in part due to the use of weak or outdated baselines. While large language models are increasingly used in retrieval pipelines, their impact on established IR benchmarks has not been systematically analyzed. In this study, we analyze 143 publications reporting results on the TREC Robust04 collection and the TREC Deep Learning 2020 (DL20) passage retrieval benchmark to examine longitudinal trends in retrieval effectiveness and baseline strength. We observe what we term an \emph{LLM effect}: recent systems incorporating LLM components achieve 8.8\% higher nDCG@10 on DL20 compared to the best result from TREC 2020 and approximately 20\% higher on Robust04 since 2023. However, adapting a data contamination detection approach to reranking reveals measurable contamination in both benchmarks. While excluding contaminated topics reduces effectiveness, confidence intervals remain wide, making it difficult to determine whether the LLM effect reflects genuine methodological advances or memorization from pretraining data.
title	The LLM Effect on IR Benchmarks: A Meta-Analysis of Effectiveness, Baselines, and Contamination
topic	Information Retrieval
url	https://arxiv.org/abs/2604.05766

Similar Items