Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Abdallah, Abdelrahman, Holdcroft, Jamie, Ali, Mohammed, Jatowt, Adam
Format:	Preprint
Published:	2026
Subjects:	Information Retrieval
Online Access:	https://arxiv.org/abs/2604.03676
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866914445745192960
author	Abdallah, Abdelrahman Holdcroft, Jamie Ali, Mohammed Jatowt, Adam
author_facet	Abdallah, Abdelrahman Holdcroft, Jamie Ali, Mohammed Jatowt, Adam
contents	Large language model retrievers improve performance on complex queries, but their practical value depends on efficiency, robustness, and reliable confidence signals in addition to accuracy. We reproduce a reasoning-intensive retrieval benchmark (BRIGHT) across 12 tasks and 14 retrievers, and extend evaluation with cold-start indexing cost, query latency distributions and throughput, corpus scaling, robustness to controlled query perturbations, and confidence use (AUROC) for predicting query success. We also quantify \emph{reasoning overhead} by comparing standard queries to five provided reasoning-augmented variants, measuring accuracy gains relative to added latency. We find that some reasoning-specialized retrievers achieve strong effectiveness while remaining competitive in throughput, whereas several large LLM-based bi-encoders incur substantial latency for modest gains. Reasoning augmentation incurs minimal latency for sub-1B encoders but exhibits diminishing returns for top retrievers and may reduce performance on formal math/code domains. Confidence calibration is consistently weak across model families, indicating that raw retrieval scores are unreliable for downstream routing without additional calibration. We release all code and artifacts for reproducibility.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_03676
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Are LLM-Based Retrievers Worth Their Cost? An Empirical Study of Efficiency, Robustness, and Reasoning Overhead Abdallah, Abdelrahman Holdcroft, Jamie Ali, Mohammed Jatowt, Adam Information Retrieval Large language model retrievers improve performance on complex queries, but their practical value depends on efficiency, robustness, and reliable confidence signals in addition to accuracy. We reproduce a reasoning-intensive retrieval benchmark (BRIGHT) across 12 tasks and 14 retrievers, and extend evaluation with cold-start indexing cost, query latency distributions and throughput, corpus scaling, robustness to controlled query perturbations, and confidence use (AUROC) for predicting query success. We also quantify \emph{reasoning overhead} by comparing standard queries to five provided reasoning-augmented variants, measuring accuracy gains relative to added latency. We find that some reasoning-specialized retrievers achieve strong effectiveness while remaining competitive in throughput, whereas several large LLM-based bi-encoders incur substantial latency for modest gains. Reasoning augmentation incurs minimal latency for sub-1B encoders but exhibits diminishing returns for top retrievers and may reduce performance on formal math/code domains. Confidence calibration is consistently weak across model families, indicating that raw retrieval scores are unreliable for downstream routing without additional calibration. We release all code and artifacts for reproducibility.
title	Are LLM-Based Retrievers Worth Their Cost? An Empirical Study of Efficiency, Robustness, and Reasoning Overhead
topic	Information Retrieval
url	https://arxiv.org/abs/2604.03676

Similar Items