Staff View: Language Model Re-rankers are Fooled by Lexical Similarities :: Library Catalog

Bibliographic Details
Main Authors:	Hagström, Lovisa; Nie, Ercong; Halifa, Ruben; Schmid, Helmut; Johansson, Richard; Junge, Alexander
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2502.17036

_version_	1866911019991826432
author	Hagström, Lovisa; Nie, Ercong; Halifa, Ruben; Schmid, Helmut; Johansson, Richard; Junge, Alexander
author_facet	Hagström, Lovisa; Nie, Ercong; Halifa, Ruben; Schmid, Helmut; Johansson, Richard; Junge, Alexander
contents	Language model (LM) re-rankers are used to refine retrieval results for retrieval-augmented generation (RAG). They are more expensive than lexical matching methods like BM25 but assumed to better process semantic information and the relations between the query and the retrieved answers. To understand whether LM re-rankers always live up to this assumption, we evaluate 6 different LM re-rankers on the NQ, LitQA2 and DRUID datasets. Our results show that LM re-rankers struggle to outperform a simple BM25 baseline on DRUID. Leveraging a novel separation metric based on BM25 scores, we explain and identify re-ranker errors stemming from lexical dissimilarities. We also investigate different methods to improve LM re-ranker performance and find these methods mainly useful for NQ. Taken together, our work identifies and explains weaknesses of LM re-rankers and points to the need for more adversarial and realistic datasets for their evaluation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_17036
institution	arXiv
publishDate	2025
record_format	arxiv
title	Language Model Re-rankers are Fooled by Lexical Similarities
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2502.17036
