Staff View: IRB: Automated Generation of Robust Factuality Benchmarks :: Library Catalog

Bibliographic Details
Main Authors:	Do, Lam Thanh; Taleka, Bhagyashree; Bhutta, Hozaifa Ammar; Mailthody, Vikram Sharma; Chang, Kevin Chen-Chuan; Hwu, Wen-mei
Format:	Preprint
Published:	2026
Subjects:	Information Retrieval
Online Access:	https://arxiv.org/abs/2602.08070

_version_	1866915784927739904
author	Do, Lam Thanh; Taleka, Bhagyashree; Bhutta, Hozaifa Ammar; Mailthody, Vikram Sharma; Chang, Kevin Chen-Chuan; Hwu, Wen-mei
author_facet	Do, Lam Thanh; Taleka, Bhagyashree; Bhutta, Hozaifa Ammar; Mailthody, Vikram Sharma; Chang, Kevin Chen-Chuan; Hwu, Wen-mei
contents	Static benchmarks for RAG systems often suffer from rapid saturation and require significant manual effort to maintain robustness. To address this, we present IRB, a framework for automatically generating benchmarks to evaluate the factuality of RAG systems. IRB employs a structured generation pipeline utilizing a factual scaffold and an algorithmic scaffold. We use IRB to construct a benchmark and evaluate frontier LLMs and retrievers. Our results demonstrate that IRB poses a significant challenge for frontier LLMs in the closed-book setting. Furthermore, our evaluation suggests that reasoning LLMs are more reliable, and that improving the retrieval component may yield more cost-effective gains in RAG system correctness than scaling the generator.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_08070
institution	arXiv
publishDate	2026
record_format	arxiv
title	IRB: Automated Generation of Robust Factuality Benchmarks
topic	Information Retrieval
url	https://arxiv.org/abs/2602.08070
