Bibliographic Details
Main Authors: Do, Lam Thanh, Taleka, Bhagyashree, Bhutta, Hozaifa Ammar, Mailthody, Vikram Sharma, Chang, Kevin Chen-Chuan, Hwu, Wen-mei
Format: Preprint
Published: 2026
Subjects: Information Retrieval
Online Access: https://arxiv.org/abs/2602.08070
author Do, Lam Thanh
Taleka, Bhagyashree
Bhutta, Hozaifa Ammar
Mailthody, Vikram Sharma
Chang, Kevin Chen-Chuan
Hwu, Wen-mei
contents Static benchmarks for retrieval-augmented generation (RAG) systems often saturate rapidly and require significant manual effort to keep robust. To address this, we present IRB, a framework for automatically generating benchmarks that evaluate the factuality of RAG systems. IRB employs a structured generation pipeline built on a factual scaffold and an algorithmic scaffold. We use IRB to construct a benchmark and evaluate frontier LLMs and retrievers. Our results show that IRB poses a significant challenge for frontier LLMs in the closed-book setting. Furthermore, our evaluation suggests that reasoning LLMs are more reliable, and that improving the retrieval component may yield more cost-effective gains in RAG system correctness than scaling the generator.
format Preprint
id arxiv_https___arxiv_org_abs_2602_08070
institution arXiv
publishDate 2026
record_format arxiv
title IRB: Automated Generation of Robust Factuality Benchmarks
topic Information Retrieval
url https://arxiv.org/abs/2602.08070