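| Main Authors: | Do, Lam Thanh; Taleka, Bhagyashree; Bhutta, Hozaifa Ammar; Mailthody, Vikram Sharma; Chang, Kevin Chen-Chuan; Hwu, Wen-mei |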
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Information Retrieval |
| Online Access: | https://arxiv.org/abs/2602.08070 |
| _version_ | 1866915784927739904 |
|---|---|
| author | Do, Lam Thanh; Taleka, Bhagyashree; Bhutta, Hozaifa Ammar; Mailthody, Vikram Sharma; Chang, Kevin Chen-Chuan; Hwu, Wen-mei |
| author_facet | Do, Lam Thanh; Taleka, Bhagyashree; Bhutta, Hozaifa Ammar; Mailthody, Vikram Sharma; Chang, Kevin Chen-Chuan; Hwu, Wen-mei |
| contents | Static benchmarks for RAG systems often suffer from rapid saturation and require significant manual effort to maintain robustness. To address this, we present IRB, a framework for automatically generating benchmarks to evaluate the factuality of RAG systems. IRB employs a structured generation pipeline built on a *factual scaffold* and an *algorithmic scaffold*. We use IRB to construct a benchmark and evaluate frontier LLMs and retrievers. Our results demonstrate that IRB poses a significant challenge for frontier LLMs in the closed-book setting. Furthermore, our evaluation suggests that reasoning LLMs are more reliable, and that improving the retrieval component may yield more cost-effective gains in RAG system correctness than scaling the generator. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_08070 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | IRB: Automated Generation of Robust Factuality Benchmarks |
| topic | Information Retrieval |
| url | https://arxiv.org/abs/2602.08070 |