Library Catalog

Bibliographic Details
Main Authors:	Guo, Jing; Li, Nan; Xu, Ming
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Information Retrieval
Online Access:	https://arxiv.org/abs/2501.06277

Similar Items

DomainRAG: A Chinese Benchmark for Evaluating Domain-specific Retrieval-Augmented Generation
by: Wang, Shuting, et al.
Published: (2024)

Benchmarking large language models for biomedical natural language processing applications and recommendations
by: Chen, Qingyu, et al.
Published: (2023)

Generating clickbait spoilers with an ensemble of large language models
by: Woźny, Mateusz, et al.
Published: (2024)

IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval
by: Song, Tingyu, et al.
Published: (2025)

RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework
by: Zhu, Kunlun, et al.
Published: (2024)

Generating Diverse Q&A Benchmarks for RAG Evaluation with DataMorgana
by: Filice, Simone, et al.
Published: (2025)

Evaluating the Robustness of Retrieval-Augmented Generation to Adversarial Evidence in the Health Domain
by: Amirshahi, Shakiba, et al.
Published: (2025)

Multilingual and Domain-Agnostic Tip-of-the-Tongue Query Generation for Simulated Evaluation
by: He, Xuhong, et al.
Published: (2026)

A systematic review of geospatial location embedding approaches in large language models: A path to spatial AI systems
by: Tucker, Sean
Published: (2024)

Had enough of experts? Quantitative knowledge retrieval from large language models
by: Selby, David, et al.
Published: (2024)

Extracting chemical food safety hazards from the scientific literature automatically using large language models
by: Özen, Neris, et al.
Published: (2024)

LegalAgentBench: Evaluating LLM Agents in Legal Domain
by: Li, Haitao, et al.
Published: (2024)

Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation
by: Liu, Qijiong, et al.
Published: (2025)

Building Russian Benchmark for Evaluation of Information Retrieval Models
by: Kovalev, Grigory, et al.
Published: (2025)

High-performance automated abstract screening with large language model ensembles
by: Sanghera, Rohan, et al.
Published: (2024)

SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation
by: Su, Weihang, et al.
Published: (2025)

On the Evaluation of Machine-Generated Reports
by: Mayfield, James, et al.
Published: (2024)

BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
by: Chen, Zijian, et al.
Published: (2025)

BioPulse-QA: A Dynamic Biomedical Question-Answering Benchmark for Evaluating Factuality, Robustness, and Bias in Large Language Models
by: Bhattarai, Kriti, et al.
Published: (2026)

Navigating Through Paper Flood: Advancing LLM-based Paper Evaluation through Domain-Aware Retrieval and Latent Reasoning
by: Zheng, Wuqiang, et al.
Published: (2025)

Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI
by: Singh, Saurabh K., et al.
Published: (2026)

Answering real-world clinical questions using large language model based systems
by: Low, Yen Sia, et al.
Published: (2024)

On Synthetic Data Strategies for Domain-Specific Generative Retrieval
by: Wen, Haoyang, et al.
Published: (2025)

Towards Personalized Deep Research: Benchmarks and Evaluations
by: Liang, Yuan, et al.
Published: (2025)

Evaluating Generative Ad Hoc Information Retrieval
by: Gienapp, Lukas, et al.
Published: (2023)

Evaluating Retrieval Quality in Retrieval-Augmented Generation
by: Salemi, Alireza, et al.
Published: (2024)

Evaluating Factual Density in Multi-Source RAG: A Study in Medical AI Accuracy
by: DeMarco, Michael R.
Published: (2026)

Evaluating Robustness of Generative Search Engine on Adversarial Factual Questions
by: Hu, Xuming, et al.
Published: (2024)

HintEval: A Comprehensive Framework for Hint Generation and Evaluation for Questions
by: Mozafari, Jamshid, et al.
Published: (2025)

A Benchmark for Open-Domain Numerical Fact-Checking Enhanced by Claim Decomposition
by: Venktesh, V, et al.
Published: (2025)

PARSE: An Open-Domain Reasoning Question Answering Benchmark for Persian
by: Mozafari, Jamshid, et al.
Published: (2026)

Large language models are good medical coders, if provided with tools
by: Kwan, Keith
Published: (2024)

RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation
by: Murugaraj, Keerthana, et al.
Published: (2025)

Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing
by: Song, Tingyu, et al.
Published: (2026)

DoGMaTiQ: Automated Generation of Question-and-Answer Nuggets for Report Evaluation
by: Li, Bryan, et al.
Published: (2026)

AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
by: Chen, Jianlyu, et al.
Published: (2024)

Still Fresh? Evaluating Temporal Drift in Retrieval Benchmarks
by: Kuissi, Nathan, et al.
Published: (2026)

REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering
by: Wang, Yuhao, et al.
Published: (2024)

Can Synthetic Query Rewrites Capture User Intent Better than Humans in Retrieval-Augmented Generation?
by: Zheng, JiaYing, et al.
Published: (2025)

Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration
by: Dai, Sunhao, et al.
Published: (2024)