| Main Authors: | Guo, Jing; Li, Nan; Xu, Ming |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2501.06277 |
Similar Items
DomainRAG: A Chinese Benchmark for Evaluating Domain-specific Retrieval-Augmented Generation
by: Wang, Shuting, et al.
Published: (2024)
Benchmarking large language models for biomedical natural language processing applications and recommendations
by: Chen, Qingyu, et al.
Published: (2023)
Generating clickbait spoilers with an ensemble of large language models
by: Woźny, Mateusz, et al.
Published: (2024)
IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval
by: Song, Tingyu, et al.
Published: (2025)
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework
by: Zhu, Kunlun, et al.
Published: (2024)
Generating Diverse Q&A Benchmarks for RAG Evaluation with DataMorgana
by: Filice, Simone, et al.
Published: (2025)
Evaluating the Robustness of Retrieval-Augmented Generation to Adversarial Evidence in the Health Domain
by: Amirshahi, Shakiba, et al.
Published: (2025)
Multilingual and Domain-Agnostic Tip-of-the-Tongue Query Generation for Simulated Evaluation
by: He, Xuhong, et al.
Published: (2026)
A systematic review of geospatial location embedding approaches in large language models: A path to spatial AI systems
by: Tucker, Sean
Published: (2024)
Had enough of experts? Quantitative knowledge retrieval from large language models
by: Selby, David, et al.
Published: (2024)
Extracting chemical food safety hazards from the scientific literature automatically using large language models
by: Özen, Neris, et al.
Published: (2024)
LegalAgentBench: Evaluating LLM Agents in Legal Domain
by: Li, Haitao, et al.
Published: (2024)
Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation
by: Liu, Qijiong, et al.
Published: (2025)
Building Russian Benchmark for Evaluation of Information Retrieval Models
by: Kovalev, Grigory, et al.
Published: (2025)
High-performance automated abstract screening with large language model ensembles
by: Sanghera, Rohan, et al.
Published: (2024)
SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation
by: Su, Weihang, et al.
Published: (2025)
On the Evaluation of Machine-Generated Reports
by: Mayfield, James, et al.
Published: (2024)
BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
by: Chen, Zijian, et al.
Published: (2025)
BioPulse-QA: A Dynamic Biomedical Question-Answering Benchmark for Evaluating Factuality, Robustness, and Bias in Large Language Models
by: Bhattarai, Kriti, et al.
Published: (2026)
Navigating Through Paper Flood: Advancing LLM-based Paper Evaluation through Domain-Aware Retrieval and Latent Reasoning
by: Zheng, Wuqiang, et al.
Published: (2025)
Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI
by: Singh, Saurabh K., et al.
Published: (2026)
Answering real-world clinical questions using large language model based systems
by: Low, Yen Sia, et al.
Published: (2024)
On Synthetic Data Strategies for Domain-Specific Generative Retrieval
by: Wen, Haoyang, et al.
Published: (2025)
Towards Personalized Deep Research: Benchmarks and Evaluations
by: Liang, Yuan, et al.
Published: (2025)
Evaluating Generative Ad Hoc Information Retrieval
by: Gienapp, Lukas, et al.
Published: (2023)
Evaluating Retrieval Quality in Retrieval-Augmented Generation
by: Salemi, Alireza, et al.
Published: (2024)
Evaluating Factual Density in Multi-Source RAG: A Study in Medical AI Accuracy
by: DeMarco, Michael R.
Published: (2026)
Evaluating Robustness of Generative Search Engine on Adversarial Factual Questions
by: Hu, Xuming, et al.
Published: (2024)
HintEval: A Comprehensive Framework for Hint Generation and Evaluation for Questions
by: Mozafari, Jamshid, et al.
Published: (2025)
A Benchmark for Open-Domain Numerical Fact-Checking Enhanced by Claim Decomposition
by: Venktesh, V, et al.
Published: (2025)
PARSE: An Open-Domain Reasoning Question Answering Benchmark for Persian
by: Mozafari, Jamshid, et al.
Published: (2026)
Large language models are good medical coders, if provided with tools
by: Kwan, Keith
Published: (2024)
RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation
by: Murugaraj, Keerthana, et al.
Published: (2025)
Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing
by: Song, Tingyu, et al.
Published: (2026)
DoGMaTiQ: Automated Generation of Question-and-Answer Nuggets for Report Evaluation
by: Li, Bryan, et al.
Published: (2026)
AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
by: Chen, Jianlyu, et al.
Published: (2024)
Still Fresh? Evaluating Temporal Drift in Retrieval Benchmarks
by: Kuissi, Nathan, et al.
Published: (2026)
REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering
by: Wang, Yuhao, et al.
Published: (2024)
Can Synthetic Query Rewrites Capture User Intent Better than Humans in Retrieval-Augmented Generation?
by: Zheng, JiaYing, et al.
Published: (2025)
Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration
by: Dai, Sunhao, et al.
Published: (2024)