Bibliographic Details
Main Authors: Jin, Hexi; Liu, Stephen; Li, Yuheng; Malik, Simran; Zhang, Yiying
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2602.16942
author Jin, Hexi
Liu, Stephen
Li, Yuheng
Malik, Simran
Zhang, Yiying
contents Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence quality. We introduce SourceBench, a benchmark for measuring the quality of cited web sources across 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents. SourceBench uses an eight-metric framework covering content quality (content relevance, factual accuracy, objectivity) and page-level signals (e.g., freshness, authority/accountability, clarity), and includes a human-labeled dataset with a calibrated LLM-based evaluator that matches expert judgments closely. We evaluate eight LLMs, Google Search, and three AI search tools over 3996 cited sources using SourceBench and conduct further experiments to understand the evaluation results. Overall, our work reveals four key new insights that can guide future research in the direction of GenAI and web search.
format Preprint
id arxiv_https___arxiv_org_abs_2602_16942
institution arXiv
publishDate 2026
record_format arxiv
title SourceBench: Can AI Answers Reference Quality Web Sources?
topic Artificial Intelligence
url https://arxiv.org/abs/2602.16942