Staff View: SourceBench: Can AI Answers Reference Quality Web Sources? :: Library Catalog

Bibliographic Details
Main Authors:	Jin, Hexi; Liu, Stephen; Li, Yuheng; Malik, Simran; Zhang, Yiying
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.16942

_version_	1866912912814112768
author	Jin, Hexi; Liu, Stephen; Li, Yuheng; Malik, Simran; Zhang, Yiying
author_facet	Jin, Hexi; Liu, Stephen; Li, Yuheng; Malik, Simran; Zhang, Yiying
contents	Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence quality. We introduce SourceBench, a benchmark for measuring the quality of cited web sources across 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents. SourceBench uses an eight-metric framework covering content quality (content relevance, factual accuracy, objectivity) and page-level signals (e.g., freshness, authority/accountability, clarity), and includes a human-labeled dataset with a calibrated LLM-based evaluator that matches expert judgments closely. We evaluate eight LLMs, Google Search, and three AI search tools over 3996 cited sources using SourceBench and conduct further experiments to understand the evaluation results. Overall, our work reveals four key new insights that can guide future research in the direction of GenAI and web search.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_16942
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	SourceBench: Can AI Answers Reference Quality Web Sources? Jin, Hexi Liu, Stephen Li, Yuheng Malik, Simran Zhang, Yiying Artificial Intelligence Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence quality. We introduce SourceBench, a benchmark for measuring the quality of cited web sources across 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents. SourceBench uses an eight-metric framework covering content quality (content relevance, factual accuracy, objectivity) and page-level signals (e.g., freshness, authority/accountability, clarity), and includes a human-labeled dataset with a calibrated LLM-based evaluator that matches expert judgments closely. We evaluate eight LLMs, Google Search, and three AI search tools over 3996 cited sources using SourceBench and conduct further experiments to understand the evaluation results. Overall, our work reveals four key new insights that can guide future research in the direction of GenAI and web search.
title	SourceBench: Can AI Answers Reference Quality Web Sources?
topic	Artificial Intelligence
url	https://arxiv.org/abs/2602.16942

Similar Items