
Bibliographic Details
Main Authors:	Shen, Haiyang; Wang, Jiuzheng; Guo, Taian; Liu, Mugeng; Jing, Wenchun; Pan, Chongyang; Zhong, Siqi; Chen, Zhiyang; Bi, Weichen; Han, Yudong; Bai, Xiaoying; Ma, Yun
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.21413

_version_	1866916035076030464
author	Shen, Haiyang; Wang, Jiuzheng; Guo, Taian; Liu, Mugeng; Jing, Wenchun; Pan, Chongyang; Zhong, Siqi; Chen, Zhiyang; Bi, Weichen; Han, Yudong; Bai, Xiaoying; Ma, Yun
author_facet	Shen, Haiyang; Wang, Jiuzheng; Guo, Taian; Liu, Mugeng; Jing, Wenchun; Pan, Chongyang; Zhong, Siqi; Chen, Zhiyang; Bi, Weichen; Han, Yudong; Bai, Xiaoying; Ma, Yun
contents	As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review one another's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%, and the best-performing system, GPT-5.5, reaches a 57.58% pass rate. The failures are educationally useful because they show how fluent, source-backed answers can still miss the right query, source, term, or evidence standard. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work. The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_21413
institution	arXiv
publishDate	2026
record_format	arxiv
title	Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
topic	Artificial Intelligence
url	https://arxiv.org/abs/2605.21413
