Bibliographic Details
Main Authors: Shen, Haiyang, Wang, Jiuzheng, Guo, Taian, Liu, Mugeng, Jing, Wenchun, Pan, Chongyang, Zhong, Siqi, Chen, Zhiyang, Bi, Weichen, Han, Yudong, Bai, Xiaoying, Ma, Yun
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2605.21413
_version_ 1866916035076030464
author Shen, Haiyang
Wang, Jiuzheng
Guo, Taian
Liu, Mugeng
Jing, Wenchun
Pan, Chongyang
Zhong, Siqi
Chen, Zhiyang
Bi, Weichen
Han, Yudong
Bai, Xiaoying
Ma, Yun
contents As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review one another's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%, and the best-performing system, GPT-5.5, reaches a 57.58% pass rate. The failures are educationally useful because they show how fluent, source-backed answers can still miss the right query, source, term, or evidence standard. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work. The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.
format Preprint
id arxiv_https___arxiv_org_abs_2605_21413
institution arXiv
publishDate 2026
record_format arxiv
title Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
topic Artificial Intelligence
url https://arxiv.org/abs/2605.21413