Bibliographic Details
Main Authors: Chen, Eason, Judicke, Sophia, Beigh, Kayla, Tang, Xinyi, Wang, Isabel, Yuan, Nina, Xiao, Zimo, Li, Chuangji, Li, Shizhuo, Luttmer, Reed, Singh, Shreya, Yampolsky, Maria, Parikh, Naman, Zhao, Yvonne, Chen, Meiyi, Huang, Scarlett, Mohanty, Anishka, Johnson, Gregory, Mackey, John, Lin, Jionghao, Koedinger, Ken
Format: Preprint
Published: 2026
Subjects: Human-Computer Interaction, Artificial Intelligence, Computers and Society
Online Access: https://arxiv.org/abs/2602.18807
author Chen, Eason
Judicke, Sophia
Beigh, Kayla
Tang, Xinyi
Wang, Isabel
Yuan, Nina
Xiao, Zimo
Li, Chuangji
Li, Shizhuo
Luttmer, Reed
Singh, Shreya
Yampolsky, Maria
Parikh, Naman
Zhao, Yvonne
Chen, Meiyi
Huang, Scarlett
Mohanty, Anishka
Johnson, Gregory
Mackey, John
Lin, Jionghao
Koedinger, Ken
contents We evaluate GPTutor, an LLM-powered tutoring system for an undergraduate discrete mathematics course. It integrates two LLM-supported tools: a structured proof-review tool that provides embedded feedback on students' written proof attempts, and a chatbot for math questions. In a staggered-access study with 148 students, earlier access was associated with higher homework performance during the interval when only the experimental group could use the system, but we did not observe this gain transferring to exam scores. Usage logs show that students with lower self-efficacy and lower prior exam performance used both components more frequently. Session-level behavioral labels, produced by human coding and scaled using an automated classifier, characterize how students engaged with the chatbot (e.g., answer-seeking or help-seeking). In models controlling for prior performance and self-efficacy, higher chatbot usage and answer-seeking behavior were negatively associated with subsequent midterm performance, whereas proof-review usage showed no detectable independent association. Together, the findings suggest that chatbot-based support alone may not reliably support transfer to independent assessment of math proof-learning outcomes, whereas work-anchored, structured feedback appears less associated with reduced learning.
format Preprint
id arxiv_https___arxiv_org_abs_2602_18807
institution arXiv
publishDate 2026
record_format arxiv
title Chat-Based Support Alone May Not Be Enough: Comparing Conversational and Embedded LLM Feedback for Mathematical Proof Learning
topic Human-Computer Interaction
Artificial Intelligence
Computers and Society
url https://arxiv.org/abs/2602.18807