Bibliographic Details
Main Authors: Ognibene, Dimitri, Donabauer, Gregor, Theophilou, Emily, Koyuturk, Cansu, Yavari, Mona, Bursic, Sathya, Telari, Alessia, Testa, Alessia, Boiano, Raffaele, Taibi, Davide, Hernandez-Leo, Davinia, Kruschwitz, Udo, Ruskov, Martin
Format: Preprint
Published: 2025
Subjects: Computers and Society, Computation and Language
Online Access: https://arxiv.org/abs/2503.02532
_version_ 1866915182282801152
author Ognibene, Dimitri
Donabauer, Gregor
Theophilou, Emily
Koyuturk, Cansu
Yavari, Mona
Bursic, Sathya
Telari, Alessia
Testa, Alessia
Boiano, Raffaele
Taibi, Davide
Hernandez-Leo, Davinia
Kruschwitz, Udo
Ruskov, Martin
contents The use of large language model (LLM)-powered chatbots, such as ChatGPT, has become popular across various domains, supporting a range of tasks and processes. However, due to the intrinsic complexity of LLMs, effective prompting is more challenging than it may seem. This highlights the need for innovative educational and support strategies that are both widely accessible and seamlessly integrated into task workflows. Yet, LLM prompting is highly task- and domain-dependent, limiting the effectiveness of generic approaches. In this study, we explore whether LLM-based methods can facilitate learning assessments by using ad-hoc guidelines and a minimal number of annotated prompt samples. Our framework transforms these guidelines into features that can be identified within learners' prompts. Using these feature descriptions and annotated examples, we create few-shot learning detectors. We then evaluate different configurations of these detectors, testing three state-of-the-art LLMs and ensembles. We run experiments with cross-validation on a sample of original prompts, as well as tests on prompts collected from task-naive learners. Our results show how LLMs perform on feature detection. Notably, GPT-4 demonstrates strong performance on most features, while closely related models, such as GPT-3 and GPT-3.5 Turbo (Instruct), show inconsistent behaviors in feature classification. These differences highlight the need for further research into how design choices impact feature selection and prompt detection. Our findings contribute to the fields of generative AI literacy and computer-supported learning assessment, offering valuable insights for both researchers and practitioners.
format Preprint
id arxiv_https___arxiv_org_abs_2503_02532
institution arXiv
publishDate 2025
record_format arxiv
title Use Me Wisely: AI-Driven Assessment for LLM Prompting Skills Development
topic Computers and Society
Computation and Language
url https://arxiv.org/abs/2503.02532
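The abstract describes turning guidelines into feature descriptions, pairing them with a few annotated examples to form few-shot detectors, and combining detectors in ensembles. The following is a minimal sketch of that general idea, not the authors' implementation: the prompt layout, the function names `build_few_shot_prompt` and `majority_vote`, and the yes/no answer format are all assumptions made for illustration, and the detectors are plain callables standing in for calls to an LLM.

```python
from typing import Callable, List, Tuple

# Hypothetical type: a detector maps a learner prompt to True/False for one feature.
# In practice it would wrap an LLM call; here it is any callable.
Detector = Callable[[str], bool]


def build_few_shot_prompt(
    feature_name: str,
    feature_description: str,
    examples: List[Tuple[str, bool]],
    learner_prompt: str,
) -> str:
    """Assemble a few-shot classification prompt for a single feature.

    The layout (Feature / Description / Prompt / Answer) is an assumption,
    not a format taken from the paper.
    """
    lines = [
        f"Feature: {feature_name}",
        f"Description: {feature_description}",
        "Decide whether the prompt shows this feature. Answer yes or no.",
        "",
    ]
    # Each annotated example becomes one demonstration in the prompt.
    for text, label in examples:
        lines.append(f"Prompt: {text}")
        lines.append(f"Answer: {'yes' if label else 'no'}")
        lines.append("")
    # The learner's prompt goes last, with the answer left open for the model.
    lines.append(f"Prompt: {learner_prompt}")
    lines.append("Answer:")
    return "\n".join(lines)


def majority_vote(detectors: List[Detector], learner_prompt: str) -> bool:
    """Combine several detectors by strict majority; a tie counts as False."""
    votes = [detector(learner_prompt) for detector in detectors]
    return sum(votes) * 2 > len(votes)
```

A detector built this way would send the output of `build_few_shot_prompt` to a model and parse the yes/no reply. `majority_vote` is one simple way to form an ensemble, and the record does not say which combination rule the study used.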