:: Library Catalog

Bibliographic Details
Main Authors:	Moore, Steven; Costello, Eamon; Nguyen, Huy A.; Stamper, John
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2405.20529

Similar Items

Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions
by: Moore, Steven, et al.
Published: (2024)

Cognitive Agent Compilation for Explicit Problem Solver Modeling
by: Moon, Hyeongdon, et al.
Published: (2026)

Automatic Evaluation of Healthcare LLMs Beyond Question-Answering
by: Arias-Duart, Anna, et al.
Published: (2025)

Small but Significant: On the Promise of Small Language Models for Accessible AIED
by: Wei, Yumou, et al.
Published: (2025)

Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation
by: Bohnet, Bernd, et al.
Published: (2024)

AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs
by: Siro, Clemencia, et al.
Published: (2024)

Automatic Generation of Inference Making Questions for Reading Comprehension Assessments
by: Ma, Wanjing Anya, et al.
Published: (2025)

Automatic Dataset Generation for Knowledge Intensive Question Answering Tasks
by: Yuen, Sizhe, et al.
Published: (2025)

Hallucination-Free Automatic Question & Answer Generation for Intuitive Learning
by: Wang, Nicholas X., et al.
Published: (2026)

Automatic Question & Answer Generation Using Generative Large Language Model (LLM)
by: Ehsan, Md. Alvee, et al.
Published: (2025)

WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models
by: Gupta, Prannaya, et al.
Published: (2024)

SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
by: Wang, Yiheng, et al.
Published: (2025)

Are Large Language Models Consistent over Value-laden Questions?
by: Moore, Jared, et al.
Published: (2024)

Confabulation: The Surprising Value of Large Language Model Hallucinations
by: Sui, Peiqi, et al.
Published: (2024)

Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering
by: Ngo, Nghia Trung, et al.
Published: (2024)

Bias in the Tails: How Name-conditioned Evaluative Framing in Resume Summaries Destabilizes LLM-based Hiring
by: Nghiem, Huy, et al.
Published: (2026)

CompeteSMoE -- Statistically Guaranteed Mixture of Experts Training via Competition
by: Nguyen, Nam V., et al.
Published: (2025)

Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment
by: Nghiem, Huy, et al.
Published: (2025)

From Answers to Questions: EQGBench for Evaluating LLMs' Educational Question Generation
by: Zhou, Chengliang, et al.
Published: (2025)

R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models
by: Tu, Shangqing, et al.
Published: (2024)

Automatic Legal Writing Evaluation of LLMs
by: Pires, Ramon, et al.
Published: (2025)

Sketch: A Toolkit for Streamlining LLM Operations
by: Jiang, Xin, et al.
Published: (2024)

The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory
by: Schmucker, Robin, et al.
Published: (2025)

Automatic Generation of Question Hints for Mathematics Problems using Large Language Models in Educational Technology
by: Tonga, Junior Cedric, et al.
Published: (2024)

Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
by: Gao, Alice, et al.
Published: (2026)

Adaptive Stopping for Multi-Turn LLM Reasoning
by: Zhou, Xiaofan, et al.
Published: (2026)

Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
by: Xu, Austin, et al.
Published: (2025)

VLSP 2025 MLQA-TSR Challenge: Vietnamese Multimodal Legal Question Answering on Traffic Sign Regulation
by: Luu, Son T., et al.
Published: (2025)

Feedback Forensics: A Toolkit to Measure AI Personality
by: Findeis, Arduin, et al.
Published: (2025)

VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering
by: Nguyen, Tan-Minh, et al.
Published: (2025)

Evaluating the Fitness of Ontologies for the Task of Question Generation
by: Alkhuzaey, Samah, et al.
Published: (2025)

Qworld: Question-Specific Evaluation Criteria for LLMs
by: Gao, Shanghua, et al.
Published: (2026)

Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore's languages
by: Nguyen, Tuan, et al.
Published: (2025)

Cross-Attention Watermarking of Large Language Models
by: Baldassini, Folco Bertini, et al.
Published: (2024)

Jury: A Comprehensive Evaluation Toolkit
by: Cavusoglu, Devrim, et al.
Published: (2023)

Towards Automatic Evaluation of Task-Oriented Dialogue Flows
by: Mirtaheri, Mehrnoosh, et al.
Published: (2024)

Submodular Evaluation Subset Selection in Automatic Prompt Optimization
by: Nian, Jinming, et al.
Published: (2026)

LMFlow: An Extensible Toolkit for Finetuning and Inference of Large Foundation Models
by: Diao, Shizhe, et al.
Published: (2023)

JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
by: Ran, Delong, et al.
Published: (2024)

GR-NLP-TOOLKIT: An Open-Source NLP Toolkit for Modern Greek
by: Loukas, Lefteris, et al.
Published: (2024)