:: Library Catalog

Bibliographic Details
Main Authors:	Meyer, Gérôme; Breuer, Philip; Fürst, Jonathan
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2409.18596

Similar Items

Focusing on Students, not Machines: Grounded Question Generation and Automated Answer Grading
by: Meyer, Gérôme, et al.
Published: (2025)

Statistical Comparative Analysis of Semantic Similarities and Model Transferability Across Datasets for Short Answer Grading
by: Bonthu, Sridevi, et al.
Published: (2025)

Bench360: Benchmarking Local LLM Inference from 360 Degrees
by: Stuhlmann, Linus, et al.
Published: (2025)

When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition
by: Yao, Siyang, et al.
Published: (2026)

Proving that Cryptic Crossword Clue Answers are Correct
by: Andrews, Martin, et al.
Published: (2024)

When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs
by: Zagribelnyy, Bogdan, et al.
Published: (2026)

Measuring and Reducing LLM Hallucination without Gold-Standard Answers
by: Wei, Jiaheng, et al.
Published: (2024)

Answer Matching Outperforms Multiple Choice for Language Model Evaluation
by: Chandak, Nikhil, et al.
Published: (2025)

Clarify, Abstain or Answer? Strategising in Conversation with Belief-Augmented Generation
by: Baan, Joris, et al.
Published: (2026)

SelfReflect: Can LLMs Communicate Their Internal Answer Distribution?
by: Kirchhof, Michael, et al.
Published: (2025)

Surfacing Semantic Orthogonality Across Model Safety Benchmarks: A Multi-Dimensional Analysis
by: Bennion, Jonathan, et al.
Published: (2025)

Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation
by: Qi, Jirui, et al.
Published: (2024)

Rewarding Intellectual Humility Learning When Not To Answer In Large Language Models
by: Jha, Abha, et al.
Published: (2026)

The Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection
by: Hu, Zhengyu, et al.
Published: (2026)

Train Once, Answer All: Many Pretraining Experiments for the Cost of One
by: Bordt, Sebastian, et al.
Published: (2025)

MultiQ&A: An Analysis in Measuring Robustness via Automated Crowdsourcing of Question Perturbations and Answers
by: Cho, Nicole, et al.
Published: (2025)

Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions
by: Li, Ruizhe, et al.
Published: (2024)

Explicit Diversity Conditions for Effective Question Answer Generation with Large Language Models
by: Yadav, Vikas, et al.
Published: (2024)

The Challenge of Achieving Attributability in Multilingual Table-to-Text Generation with Question-Answer Blueprints
by: Haussmann, Aden
Published: (2025)

Clinical QA 2.0: Multi-Task Learning for Answer Extraction and Categorization
by: Pattnayak, Priyaranjan, et al.
Published: (2025)

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
by: Li, Kenneth, et al.
Published: (2023)

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
by: Wang, Zhilin, et al.
Published: (2025)

When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment
by: Zhang, Long, et al.
Published: (2026)

SuRe: Summarizing Retrievals using Answer Candidates for Open-domain QA of LLMs
by: Kim, Jaehyung, et al.
Published: (2024)

Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
by: Hammoud, Hasan Abed Al Kader, et al.
Published: (2025)

Promote, Suppress, Iterate: How Language Models Answer One-to-Many Factual Queries
by: Yan, Tianyi Lorena, et al.
Published: (2025)

ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
by: Karger, Ezra, et al.
Published: (2024)

A Careful Examination of Large Language Model Performance on Grade School Arithmetic
by: Zhang, Hugh, et al.
Published: (2024)

Beyond Answers: Transferring Reasoning Capabilities to Smaller LLMs Using Multi-Teacher Knowledge Distillation
by: Tian, Yijun, et al.
Published: (2024)

MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters
by: Dada, Amin, et al.
Published: (2025)

From Flat to Structural: Enhancing Automated Short Answer Grading with GraphRAG
by: Chu, Yucheng, et al.
Published: (2026)

Crystal-KV: Efficient KV Cache Management for Chain-of-Thought LLMs via Answer-First Principle
by: Wang, Zihan, et al.
Published: (2026)

Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data
by: Zhao, Shuai, et al.
Published: (2025)

CORI: CJKV Benchmark with Romanization Integration -- A step towards Cross-lingual Transfer Beyond Textual Scripts
by: Nguyen, Hoang H., et al.
Published: (2024)

Towards A Unified View of Answer Calibration for Multi-Step Reasoning
by: Deng, Shumin, et al.
Published: (2023)

Implicit Probabilistic Reasoning Does Not Reflect Explicit Answers in Large Language Models
by: Mondal, Manuel, et al.
Published: (2024)

Fantastic Bugs and Where to Find Them in AI Benchmarks
by: Truong, Sang, et al.
Published: (2025)

Short-form Text Rewriting with Phi Silica
by: Tadimeti, Divya, et al.
Published: (2026)

PLANET: A Collection of Benchmarks for Evaluating LLMs' Planning Capabilities
by: Li, Haoming, et al.
Published: (2025)

DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion
by: Lou, Yuxuan, et al.
Published: (2026)