| Main Authors: | Meyer, Gérôme, Breuer, Philip, Fürst, Jonathan |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2409.18596 |
Similar Items
Focusing on Students, not Machines: Grounded Question Generation and Automated Answer Grading
by: Meyer, Gérôme, et al.
Published: (2025)
Statistical Comparative Analysis of Semantic Similarities and Model Transferability Across Datasets for Short Answer Grading
by: Bonthu, Sridevi, et al.
Published: (2025)
Bench360: Benchmarking Local LLM Inference from 360 Degrees
by: Stuhlmann, Linus, et al.
Published: (2025)
When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition
by: Yao, Siyang, et al.
Published: (2026)
Proving that Cryptic Crossword Clue Answers are Correct
by: Andrews, Martin, et al.
Published: (2024)
When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs
by: Zagribelnyy, Bogdan, et al.
Published: (2026)
Measuring and Reducing LLM Hallucination without Gold-Standard Answers
by: Wei, Jiaheng, et al.
Published: (2024)
Answer Matching Outperforms Multiple Choice for Language Model Evaluation
by: Chandak, Nikhil, et al.
Published: (2025)
Clarify, Abstain or Answer? Strategising in Conversation with Belief-Augmented Generation
by: Baan, Joris, et al.
Published: (2026)
SelfReflect: Can LLMs Communicate Their Internal Answer Distribution?
by: Kirchhof, Michael, et al.
Published: (2025)
Surfacing Semantic Orthogonality Across Model Safety Benchmarks: A Multi-Dimensional Analysis
by: Bennion, Jonathan, et al.
Published: (2025)
Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation
by: Qi, Jirui, et al.
Published: (2024)
Rewarding Intellectual Humility Learning When Not To Answer In Large Language Models
by: Jha, Abha, et al.
Published: (2026)
The Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection
by: Hu, Zhengyu, et al.
Published: (2026)
Train Once, Answer All: Many Pretraining Experiments for the Cost of One
by: Bordt, Sebastian, et al.
Published: (2025)
MultiQ&A: An Analysis in Measuring Robustness via Automated Crowdsourcing of Question Perturbations and Answers
by: Cho, Nicole, et al.
Published: (2025)
Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions
by: Li, Ruizhe, et al.
Published: (2024)
Explicit Diversity Conditions for Effective Question Answer Generation with Large Language Models
by: Yadav, Vikas, et al.
Published: (2024)
The Challenge of Achieving Attributability in Multilingual Table-to-Text Generation with Question-Answer Blueprints
by: Haussmann, Aden
Published: (2025)
Clinical QA 2.0: Multi-Task Learning for Answer Extraction and Categorization
by: Pattnayak, Priyaranjan, et al.
Published: (2025)
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
by: Li, Kenneth, et al.
Published: (2023)
ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
by: Wang, Zhilin, et al.
Published: (2025)
When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment
by: Zhang, Long, et al.
Published: (2026)
SuRe: Summarizing Retrievals using Answer Candidates for Open-domain QA of LLMs
by: Kim, Jaehyung, et al.
Published: (2024)
Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
by: Hammoud, Hasan Abed Al Kader, et al.
Published: (2025)
Promote, Suppress, Iterate: How Language Models Answer One-to-Many Factual Queries
by: Yan, Tianyi Lorena, et al.
Published: (2025)
ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
by: Karger, Ezra, et al.
Published: (2024)
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
by: Zhang, Hugh, et al.
Published: (2024)
Beyond Answers: Transferring Reasoning Capabilities to Smaller LLMs Using Multi-Teacher Knowledge Distillation
by: Tian, Yijun, et al.
Published: (2024)
MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters
by: Dada, Amin, et al.
Published: (2025)
From Flat to Structural: Enhancing Automated Short Answer Grading with GraphRAG
by: Chu, Yucheng, et al.
Published: (2026)
Crystal-KV: Efficient KV Cache Management for Chain-of-Thought LLMs via Answer-First Principle
by: Wang, Zihan, et al.
Published: (2026)
Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data
by: Zhao, Shuai, et al.
Published: (2025)
CORI: CJKV Benchmark with Romanization Integration -- A step towards Cross-lingual Transfer Beyond Textual Scripts
by: Nguyen, Hoang H., et al.
Published: (2024)
Towards A Unified View of Answer Calibration for Multi-Step Reasoning
by: Deng, Shumin, et al.
Published: (2023)
Implicit Probabilistic Reasoning Does Not Reflect Explicit Answers in Large Language Models
by: Mondal, Manuel, et al.
Published: (2024)
Fantastic Bugs and Where to Find Them in AI Benchmarks
by: Truong, Sang, et al.
Published: (2025)
Short-form Text Rewriting with Phi Silica
by: Tadimeti, Divya, et al.
Published: (2026)
PLANET: A Collection of Benchmarks for Evaluating LLMs' Planning Capabilities
by: Li, Haoming, et al.
Published: (2025)
DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion
by: Lou, Yuxuan, et al.
Published: (2026)