| Main Authors: | Salido, Eva Sánchez; Gonzalo, Julio; Marco, Guillermo |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2502.12896 |
Similar Items
Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination
by: Salido, Eva Sánchez, et al.
Published: (2024)
The Reader is the Metric: How Textual Features and Reader Profiles Explain Conflicting Evaluations of AI Creative Writing
by: Marco, Guillermo, et al.
Published: (2025)
Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation
by: Hidayat, Naila Shafirni, et al.
Published: (2025)
Multiple-Choice Questions are Efficient and Robust LLM Evaluators
by: Zhang, Ziyin, et al.
Published: (2024)
Beyond Memorization: Distinguishing between Reductive and Epistemic Reasoning in LLMs using Classic Logic Puzzles
by: Gabay, Adi, et al.
Published: (2026)
Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning
by: Palta, Shramay, et al.
Published: (2024)
None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering
by: Tam, Zhi Rui, et al.
Published: (2025)
Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMs
by: Marco, Guillermo, et al.
Published: (2024)
Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices
by: Cavalin, Paulo, et al.
Published: (2025)
Can LLM Graph Reasoning Generalize beyond Pattern Memorization?
by: Zhang, Yizhuo, et al.
Published: (2024)
BOE-XSUM: Extreme Summarization in Clear Language of Spanish Legal Decrees and Notifications
by: García, Andrés Fernández, et al.
Published: (2025)
One Size Fits None: Heuristic Collapse in LLM Investment Advice
by: Ross, Jillian, et al.
Published: (2026)
Distractor Generation in Multiple-Choice Tasks: A Survey of Methods, Datasets, and Evaluation
by: Alhazmi, Elaf, et al.
Published: (2024)
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
by: Sun, Jiaxing, et al.
Published: (2024)
GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams
by: Zhang, Yushun, et al.
Published: (2026)
Reasoning Models are Test Exploiters: Rethinking Multiple-Choice
by: Raman, Narun, et al.
Published: (2025)
Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation
by: Yang, Running, et al.
Published: (2025)
Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks
by: Shelat, Shlok, et al.
Published: (2026)
Alleviating Choice Supportive Bias in LLM with Reasoning Dependency Generation
by: Zhuang, Nan, et al.
Published: (2025)
Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident, Especially When They are Wrong
by: Fu, Tairan, et al.
Published: (2025)
Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering
by: Molfese, Francesco Maria, et al.
Published: (2025)
Reason to Rote: Rethinking Memorization in Reasoning
by: Du, Yupei, et al.
Published: (2025)
Test-Time Reasoners Are Strategic Multiple-Choice Test-Takers
by: Balepur, Nishant, et al.
Published: (2025)
Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs
by: Hans, Abhimanyu, et al.
Published: (2024)
Towards a Diagnostic and Predictive Evaluation Methodology for Sequence Labeling Tasks
by: Alvarez-Mellado, Elena, et al.
Published: (2026)
Character-aware Transformers Learn an Irregular Morphological Pattern Yet None Generalize Like Humans
by: Ramarao, Akhilesh Kakolu, et al.
Published: (2026)
UBench: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions
by: Wang, Xunzhi, et al.
Published: (2024)
Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?
by: Marco, Guillermo, et al.
Published: (2024)
MatheMagic: Generating Dynamic Mathematics Benchmarks Robust to Memorization
by: O'Brien, Dayyán, et al.
Published: (2025)
AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
by: Xiao, Jianfei, et al.
Published: (2026)
Memorization or Interpolation ? Detecting LLM Memorization through Input Perturbation Analysis
by: Djiré, Albérick Euraste, et al.
Published: (2025)
Memorization or Reasoning? Exploring the Idiom Understanding of LLMs
by: Kim, Jisu, et al.
Published: (2025)
On Memorization of Large Language Models in Logical Reasoning
by: Xie, Chulin, et al.
Published: (2024)
LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering
by: Sutanto, Patrick, et al.
Published: (2024)
Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options
by: Lee, Nahyun, et al.
Published: (2026)
BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks
by: Balepur, Nishant, et al.
Published: (2026)
Beyond Multiple Choice: Evaluating Steering Vectors for Summarization
by: Braun, Joschka, et al.
Published: (2025)
Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions
by: Labat, Léo, et al.
Published: (2026)
Data Compressibility Quantifies LLM Memorization
by: Huang, Yizhan, et al.
Published: (2025)
AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects
by: Mustapha, Ahmad, et al.
Published: (2024)