| Main Author: | D'addario, Andrew Maranhão Ventura |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.21578 |
Similar Items
Medical Malice: A Dataset for Context-Aware Safety in Healthcare LLMs
by: D'addario, Andrew Maranhão Ventura
Published: (2025)
FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models
by: Zhu, Andrew, et al.
Published: (2024)
DateLogicQA: Benchmarking Temporal Biases in Large Language Models
by: Bhatia, Gagan, et al.
Published: (2024)
DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation
by: Doris, Anna C., et al.
Published: (2024)
Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models
by: Tan, Yingshui, et al.
Published: (2024)
Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark
by: Mastrokostas, Charalampos, et al.
Published: (2026)
Benchmarking the Pedagogical Knowledge of Large Language Models
by: Lelièvre, Maxime, et al.
Published: (2025)
MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering
by: Bahaj, Adil, et al.
Published: (2025)
RelationalFactQA: A Benchmark for Evaluating Tabular Fact Retrieval from Large Language Models
by: Satriani, Dario, et al.
Published: (2025)
Uncovering Competency Gaps in Large Language Models and Their Benchmarks
by: Bohacek, Maty, et al.
Published: (2025)
ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models
by: Düzkar, Kemal
Published: (2026)
IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property
by: Wang, Qiyao, et al.
Published: (2025)
Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models
by: Li, Haoyang, et al.
Published: (2025)
A Women's Health Benchmark for Large Language Models
by: Gruber, Victoria-Elisabeth, et al.
Published: (2025)
Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts
by: Xie, Jian, et al.
Published: (2023)
BR-TaxQA-R: A Dataset for Question Answering with References for Brazilian Personal Income Tax Law, including case law
by: Júnior, Juvenal Domingos, et al.
Published: (2025)
Can Large Language Models Derive New Knowledge? A Dynamic Benchmark for Biological Knowledge Discovery
by: Yang, Chaoqun, et al.
Published: (2026)
EnviroExam: Benchmarking Environmental Science Knowledge of Large Language Models
by: Huang, Yu, et al.
Published: (2024)
Neural Incompatibility: The Unbridgeable Gap of Cross-Scale Parametric Knowledge Transfer in Large Language Models
by: Tan, Yuqiao, et al.
Published: (2025)
Retrieval-Constrained Decoding Reveals Underestimated Parametric Knowledge in Language Models
by: Hamdani, Rajaa El, et al.
Published: (2025)
When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
by: Alzahrani, Norah, et al.
Published: (2024)
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models
by: Li, Chuhan, et al.
Published: (2024)
Self-Prompting Large Language Models for Zero-Shot Open-Domain QA
by: Li, Junlong, et al.
Published: (2022)
CriticEval: Evaluating Large Language Model as Critic
by: Lan, Tian, et al.
Published: (2024)
Retrieval Augmented Generation-based Large Language Models for Bridging Transportation Cybersecurity Legal Knowledge Gaps
by: Akbar, Khandakar Ashrafi, et al.
Published: (2025)
Assessing The Potential Of Mid-Sized Language Models For Clinical QA
by: Bolton, Elliot, et al.
Published: (2024)
ECG-Expert-QA: A Benchmark for Evaluating Medical Large Language Models in Heart Disease Diagnosis
by: Wang, Xu, et al.
Published: (2025)
VLKEB: A Large Vision-Language Model Knowledge Editing Benchmark
by: Huang, Han, et al.
Published: (2024)
Closing the Confidence-Faithfulness Gap in Large Language Models
by: Miao, Miranda Muqing, et al.
Published: (2026)
GAPMAP: Mapping Scientific Knowledge Gaps in Biomedical Literature Using Large Language Models
by: Salem, Nourah M, et al.
Published: (2025)
TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health
by: Xiong, Zixin, et al.
Published: (2026)
Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
by: Cui, Tianyu, et al.
Published: (2024)
Knowledge Tagging with Large Language Model based Multi-Agent System
by: Li, Hang, et al.
Published: (2024)
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
by: Haimes, Jacob, et al.
Published: (2024)
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models
by: Jin, Zhuoran, et al.
Published: (2024)
BAGEL: Benchmarking Animal Knowledge Expertise in Language Models
by: Shen, Jiacheng, et al.
Published: (2026)
Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs
by: Aswal, Darpan, et al.
Published: (2025)
Revealing the Numeracy Gap: An Empirical Investigation of Text Embedding Models
by: Deng, Ningyuan, et al.
Published: (2025)
"The Dentist is an involved parent, the bartender is not": Revealing Implicit Biases in QA with Implicit BBQ
by: Wagh, Aarushi, et al.
Published: (2025)
WoLF: Wide-scope Large Language Model Framework for CXR Understanding
by: Kang, Seil, et al.
Published: (2024)