| Main Authors: | Uluoglakci, Cem, Temizel, Tugba Taskaya |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2402.16211 |
Similar Items
Inducing Epistemological Humility in Large Language Models: A Targeted SFT Approach to Reducing Hallucination
by: Uluoglakci, Cem, et al.
Published: (2026)
Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering
by: Mao, Nathan, et al.
Published: (2026)
Steerability of Instrumental-Convergence Tendencies in LLMs
by: Hoscilowicz, Jakub
Published: (2026)
HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation
by: Liu, Haokun, et al.
Published: (2025)
According to Me: Long-Term Personalized Referential Memory QA
by: Mei, Jingbiao, et al.
Published: (2026)
Hypo3D: Exploring Hypothetical Reasoning in 3D
by: Mao, Ye, et al.
Published: (2025)
HumMusQA: A Human-written Music Understanding QA Benchmark Dataset
by: Weck, Benno, et al.
Published: (2026)
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
by: Monteiro, Joao, et al.
Published: (2024)
Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks
by: Šmíd, Jakub, et al.
Published: (2026)
Factuality or Fiction? Benchmarking Modern LLMs on Ambiguous QA with Citations
by: Patel, Maya, et al.
Published: (2024)
Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions
by: Hou, Yutao, et al.
Published: (2024)
LLMs' Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements
by: Basmov, Victoria, et al.
Published: (2024)
NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates
by: Deng, Hexuan, et al.
Published: (2024)
Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts
by: Chen, Xiangnan, et al.
Published: (2025)
DisasterQA: A Benchmark for Assessing the performance of LLMs in Disaster Response
by: Rawat, Rajat
Published: (2024)
SwaQuAD-24: QA Benchmark Dataset in Swahili
by: Kondoro, Alfred Malengo
Published: (2024)
RJUA-QA: A Comprehensive QA Dataset for Urology
by: Lyu, Shiwei, et al.
Published: (2023)
PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation
by: Carrillo-Larco, Rodrigo M., et al.
Published: (2025)
Leveraging LLMs for Hypothetical Deduction in Logical Inference: A Neuro-Symbolic Approach
by: Li, Qingchuan, et al.
Published: (2024)
Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
by: Tavakoli, Mohammad, et al.
Published: (2025)
HypoSpace: A Diagnostic Benchmark for Set-Valued Hypothesis Generation under Underdetermination and Sublinear Coverage Bounds
by: Chen, Tingting, et al.
Published: (2025)
CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering
by: Li, Zongxi, et al.
Published: (2025)
Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning
by: Zheng, Danna, et al.
Published: (2024)
TA-Mem: Tool-Augmented Autonomous Memory Retrieval for LLM in Long-Term Conversational QA
by: Yuan, Mengwei, et al.
Published: (2026)
KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs
by: Ko, Donghyeon, et al.
Published: (2025)
Semantic Reformulation Entropy for Robust Hallucination Detection in QA Tasks
by: Tong, Chaodong, et al.
Published: (2025)
Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
by: Simhi, Adi, et al.
Published: (2024)
An Evaluation of LLMs for Detecting Harmful Computing Terms
by: Jacas, Joshua, et al.
Published: (2025)
Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
by: Du, Yanrui, et al.
Published: (2023)
QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA
by: Dineen, Jacob, et al.
Published: (2025)
BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets
by: Jonker, Richard A. A., et al.
Published: (2026)
PersonaVLM: Long-Term Personalized Multimodal LLMs
by: Nie, Chang, et al.
Published: (2026)
SciFaultyQA: Benchmarking LLMs on Faulty Science Question Detection with a GAN-Inspired Approach to Synthetic Dataset Generation
by: Kundu, Debarshi
Published: (2024)
Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs
by: Hu, Nan, et al.
Published: (2024)
From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
by: Uddin, Md Nayem, et al.
Published: (2026)
DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation
by: Rahman, A B M Ashikur, et al.
Published: (2024)
Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation
by: Wu, Xinwei, et al.
Published: (2025)
JBE-QA: Japanese Bar Exam QA Dataset for Assessing Legal Domain Knowledge
by: Cao, Zhihan, et al.
Published: (2025)
DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA
by: Yin, Jianing, et al.
Published: (2026)
AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset
by: Olatunji, Tobi, et al.
Published: (2024)