:: Library Catalog

Bibliographic Details
Main Authors:	Uluoglakci, Cem; Temizel, Tugba Taskaya
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2402.16211

Similar Items

Inducing Epistemological Humility in Large Language Models: A Targeted SFT Approach to Reducing Hallucination
by: Uluoglakci, Cem, et al.
Published: (2026)

Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering
by: Mao, Nathan, et al.
Published: (2026)

Steerability of Instrumental-Convergence Tendencies in LLMs
by: Hoscilowicz, Jakub
Published: (2026)

HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation
by: Liu, Haokun, et al.
Published: (2025)

According to Me: Long-Term Personalized Referential Memory QA
by: Mei, Jingbiao, et al.
Published: (2026)

Hypo3D: Exploring Hypothetical Reasoning in 3D
by: Mao, Ye, et al.
Published: (2025)

HumMusQA: A Human-written Music Understanding QA Benchmark Dataset
by: Weck, Benno, et al.
Published: (2026)

RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
by: Monteiro, Joao, et al.
Published: (2024)

Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks
by: Šmíd, Jakub, et al.
Published: (2026)

Factuality or Fiction? Benchmarking Modern LLMs on Ambiguous QA with Citations
by: Patel, Maya, et al.
Published: (2024)

Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions
by: Hou, Yutao, et al.
Published: (2024)

LLMs' Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements
by: Basmov, Victoria, et al.
Published: (2024)

NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates
by: Deng, Hexuan, et al.
Published: (2024)

Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts
by: Chen, Xiangnan, et al.
Published: (2025)

DisasterQA: A Benchmark for Assessing the performance of LLMs in Disaster Response
by: Rawat, Rajat
Published: (2024)

SwaQuAD-24: QA Benchmark Dataset in Swahili
by: Kondoro, Alfred Malengo
Published: (2024)

RJUA-QA: A Comprehensive QA Dataset for Urology
by: Lyu, Shiwei, et al.
Published: (2023)

PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation
by: Carrillo-Larco, Rodrigo M., et al.
Published: (2025)

Leveraging LLMs for Hypothetical Deduction in Logical Inference: A Neuro-Symbolic Approach
by: Li, Qingchuan, et al.
Published: (2024)

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
by: Tavakoli, Mohammad, et al.
Published: (2025)

HypoSpace: A Diagnostic Benchmark for Set-Valued Hypothesis Generation under Underdetermination and Sublinear Coverage Bounds
by: Chen, Tingting, et al.
Published: (2025)

CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering
by: Li, Zongxi, et al.
Published: (2025)

Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning
by: Zheng, Danna, et al.
Published: (2024)

TA-Mem: Tool-Augmented Autonomous Memory Retrieval for LLM in Long-Term Conversational QA
by: Yuan, Mengwei, et al.
Published: (2026)

KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs
by: Ko, Donghyeon, et al.
Published: (2025)

Semantic Reformulation Entropy for Robust Hallucination Detection in QA Tasks
by: Tong, Chaodong, et al.
Published: (2025)

Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
by: Simhi, Adi, et al.
Published: (2024)

An Evaluation of LLMs for Detecting Harmful Computing Terms
by: Jacas, Joshua, et al.
Published: (2025)

Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
by: Du, Yanrui, et al.
Published: (2023)

QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA
by: Dineen, Jacob, et al.
Published: (2025)

BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets
by: Jonker, Richard A. A., et al.
Published: (2026)

PersonaVLM: Long-Term Personalized Multimodal LLMs
by: Nie, Chang, et al.
Published: (2026)

SciFaultyQA: Benchmarking LLMs on Faulty Science Question Detection with a GAN-Inspired Approach to Synthetic Dataset Generation
by: Kundu, Debarshi
Published: (2024)

Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs
by: Hu, Nan, et al.
Published: (2024)

From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
by: Uddin, Md Nayem, et al.
Published: (2026)

DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation
by: Rahman, A B M Ashikur, et al.
Published: (2024)

Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation
by: Wu, Xinwei, et al.
Published: (2025)

JBE-QA: Japanese Bar Exam QA Dataset for Assessing Legal Domain Knowledge
by: Cao, Zhihan, et al.
Published: (2025)

DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA
by: Yin, Jianing, et al.
Published: (2026)

AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset
by: Olatunji, Tobi, et al.
Published: (2024)