Library Catalog

Bibliographic Details
Main Author:	Song, Tae-Eun
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2603.21454

Similar Items

Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions
by: Song, Tae-Eun
Published: (2026)

More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification
by: Song, Tae-Eun
Published: (2026)

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions
by: Ding, Xianzhong, et al.
Published: (2026)

Towards Contamination Resistant Benchmarks
by: Musawi, Rahmatullah, et al.
Published: (2025)

Contamination Report for Multilingual Benchmarks
by: Ahuja, Sanchit, et al.
Published: (2024)

LogitTrace: Detecting Benchmark Contamination via Layerwise Logit Trajectories
by: He, Zirui, et al.
Published: (2025)

Detecting Data Contamination in LLMs via In-Context Learning
by: Zawalski, Michał, et al.
Published: (2025)

Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms
by: Azarafrooz, Ari
Published: (2026)

HalluciNot: Hallucination Detection Through Context and Common Knowledge Verification
by: Paudel, Bibek, et al.
Published: (2025)

Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks
by: Song, Yiliang, et al.
Published: (2026)

Emergent Inference-Time Semantic Contamination via In-Context Priming
by: Abram, Marcin
Published: (2026)

SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
by: Safarzadeh, Mohammadtaher, et al.
Published: (2026)

HCRE: LLM-based Hierarchical Classification for Cross-Document Relation Extraction with a Prediction-then-Verification Strategy
by: Ma, Guoqi, et al.
Published: (2026)

Benchmark Data Contamination of Large Language Models: A Survey
by: Xu, Cheng, et al.
Published: (2024)

Data Contamination Can Cross Language Barriers
by: Yao, Feng, et al.
Published: (2024)

KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts
by: Hwang, Taebaek, et al.
Published: (2025)

Hierarchical Verification of Speculative Beams for Accelerating LLM Inference
by: Sen, Jaydip, et al.
Published: (2025)

PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models
by: Zhang, Huixuan, et al.
Published: (2024)

CAP: Data Contamination Detection via Consistency Amplification
by: Zhao, Yi, et al.
Published: (2024)

EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory
by: Shen, Ye, et al.
Published: (2026)

MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
by: He, Zexue, et al.
Published: (2026)

INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection
by: Kulkarni, Shubham, et al.
Published: (2026)

Poly-FEVER: A Multilingual Fact Verification Benchmark for Hallucination Detection in Large Language Models
by: Zhang, Hanzhi, et al.
Published: (2025)

Investigating Data Contamination in Modern Benchmarks for Large Language Models
by: Deng, Chunyuan, et al.
Published: (2023)

Latent Preference Modeling for Cross-Session Personalized Tool Calling
by: Yoon, Yejin, et al.
Published: (2026)

GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning
by: Siingh, Shikhhar, et al.
Published: (2025)

Beyond Isolated Behaviors: Hierarchical User Modeling for LLM Personalization
by: Wang, Liang, et al.
Published: (2026)

SCCD: A Session-based Dataset for Chinese Cyberbullying Detection
by: Yang, Qingpo, et al.
Published: (2025)

When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation
by: Tan, David, et al.
Published: (2026)

Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and Voice Anonymization
by: Tomashenko, Natalia, et al.
Published: (2024)

Enhancing LLM Language Adaption through Cross-lingual In-Context Pre-training
by: Wu, Linjuan, et al.
Published: (2025)

Adaptive Cross-lingual Text Classification through In-Context One-Shot Demonstrations
by: Villa-Cueva, Emilio, et al.
Published: (2024)

Designing and Evaluating Multi-Chatbot Interface for Human-AI Communication: Preliminary Findings from a Persuasion Task
by: Yoon, Sion, et al.
Published: (2024)

TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs
by: Rajore, Tanmay, et al.
Published: (2024)

EcoSafeRAG: Efficient Security through Context Analysis in Retrieval-Augmented Generation
by: Yao, Ruobing, et al.
Published: (2025)

Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models
by: Golchin, Shahriar, et al.
Published: (2023)

Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling
by: Maekawa, Seiji, et al.
Published: (2025)

Retromorphic Testing with Hierarchical Verification for Hallucination Detection in RAG
by: Yu, Boxi, et al.
Published: (2026)

Quantifying Data Contamination in Psychometric Evaluations of LLMs
by: Han, Jongwook, et al.
Published: (2025)

HLL: Can Agents Cross Humanity's Last Line of Verification?
by: Song, Xinhao, et al.
Published: (2026)