| Main Author: | Song, Tae-Eun |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.21454 |
Similar Items
Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions
by: Song, Tae-Eun
Published: (2026)
More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification
by: Tae-Eun, Song
Published: (2026)
ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions
by: Ding, Xianzhong, et al.
Published: (2026)
Towards Contamination Resistant Benchmarks
by: Musawi, Rahmatullah, et al.
Published: (2025)
Contamination Report for Multilingual Benchmarks
by: Ahuja, Sanchit, et al.
Published: (2024)
LogitTrace: Detecting Benchmark Contamination via Layerwise Logit Trajectories
by: He, Zirui, et al.
Published: (2025)
Detecting Data Contamination in LLMs via In-Context Learning
by: Zawalski, Michał, et al.
Published: (2025)
Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms
by: Azarafrooz, Ari
Published: (2026)
HalluciNot: Hallucination Detection Through Context and Common Knowledge Verification
by: Paudel, Bibek, et al.
Published: (2025)
Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks
by: Song, Yiliang, et al.
Published: (2026)
Emergent Inference-Time Semantic Contamination via In-Context Priming
by: Abram, Marcin
Published: (2026)
SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
by: Safarzadeh, Mohammadtaher, et al.
Published: (2026)
HCRE: LLM-based Hierarchical Classification for Cross-Document Relation Extraction with a Prediction-then-Verification Strategy
by: Ma, Guoqi, et al.
Published: (2026)
Benchmark Data Contamination of Large Language Models: A Survey
by: Xu, Cheng, et al.
Published: (2024)
Data Contamination Can Cross Language Barriers
by: Yao, Feng, et al.
Published: (2024)
KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts
by: Hwang, Taebaek, et al.
Published: (2025)
Hierarchical Verification of Speculative Beams for Accelerating LLM Inference
by: Sen, Jaydip, et al.
Published: (2025)
PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models
by: Zhang, Huixuan, et al.
Published: (2024)
CAP: Data Contamination Detection via Consistency Amplification
by: Zhao, Yi, et al.
Published: (2024)
EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory
by: Shen, Ye, et al.
Published: (2026)
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
by: He, Zexue, et al.
Published: (2026)
INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection
by: Kulkarni, Shubham, et al.
Published: (2026)
Poly-FEVER: A Multilingual Fact Verification Benchmark for Hallucination Detection in Large Language Models
by: Zhang, Hanzhi, et al.
Published: (2025)
Investigating Data Contamination in Modern Benchmarks for Large Language Models
by: Deng, Chunyuan, et al.
Published: (2023)
Latent Preference Modeling for Cross-Session Personalized Tool Calling
by: Yoon, Yejin, et al.
Published: (2026)
GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning
by: Siingh, Shikhhar, et al.
Published: (2025)
Beyond Isolated Behaviors: Hierarchical User Modeling for LLM Personalization
by: Wang, Liang, et al.
Published: (2026)
SCCD: A Session-based Dataset for Chinese Cyberbullying Detection
by: Yang, Qingpo, et al.
Published: (2025)
When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation
by: Tan, David, et al.
Published: (2026)
Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and Voice Anonymization
by: Tomashenko, Natalia, et al.
Published: (2024)
Enhancing LLM Language Adaption through Cross-lingual In-Context Pre-training
by: Wu, Linjuan, et al.
Published: (2025)
Adaptive Cross-lingual Text Classification through In-Context One-Shot Demonstrations
by: Villa-Cueva, Emilio, et al.
Published: (2024)
Designing and Evaluating Multi-Chatbot Interface for Human-AI Communication: Preliminary Findings from a Persuasion Task
by: Yoon, Sion, et al.
Published: (2024)
TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs
by: Rajore, Tanmay, et al.
Published: (2024)
EcoSafeRAG: Efficient Security through Context Analysis in Retrieval-Augmented Generation
by: Yao, Ruobing, et al.
Published: (2025)
Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models
by: Golchin, Shahriar, et al.
Published: (2023)
Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling
by: Maekawa, Seiji, et al.
Published: (2025)
Retromorphic Testing with Hierarchical Verification for Hallucination Detection in RAG
by: Yu, Boxi, et al.
Published: (2026)
Quantifying Data Contamination in Psychometric Evaluations of LLMs
by: Han, Jongwook, et al.
Published: (2025)
HLL: Can Agents Cross Humanity's Last Line of Verification?
by: Song, Xinhao, et al.
Published: (2026)