| Main Author: | Fan, Yang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2501.13983 |
Similar Items
LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction
by: Li, Yucheng, et al.
Published: (2023)
EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models
by: Mohammadi, Hadi, et al.
Published: (2025)
CriticEval: Evaluating Large Language Model as Critic
by: Lan, Tian, et al.
Published: (2024)
FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom
by: He, Yuanqin, et al.
Published: (2024)
TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning
by: Li, Xiang, et al.
Published: (2024)
An Open Source Data Contamination Report for Large Language Models
by: Li, Yucheng, et al.
Published: (2023)
Investigating Data Contamination in Modern Benchmarks for Large Language Models
by: Deng, Chunyuan, et al.
Published: (2023)
Likelihood-based Mitigation of Evaluation Bias in Large Language Models
by: Oi, Masanari, et al.
Published: (2024)
Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models
by: Golchin, Shahriar, et al.
Published: (2023)
LexInstructEval: Lexical Instruction Following Evaluation for Large Language Models
by: Ren, Huimin, et al.
Published: (2025)
WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models
by: Gupta, Prannaya, et al.
Published: (2024)
CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models
by: Yu, Linhao, et al.
Published: (2024)
Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination
by: Chen, Simin, et al.
Published: (2025)
Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models
by: Dong, Yihong, et al.
Published: (2024)
FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models
by: Yu, Zhuohao, et al.
Published: (2024)
IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models
by: Bharti, Saurabh, et al.
Published: (2026)
ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models
by: Nguyen, Trong-Hieu, et al.
Published: (2024)
NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes
by: Fan, Lizhou, et al.
Published: (2023)
The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models
by: Katzy, Jonathan, et al.
Published: (2025)
GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models
by: Zhang, Tao, et al.
Published: (2024)
EpiK-Eval: Evaluation for Language Models as Epistemic Models
by: Prato, Gabriele, et al.
Published: (2023)
Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
by: Chen, Simin, et al.
Published: (2025)
EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education
by: Ma, Guoqing, et al.
Published: (2025)
StrucText-Eval: Evaluating Large Language Model's Reasoning Ability in Structure-Rich Text
by: Gu, Zhouhong, et al.
Published: (2024)
Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation
by: Zhang, Yichi, et al.
Published: (2025)
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models
by: Wang, Fei, et al.
Published: (2024)
CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation
by: Zhao, Jingqian, et al.
Published: (2025)
MTQ-Eval: Multilingual Text Quality Evaluation for Language Models
by: Pokharel, Rhitabrat, et al.
Published: (2025)
Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges
by: Samuel, Vinay, et al.
Published: (2024)
Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines
by: Song, Hwanjun
Published: (2026)
MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models
by: Lee, Suhyun, et al.
Published: (2026)
SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks
by: Li, Tianhao, et al.
Published: (2024)
BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali
by: Adib, Shefayat E Shams, et al.
Published: (2026)
R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models
by: Tu, Shangqing, et al.
Published: (2024)
Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models
by: Patel, Nisarg, et al.
Published: (2024)
Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models
by: Jiang, Chaoya, et al.
Published: (2024)
Sensitivity of Small Language Models to Fine-tuning Data Contamination
by: Scaria, Nicy, et al.
Published: (2025)
EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria
by: Kim, Tae Soo, et al.
Published: (2023)
ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages
by: Kammakomati, Mehant, et al.
Published: (2024)
ArxEval: Evaluating Retrieval and Generation in Language Models for Scientific Literature
by: Sinha, Aarush, et al.
Published: (2025)