| Main Author: | Bugaud, Zacharie |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.10825 |
Similar Items
Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles
by: Bugaud, Zacharie
Published: (2026)
Multi-RF Fusion with Multi-GNN Blending for Molecular Property Prediction
by: Bugaud, Zacharie
Published: (2026)
Cortex-Inspired Continual Learning: Unsupervised Instantiation and Recovery of Functional Task Networks
by: McKee, Kevin, et al.
Published: (2026)
Rodent-Bench
by: Heap, Thomas, et al.
Published: (2026)
Goal-Directed Search Outperforms Goal-Agnostic Memory Compression in Long-Context Memory Tasks
by: Zheng, Yicong, et al.
Published: (2025)
CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy
by: Zhang, Mian, et al.
Published: (2024)
ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models
by: Qin, Chonghan, et al.
Published: (2026)
Reflection-Bench: Evaluating Epistemic Agency in Large Language Models
by: Li, Lingyu, et al.
Published: (2024)
CLR-Bench: Evaluating Large Language Models in College-level Reasoning
by: Dong, Junnan, et al.
Published: (2024)
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
by: Xie, Tinghao, et al.
Published: (2024)
EmoBench: Evaluating the Emotional Intelligence of Large Language Models
by: Sabour, Sahand, et al.
Published: (2024)
SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models
by: Monti, Sebastiano, et al.
Published: (2026)
ElecBench: a Power Dispatch Evaluation Benchmark for Large Language Models
by: Zhou, Xiyuan, et al.
Published: (2024)
MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models
by: Han, Tessa, et al.
Published: (2024)
ExKG-LLM: Leveraging Large Language Models for Automated Expansion of Cognitive Neuroscience Knowledge Graphs
by: Sarabadani, Ali, et al.
Published: (2025)
TurkBench: A Benchmark for Evaluating Turkish Large Language Models
by: Toraman, Çağrı, et al.
Published: (2026)
MedCalc-Bench: Evaluating Large Language Models for Medical Calculations
by: Khandekar, Nikhil, et al.
Published: (2024)
WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models
by: Tu, Shangqing, et al.
Published: (2023)
PsychCounsel-Bench: Evaluating the Psychology Intelligence of Large Language Models
by: Zeng, Min
Published: (2025)
SarcasmBench: Towards Evaluating Large Language Models on Sarcasm Understanding
by: Zhang, Yazhou, et al.
Published: (2024)
RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models
by: Yan, Jianhao, et al.
Published: (2024)
StyleBench: Evaluating thinking styles in Large Language Models
by: Guo, Junyu, et al.
Published: (2025)
DebugBench: Evaluating Debugging Capability of Large Language Models
by: Tian, Runchu, et al.
Published: (2024)
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving
by: Zhou, Xiyuan, et al.
Published: (2025)
A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models
by: Huang, Zhongzhan, et al.
Published: (2025)
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
by: Guo, Qianhong, et al.
Published: (2025)
ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models
by: Alhumud, Anas, et al.
Published: (2026)
LemonadeBench: Evaluating the Economic Intuition of Large Language Models in Simple Markets
by: Vyas, Aidan
Published: (2026)
BarrierBench: Evaluating Large Language Models for Safety Verification in Dynamical Systems
by: Taheri, Ali, et al.
Published: (2025)
InFoBench: Evaluating Instruction Following Ability in Large Language Models
by: Qin, Yiwei, et al.
Published: (2024)
FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models
by: Li, Wei, et al.
Published: (2024)
TeachBench: A Syllabus-Grounded Framework for Evaluating Teaching Ability in Large Language Models
by: Li, Zheng, et al.
Published: (2026)
PromptBench: A Unified Library for Evaluation of Large Language Models
by: Zhu, Kaijie, et al.
Published: (2023)
CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks
by: Feng, Jie, et al.
Published: (2024)
TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation
by: Yang, Xiaoda, et al.
Published: (2026)
SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
by: Hu, Tiancheng, et al.
Published: (2025)
MultiCNKG: Integrating Cognitive Neuroscience, Gene, and Disease Knowledge Graphs Using Large Language Models
by: Sarabadani, Ali, et al.
Published: (2025)
CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
by: Wang, Weida, et al.
Published: (2025)
AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science
by: Li, Chenyue, et al.
Published: (2025)
LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models
by: Parmar, Mihir, et al.
Published: (2024)