| Main Authors: | Song, Xidan, Wang, Weiqi, Cao, Ruifeng, Hu, Qingya |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.15033 |
Similar Items
ChessQA: Evaluating Large Language Models for Chess Understanding
by: Wen, Qianfeng, et al.
Published: (2025)
ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models
by: Liu, Jincheng, et al.
Published: (2025)
Assisting Research Proposal Writing with Large Language Models: Evaluation and Refinement
by: Ren, Jing, et al.
Published: (2025)
Beyond Scalars: Evaluating and Understanding LLM Reasoning via Geometric Progress and Stability
by: Jiang, Xinyan, et al.
Published: (2026)
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
by: Mondorf, Philipp, et al.
Published: (2024)
Tracking World States with Language Models: State-Based Evaluation Using Chess
by: Harang, Romain, et al.
Published: (2025)
Predicting Human Chess Moves: An AI Assisted Analysis of Chess Games Using Skill-group Specific n-gram Language Models
by: Zhong, Daren, et al.
Published: (2025)
Measuring Stability Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology
by: D'Amario, Vanessa, et al.
Published: (2025)
VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling
by: Li, Weiqi, et al.
Published: (2025)
Personalized Large Language Model Assistant with Evolving Conditional Memory
by: Yuan, Ruifeng, et al.
Published: (2023)
Grounded Chess Reasoning in Language Models via Master Distillation
by: Tang, Zhenwei, et al.
Published: (2026)
Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models
by: Huang, Liangjie, et al.
Published: (2025)
Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models
by: Mächtle, Felix, et al.
Published: (2026)
Beyond Multiple-Choice Accuracy: Real-World Challenges of Implementing Large Language Models in Healthcare
by: Yang, Yifan, et al.
Published: (2024)
Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models
by: Tang, Ethan
Published: (2026)
Mixture of Masters: Sparse Chess Language Models with Player Routing
by: Frisoni, Giacomo, et al.
Published: (2026)
GraphScout: Empowering Large Language Models with Intrinsic Exploration Ability for Agentic Graph Reasoning
by: Ying, Yuchen, et al.
Published: (2026)
Toward Modeling Player-Specific Chess Behaviors
by: Sogliuzzo, Loris, et al.
Published: (2026)
Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
by: Yang, Xia, et al.
Published: (2026)
Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess
by: Hwang, Dongyoon, et al.
Published: (2025)
The ORCA Benchmark: Evaluating Real-World Calculation Accuracy in Large Language Models
by: Herambourg, Claudia, et al.
Published: (2025)
MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property
by: Ni, Shiwen, et al.
Published: (2024)
Bridging the Gap between Expert and Language Models: Concept-guided Chess Commentary Generation and Evaluation
by: Kim, Jaechang, et al.
Published: (2024)
Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks
by: Pimentel, Marco AF, et al.
Published: (2024)
Complete Chess Games Enable LLM Become A Chess Master
by: Zhang, Yinqi, et al.
Published: (2025)
Beyond Accuracy: Evaluating Forecasting Models by Multi-Echelon Inventory Cost
by: Marik, Swata, et al.
Published: (2026)
Beyond Lines and Circles: Unveiling the Geometric Reasoning Gap in Large Language Models
by: Mouselinos, Spyridon, et al.
Published: (2024)
An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress
by: Karimov, Hikmat, et al.
Published: (2026)
Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering
by: Chen, Zixin, et al.
Published: (2025)
Amortized Planning with Large-Scale Transformers: A Case Study on Chess
by: Ruoss, Anian, et al.
Published: (2024)
Evaluating In Silico Creativity: An Expert Review of AI Chess Compositions
by: Veeriah, Vivek, et al.
Published: (2025)
NUMCoT: Numerals and Units of Measurement in Chain-of-Thought Reasoning using Large Language Models
by: Xu, Ancheng, et al.
Published: (2024)
Abstract Concept Modelling in Conceptual Spaces: A Study on Chess Strategies
by: Banaee, Hadi, et al.
Published: (2026)
Maia-2: A Unified Model for Human-AI Alignment in Chess
by: Tang, Zhenwei, et al.
Published: (2024)
AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field
by: Liang, Chen, et al.
Published: (2025)
Learning to Imitate with Less: Efficient Individual Behavior Modeling in Chess
by: Tang, Zhenwei, et al.
Published: (2025)
A Comprehensive Evaluation of Large Language Models on Aspect-Based Sentiment Analysis
by: Zhou, Changzhi, et al.
Published: (2024)
Generating Creative Chess Puzzles
by: Feng, Xidong, et al.
Published: (2025)
Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks
by: Zhou, Yongxi, et al.
Published: (2026)
Enhanced Semantic Segmentation Pipeline for WeatherProof Dataset Challenge
by: Zhang, Nan, et al.
Published: (2024)