| Main Authors: | Yuan, Jiarui, Jin, Tailin, Chen, Weize, Liu, Zeyuan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.04811 |
Similar Items
Physics of Language Models: Part 3.2, Knowledge Manipulation
by: Allen-Zhu, Zeyuan, et al.
Published: (2023)
Physics of Language Models: Part 3.1, Knowledge Storage and Extraction
by: Allen-Zhu, Zeyuan, et al.
Published: (2023)
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
by: Allen-Zhu, Zeyuan, et al.
Published: (2024)
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models
by: Jin, Zhuoran, et al.
Published: (2024)
EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective
by: Wang, Yuyao, et al.
Published: (2026)
AssertBench: A Benchmark for Evaluating Self-Assertion in Large Language Models
by: Lee, Jaeho, et al.
Published: (2025)
Co-Evolution of Policy and Internal Reward for Language Agents
by: Wang, Xinyu, et al.
Published: (2026)
Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
by: Deng, Wenhao, et al.
Published: (2025)
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
by: Lin, Zicheng, et al.
Published: (2024)
Pre-training Limited Memory Language Models with Internal and External Knowledge
by: Zhao, Linxi, et al.
Published: (2025)
Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning
by: Zhou, Zhi, et al.
Published: (2025)
ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
by: Karger, Ezra, et al.
Published: (2024)
VL-RouterBench: A Benchmark for Vision-Language Model Routing
by: Huang, Zhehao, et al.
Published: (2025)
AlignBench: Benchmarking Chinese Alignment of Large Language Models
by: Liu, Xiao, et al.
Published: (2023)
Bench to the Future: A Pastcasting Benchmark for Forecasting Agents
by: FutureSearch, et al.
Published: (2025)
CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures
by: Pandey, Punya Syon, et al.
Published: (2025)
EscapeBench: Towards Advancing Creative Intelligence of Language Model Agents
by: Qian, Cheng, et al.
Published: (2024)
100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
by: Yang, Wang, et al.
Published: (2025)
HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation
by: Liu, Haokun, et al.
Published: (2025)
UserSumBench: A Benchmark Framework for Evaluating User Summarization Approaches
by: Wang, Chao, et al.
Published: (2024)
Confidence-aware Self-Semantic Distillation on Knowledge Graph Embedding
by: Liu, Yichen, et al.
Published: (2022)
JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)
DataSciBench: An LLM Agent Benchmark for Data Science
by: Zhang, Dan, et al.
Published: (2025)
ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure
by: Fujisawa, Ippei, et al.
Published: (2024)
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
by: White, Colin, et al.
Published: (2024)
BenchAgents: Multi-Agent Systems for Structured Benchmark Creation
by: Butt, Natasha, et al.
Published: (2024)
Agentic Critical Training
by: Liu, Weize, et al.
Published: (2026)
Physics of Language Models: Part 1, Learning Hierarchical Language Structures
by: Allen-Zhu, Zeyuan, et al.
Published: (2023)
ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation
by: Liu, Qin, et al.
Published: (2025)
SelfReflect: Can LLMs Communicate Their Internal Answer Distribution?
by: Kirchhof, Michael, et al.
Published: (2025)
LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
by: Long, Xiang, et al.
Published: (2026)
SELF: Self-Evolution with Language Feedback
by: Lu, Jianqiao, et al.
Published: (2023)
AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability
by: Yang, Siwei, et al.
Published: (2024)
SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning
by: He, Zelin, et al.
Published: (2026)
MileBench: Benchmarking MLLMs in Long Context
by: Song, Dingjie, et al.
Published: (2024)
EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking
by: Wei, Anjiang, et al.
Published: (2025)
MOSLD-Bench: Multilingual Open-Set Learning and Discovery Benchmark for Text Categorization
by: Costache, Adriana-Valentina, et al.
Published: (2026)
NoiseBench: Benchmarking the Impact of Real Label Noise on Named Entity Recognition
by: Merdjanovska, Elena, et al.
Published: (2024)
seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs
by: Ramezanali, Mohammad, et al.
Published: (2025)
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation
by: Kim, Eunsu, et al.
Published: (2025)