| Main Authors: | Chen, Junhao, Sun, Jingbo, Li, Xiang, Xin, Haidong, Xue, Yuhao, Xu, Yibin, Zhao, Hao |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2509.16610 |
Similar Items
TransportationGames: Benchmarking Transportation Knowledge of (Multimodal) Large Language Models
by: Zhang, Xue, et al.
Published: (2024)
GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations
by: Duan, Jinhao, et al.
Published: (2024)
Position IDs Matter: An Enhanced Position Layout for Efficient Context Compression in Large Language Models
by: Zhao, Runsong, et al.
Published: (2024)
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents
by: Sun, Haochen, et al.
Published: (2025)
ICLEval: Evaluating In-Context Learning Ability of Large Language Models
by: Chen, Wentong, et al.
Published: (2024)
Evaluating the Generation Capabilities of Large Chinese Language Models
by: Zeng, Hui, et al.
Published: (2023)
Enhancing Large Language Models (LLMs) for Telecommunications using Knowledge Graphs and Retrieval-Augmented Generation
by: Yuan, Dun, et al.
Published: (2025)
Focus-LIME: Surgical Interpretation of Long-Context Large Language Models via Proxy-Based Neighborhood Selection
by: Liu, Junhao, et al.
Published: (2026)
Multi-agent KTO: Reinforcing Strategic Interactions of Large Language Model in Language Game
by: Ye, Rong, et al.
Published: (2025)
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
by: Guo, Qianhong, et al.
Published: (2025)
OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education
by: Zhang, Min, et al.
Published: (2025)
Evaluating Large Language Models in Crisis Detection: A Real-World Benchmark from Psychological Support Hotlines
by: Deng, Guifeng, et al.
Published: (2025)
Large Language Models for Classical Chinese Poetry Translation: Benchmarking, Evaluating, and Improving
by: Chen, Andong, et al.
Published: (2024)
EmotionQueen: A Benchmark for Evaluating Empathy of Large Language Models
by: Chen, Yuyan, et al.
Published: (2024)
MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks
by: Ruan, Junhao, et al.
Published: (2026)
Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
by: Sun, Haoxiang, et al.
Published: (2025)
Evaluating Counterfactual Strategic Reasoning in Large Language Models
by: Georgousis, Dimitrios, et al.
Published: (2026)
ECKGBench: Benchmarking Large Language Models in E-commerce Leveraging Knowledge Graph
by: Liu, Langming, et al.
Published: (2025)
LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs
by: Wu, Yuhao, et al.
Published: (2024)
When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation
by: Jiang, Xunyi, et al.
Published: (2025)
EarthSE: A Benchmark for Evaluating Earth Scientific Exploration Capability of LLMs
by: Xu, Wanghan, et al.
Published: (2025)
StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control
by: Zhao, Haishu, et al.
Published: (2026)
Semantic Triplet Restoration: A Novel Protocol for Hierarchical Table Understanding in Large Language Models
by: Zhao, Yibin, et al.
Published: (2026)
OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models
by: Liu, Yang, et al.
Published: (2024)
Foundations of Large Language Models
by: Xiao, Tong, et al.
Published: (2025)
MathScape: Benchmarking Multimodal Large Language Models in Real-World Mathematical Contexts
by: Liang, Hao, et al.
Published: (2024)
Benchmarking and Rethinking Knowledge Editing for Large Language Models
by: He, Guoxiu, et al.
Published: (2025)
BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models
by: Dong, Zican, et al.
Published: (2023)
Strategic Insights in Human and Large Language Model Tactics at Word Guessing Games
by: Rikters, Matīss, et al.
Published: (2024)
SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research
by: Sun, Liangtai, et al.
Published: (2023)
LLMs on Trial: Evaluating Judicial Fairness for Large Language Models
by: Hu, Yiran, et al.
Published: (2025)
CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models
by: Qiu, Zexuan, et al.
Published: (2024)
SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models
by: Xu, Zixiang, et al.
Published: (2025)
PoC: Performance-oriented Context Compression for Large Language Models via Performance Prediction
by: Zhao, Runsong, et al.
Published: (2026)
Marathon: A Race Through the Realm of Long Context with Large Language Models
by: Zhang, Lei, et al.
Published: (2023)
$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens
by: Zhang, Xinrong, et al.
Published: (2024)
Single-Cell Omics Arena: A Benchmark Study for Large Language Models on Cell Type Annotation Using Single-Cell Data
by: Liu, Junhao, et al.
Published: (2024)
Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors
by: Liu, Xin, et al.
Published: (2025)
Evaluating Large Language Models for Radiology Natural Language Processing
by: Liu, Zhengliang, et al.
Published: (2023)
On Context Utilization in Summarization with Large Language Models
by: Ravaut, Mathieu, et al.
Published: (2023)