author Liu, Hongwei
Liu, Junnan
Liu, Shudong
Duan, Haodong
Li, Yuqiang
Su, Mao
Liu, Xiaohong
Zhai, Guangtao
Fang, Xinyu
Ma, Qianhong
Zhang, Taolin
Ma, Zihan
Zhao, Yufeng
Zhou, Peiheng
Xiao, Linchen
Zhang, Wenlong
Zhou, Shijie
Ma, Xingjian
Sun, Siqi
Ge, Jiaye
Li, Meng
Liu, Yuhong
Dong, Jianxin
Li, Jiaying
Wu, Hui
Liang, Hanwen
Lin, Jintai
Wang, Yanting
Dong, Jie
Zhu, Tong
Fu, Tianfan
He, Conghui
Zhang, Qi
Zhang, Songyang
Bai, Lei
Chen, Kai
contents The rapid advancement of Large Language Models (LLMs) has saturated performance on many established benchmarks, calling into question whether those benchmarks can still distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD-level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage; (2) Cross-Disciplinary Focus, designed to assess models' ability to integrate knowledge and reason across scientific domains; (3) High-Fidelity Answers, prioritizing complex, open-ended answers involving multi-step reasoning and LaTeX-formatted expressions over simple multiple-choice questions; and (4) Rigorous Quality Control, employing a multi-stage process of expert peer review and adversarial testing to ensure question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm that uses a panel of LLM judges for automated, nuanced assessment of complex answers. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities. We plan to develop ATLAS into a long-term, open, community-driven platform to provide a reliable "ruler" for progress toward Artificial General Intelligence.
format Preprint
id arxiv_https___arxiv_org_abs_2511_14366
institution arXiv
publishDate 2025
record_format arxiv
title ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
topic Computation and Language
url https://arxiv.org/abs/2511.14366