Staff View: ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Hongwei; Liu, Junnan; Liu, Shudong; Duan, Haodong; Li, Yuqiang; Su, Mao; Liu, Xiaohong; Zhai, Guangtao; Fang, Xinyu; Ma, Qianhong; Zhang, Taolin; Ma, Zihan; Zhao, Yufeng; Zhou, Peiheng; Xiao, Linchen; Zhang, Wenlong; Zhou, Shijie; Ma, Xingjian; Sun, Siqi; Ge, Jiaye; Li, Meng; Liu, Yuhong; Dong, Jianxin; Li, Jiaying; Wu, Hui; Liang, Hanwen; Lin, Jintai; Wang, Yanting; Dong, Jie; Zhu, Tong; Fu, Tianfan; He, Conghui; Zhang, Qi; Zhang, Songyang; Bai, Lei; Chen, Kai
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2511.14366

_version_	1866908666430488576
author	Liu, Hongwei; Liu, Junnan; Liu, Shudong; Duan, Haodong; Li, Yuqiang; Su, Mao; Liu, Xiaohong; Zhai, Guangtao; Fang, Xinyu; Ma, Qianhong; Zhang, Taolin; Ma, Zihan; Zhao, Yufeng; Zhou, Peiheng; Xiao, Linchen; Zhang, Wenlong; Zhou, Shijie; Ma, Xingjian; Sun, Siqi; Ge, Jiaye; Li, Meng; Liu, Yuhong; Dong, Jianxin; Li, Jiaying; Wu, Hui; Liang, Hanwen; Lin, Jintai; Wang, Yanting; Dong, Jie; Zhu, Tong; Fu, Tianfan; He, Conghui; Zhang, Qi; Zhang, Songyang; Bai, Lei; Chen, Kai
author_facet	Liu, Hongwei; Liu, Junnan; Liu, Shudong; Duan, Haodong; Li, Yuqiang; Su, Mao; Liu, Xiaohong; Zhai, Guangtao; Fang, Xinyu; Ma, Qianhong; Zhang, Taolin; Ma, Zihan; Zhao, Yufeng; Zhou, Peiheng; Xiao, Linchen; Zhang, Wenlong; Zhou, Shijie; Ma, Xingjian; Sun, Siqi; Ge, Jiaye; Li, Meng; Liu, Yuhong; Dong, Jianxin; Li, Jiaying; Wu, Hui; Liang, Hanwen; Lin, Jintai; Wang, Yanting; Dong, Jie; Zhu, Tong; Fu, Tianfan; He, Conghui; Zhang, Qi; Zhang, Songyang; Bai, Lei; Chen, Kai
contents	The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, questioning their ability to distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD-level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage; (2) Cross-Disciplinary Focus, designed to assess models' ability to integrate knowledge and reason across scientific domains; (3) High-Fidelity Answers, prioritizing complex, open-ended answers involving multi-step reasoning and LaTeX-formatted expressions over simple multiple-choice questions; and (4) Rigorous Quality Control, employing a multi-stage process of expert peer review and adversarial testing to ensure question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm using a panel of LLM judges for automated, nuanced assessment of complex answers. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities. We plan to develop ATLAS into a long-term, open, community-driven platform to provide a reliable "ruler" for progress toward Artificial General Intelligence.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_14366
institution	arXiv
publishDate	2025
record_format	arxiv
title	ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
topic	Computation and Language
url	https://arxiv.org/abs/2511.14366

Similar Items