_version_ 1866908669836263424
author Zhong, Tianyang
Liu, Zhengliang
Pan, Yi
Zhang, Yutong
Zhang, Zeyu
Zhou, Yifan
Liang, Shizhe
Wu, Zihao
Lyu, Yanjun
Shu, Peng
Yu, Xiaowei
Cao, Chao
Jiang, Hanqi
Chen, Hanxu
Li, Yiwei
Chen, Junhao
Hu, Huawen
Liu, Yiheng
Zhao, Huaqin
Xu, Shaochen
Dai, Haixing
Zhao, Lin
Zhang, Ruidong
Zhao, Wei
Yang, Zhenyuan
Chen, Jingyuan
Wang, Peilong
Ruan, Wei
Wang, Hui
Zhao, Huan
Zhang, Jing
Ren, Yiming
Qin, Shihuan
Chen, Tong
Li, Jiaxi
Zidan, Arif Hassan
Jahin, Afrar
Chen, Minheng
Xia, Sichen
Holmes, Jason
Zhuang, Yan
Wang, Jiaqi
Xu, Bochen
Xia, Weiran
Yu, Jichao
Tang, Kaibo
Yang, Yaxuan
Sun, Bolun
Yang, Tao
Lu, Guoyu
Wang, Xianqiao
Chai, Lilong
Li, He
Lu, Jin
Zhang, Xin
Ge, Bao
Hu, Xintao
Zhang, Lian
Zhou, Hua
Zhang, Lu
Zhang, Shu
Xiang, Zhen
Ren, Yudan
Liu, Jun
Jiang, Xi
Bao, Yu
Zhang, Wei
Li, Xiang
Li, Gang
Liu, Wei
Shen, Dinggang
Sikora, Andrea
Zhai, Xiaoming
Zhu, Dajiang
Zhang, Tuo
Liu, Tianming
contents This comprehensive study evaluates the performance of OpenAI's o1-preview large language model across a diverse array of complex reasoning tasks spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include:
- An 83.3% success rate in solving complex competitive programming problems, surpassing many human experts.
- Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models.
- 100% accuracy in high school-level mathematical reasoning tasks, with detailed step-by-step solutions.
- Advanced natural language inference capabilities across general and specialized domains such as medicine.
- Impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis.
- Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields.
- Strong capabilities in quantitative investing, with comprehensive financial knowledge and statistical modeling skills.
- Effective performance in social media analysis, including sentiment analysis and emotion recognition.
The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.
format Preprint
id arxiv_https___arxiv_org_abs_2409_18486
institution arXiv
publishDate 2024
record_format arxiv
title Evaluation of OpenAI o1: Opportunities and Challenges of AGI
topic Computation and Language
url https://arxiv.org/abs/2409.18486