| Main Authors: | Feng, Kaiyue; Zhao, Yilun; Liu, Yixin; Yang, Tianyu; Zhao, Chen; Sous, John; Cohan, Arman |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2503.21821 |
| _version_ | 1866913150465474560 |
|---|---|
| author | Feng, Kaiyue; Zhao, Yilun; Liu, Yixin; Yang, Tianyu; Zhao, Chen; Sous, John; Cohan, Arman |
| author_facet | Feng, Kaiyue; Zhao, Yilun; Liu, Yixin; Yang, Tianyu; Zhao, Chen; Sous, John; Cohan, Arman |
| contents | We introduce PHYSICS, a comprehensive benchmark for university-level physics problem solving. It contains 1297 expert-annotated problems covering six core areas: classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics. Each problem requires advanced physics knowledge and mathematical reasoning. We develop a robust automated evaluation system for precise and reliable validation. Our evaluation of leading foundation models reveals substantial limitations. Even the most advanced model, o3-mini, achieves only 59.9% accuracy, highlighting significant challenges in solving high-level scientific problems. Through comprehensive error analysis, exploration of diverse prompting strategies, and Retrieval-Augmented Generation (RAG)-based knowledge augmentation, we identify key areas for improvement, laying the foundation for future advancements. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2503_21821 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving |
| topic | Artificial Intelligence |
| url | https://arxiv.org/abs/2503.21821 |