| Main Author: | Lu, Janna |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning; Artificial Intelligence; Computation and Language |
| Online Access: | https://arxiv.org/abs/2507.04562 |
| _version_ | 1866908478323294208 |
|---|---|
| author | Lu, Janna |
| author_facet | Lu, Janna |
| contents | Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their ability to forecast future events remains understudied. A year ago, large language models struggled to come close to the accuracy of a human crowd. I evaluate state-of-the-art LLMs on 464 forecasting questions from Metaculus, comparing their performance against top forecasters. Frontier models achieve Brier scores that ostensibly surpass the human crowd but still significantly underperform a group of experts. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2507_04562 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Evaluating LLMs on Real-World Forecasting Against Expert Forecasters |
| topic | Machine Learning Artificial Intelligence Computation and Language |
| url | https://arxiv.org/abs/2507.04562 |
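
The abstract compares forecasters by Brier score. As a minimal sketch, assuming binary questions and the standard mean-squared-error form of the score (this record does not say which variant the paper uses), the metric can be computed as below. The function name `brier_score` and the sample numbers are illustrative and are not taken from the paper.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probabilities and outcomes.

    Assumes binary questions: each outcome is 1 (event happened) or 0
    (it did not). Lower is better; 0.0 is a perfect forecast and a
    constant 0.5 forecast scores 0.25.
    """
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    if len(probabilities) == 0:
        raise ValueError("at least one forecast is required")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical forecasts on three resolved questions (not data from the paper).
example_forecasts = [0.9, 0.2, 0.5]
example_outcomes = [1, 0, 1]
example_score = brier_score(example_forecasts, example_outcomes)
```

Because the score is an average of squared errors, a forecaster is penalized more heavily for confident misses than for hedged ones, which is why it is a common basis for comparing models with human crowds.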