Bibliographic Details
Main Authors: Krumdick, Michael; Reddy, Varshini; Chaudhary, Shivani; Day, William; Ahmed, Maarij; Haqqi, Hayan; Fahim, Muhammad Ahsen; Amjad, Hanzallah; Orakzai, Ahmad; Gul, Aqsa; Tanner, Chris
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2604.05912
author Krumdick, Michael
Reddy, Varshini
Chaudhary, Shivani
Day, William
Ahmed, Maarij
Haqqi, Hayan
Fahim, Muhammad Ahsen
Amjad, Hanzallah
Orakzai, Ahmad
Gul, Aqsa
Tanner, Chris
contents As concerns surrounding AI-driven labor displacement intensify in knowledge-intensive sectors, existing benchmarks fail to measure performance on tasks that define practical professional expertise. Finance, in particular, has been identified as a domain with high AI exposure risk, yet lacks robust benchmarks to track real-world developments. This gap is compounded by the absence of clear accountability mechanisms in current Large Language Model (LLM) deployments. To address this, we introduce FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core finance models, requiring an average of over 18 hours of skilled human labor per task to complete. Developed with financial professionals, the benchmark reflects industry-standard financial modeling workflows and is paired with detailed rubrics for structured evaluation. We engage human experts to define the tasks, create rubrics, grade LLMs, and perform the tasks themselves as human baselines. We demonstrate that our human experts both receive higher scores on average and are more likely to provide client-ready outputs than current state-of-the-art systems.
format Preprint
id arxiv_https___arxiv_org_abs_2604_05912
institution arXiv
publishDate 2026
record_format arxiv
title FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
topic Computation and Language
url https://arxiv.org/abs/2604.05912