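Staff View: When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets :: Library Catalog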

Bibliographic Details
Main Authors:	Dai, Zeshi; Peng, Zimo; Cheng, Zerui; Li, Ryan Yihe
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence; Computational Engineering, Finance, and Science; I.6.4; I.2.1
Online Access:	https://arxiv.org/abs/2510.00332

_version_	1866914259750879232
author	Dai, Zeshi; Peng, Zimo; Cheng, Zerui; Li, Ryan Yihe
author_facet	Dai, Zeshi; Peng, Zimo; Cheng, Zerui; Li, Ryan Yihe
contents	We present CAIA, a benchmark exposing a critical blind spot in AI evaluation: the inability of state-of-the-art models to operate in adversarial, high-stakes environments where misinformation is weaponized and errors are irreversible. While existing benchmarks measure task completion in controlled settings, real-world deployment demands resilience against active deception. Using crypto markets as a testbed where $30 billion was lost to exploits in 2024, we evaluate 17 models on 178 time-anchored tasks requiring agents to distinguish truth from manipulation, navigate fragmented information landscapes, and make irreversible financial decisions under adversarial pressure. Our results reveal a fundamental capability gap: without tools, even frontier models achieve only 28% accuracy on tasks junior analysts routinely handle. Tool augmentation improves performance but plateaus at 67.4% versus an 80% human baseline, despite unlimited access to professional resources. Most critically, we uncover a systematic tool selection catastrophe: models preferentially choose unreliable web search over authoritative data, falling for SEO-optimized misinformation and social media manipulation. This behavior persists even when correct answers are directly accessible through specialized tools, suggesting foundational limitations rather than knowledge gaps. We also find that Pass@k metrics mask trial-and-error behavior that is dangerous for autonomous deployment. The implications extend beyond crypto to any domain with active adversaries, such as cybersecurity and content moderation. We release CAIA with contamination controls and continuous updates, establishing adversarial robustness as a necessary condition for trustworthy AI autonomy. The benchmark reveals that current models, despite impressive reasoning scores, remain fundamentally unprepared for environments where intelligence must survive active opposition.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_00332
institution	arXiv
publishDate	2025
record_format	arxiv
title	When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets
topic	Artificial Intelligence; Computational Engineering, Finance, and Science; I.6.4; I.2.1
url	https://arxiv.org/abs/2510.00332

Similar Items