Bibliographic Details
Main Author: Shalom Lijo, Solomon
Format: Digital resource
Language: English
Published: Zenodo 2026
Online Access: https://doi.org/10.5281/zenodo.19919043
Summary:
The dominant paradigm in large language model (LLM) evaluation reports accuracy on fixed task batteries. This paradigm is silent on the question that most determines whether AI systems are worth deploying: do they save more time and resources than they consume? A growing body of empirical work has begun to answer this question and has produced sharply conflicting results. Field experiments with GitHub Copilot report bounded coding tasks completed up to 55.8% faster (Peng et al., 2023); a Boston Consulting Group field study found consultants 25% faster on tasks within AI's frontier and 19 percentage points less likely to be correct on tasks outside it (Dell'Acqua et al., 2023); a 2025 randomized controlled trial of experienced open-source developers using Cursor Pro and Claude 3.5/3.7 Sonnet found that AI assistance made them 19% slower, while developers still believed AI had sped them up by 20% (Becker et al., 2025). None of the major capability benchmarks (MMLU, GPQA, HumanEval, GDPval, SWE-Lancer, METR Time Horizons) report a metric that resolves this contradiction. As a position-paper proposal, we introduce TSB (Time-Saved Benchmark), a five-component framework that any benchmark can adopt to report net productivity impact alongside accuracy: Net Time-Saved Ratio (NTSR), Reliability-Adjusted Net Time-Saved Ratio (RANTSR), Resource Efficiency Ratio (RER), a Waste-Mode Taxonomy of seven failure categories with annotator decision rules, and a Frontier Position Indicator (FPI) that reports whether a task lies inside or outside the system's reliable-completion frontier. We formalize each metric, specify a measurement protocol that captures verification and rework overhead, and illustrate the framework via retroactive case studies of the four published evaluations above.
The case studies suggest that headline figures may collapse substantially under reliability adjustment: GDPval's 100× inference-time advantage reduces to a ~4× deployed speedup, and BCG's outside-frontier 25% nominal speedup falls to a near-zero or negative reliability-adjusted contribution. Sensitivity analysis shows that these qualitative findings are robust to the assumptions we import. Because the underlying instrumentation does not yet exist in published reports (the gap motivating this paper), the case studies are illustrations of the framework's diagnostic value, not definitive measurements; prospective TSB-instrumented validation is the natural next step. We argue that every major benchmark should report TSB-style metrics as a first-class result, and we release the framework and a reference measurement protocol for community use.