Bibliographic Details
Main Author: Mitra, Subhadip
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2603.28769
_version_ 1866915900749250560
author Mitra, Subhadip
author_facet Mitra, Subhadip
contents Evaluating large language models at scale remains a practical bottleneck for many organizations. While existing evaluation frameworks work well for thousands of examples, they struggle when datasets grow to hundreds of thousands or millions of samples. This scale is common when assessing model behavior across diverse domains or conducting comprehensive regression testing. We present Spark-LLM-Eval, a distributed evaluation framework built natively on Apache Spark. The system treats evaluation as a data-parallel problem, partitioning examples across executors and aggregating results with proper statistical accounting. Beyond raw throughput, we emphasize statistical rigor: every reported metric includes bootstrap confidence intervals, and model comparisons come with appropriate significance tests (paired t-tests, McNemar's test, or Wilcoxon signed-rank, depending on the metric type). The framework also addresses the cost problem inherent in LLM evaluation through content-addressable response caching backed by Delta Lake, which allows iterating on metric definitions without re-running inference. We describe the system architecture, the statistical methodology, and report benchmark results showing linear scaling with cluster size. The framework and all evaluation code are available as open source.
format Preprint
id arxiv_https___arxiv_org_abs_2603_28769
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation
Mitra, Subhadip
Distributed, Parallel, and Cluster Computing
Computation and Language
Machine Learning
I.2.7; H.3.4; D.2.8
Evaluating large language models at scale remains a practical bottleneck for many organizations. While existing evaluation frameworks work well for thousands of examples, they struggle when datasets grow to hundreds of thousands or millions of samples. This scale is common when assessing model behavior across diverse domains or conducting comprehensive regression testing. We present Spark-LLM-Eval, a distributed evaluation framework built natively on Apache Spark. The system treats evaluation as a data-parallel problem, partitioning examples across executors and aggregating results with proper statistical accounting. Beyond raw throughput, we emphasize statistical rigor: every reported metric includes bootstrap confidence intervals, and model comparisons come with appropriate significance tests (paired t-tests, McNemar's test, or Wilcoxon signed-rank, depending on the metric type). The framework also addresses the cost problem inherent in LLM evaluation through content-addressable response caching backed by Delta Lake, which allows iterating on metric definitions without re-running inference. We describe the system architecture, the statistical methodology, and report benchmark results showing linear scaling with cluster size. The framework and all evaluation code are available as open source.
title Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation
topic Distributed, Parallel, and Cluster Computing
Computation and Language
Machine Learning
I.2.7; H.3.4; D.2.8
url https://arxiv.org/abs/2603.28769
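
The abstract states that every reported metric includes a bootstrap confidence interval. As a rough illustration of that idea only, the sketch below computes a percentile bootstrap interval for the mean of per-example scores on a single machine. It is not taken from Spark-LLM-Eval: the function name `bootstrap_ci`, its parameters, and the choice of the percentile method are assumptions made for this example, and the record does not say which bootstrap variant the framework uses.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores.

    Hypothetical helper for illustration; not the Spark-LLM-Eval API.
    Returns (mean, lower, upper).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(scores)
    # Resample the scores with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # Read the interval bounds off the sorted resample means.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return statistics.fmean(scores), means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # 80 correct and 20 incorrect answers, scored as 1 and 0.
    example_scores = [1] * 80 + [0] * 20
    mean, lower, upper = bootstrap_ci(example_scores)
    print(f"accuracy={mean:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

In a distributed setting such as the one the abstract describes, the resampling would presumably run over partitioned data rather than an in-memory list; this sketch only shows the statistic itself.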