Staff View: PetroBench: A Benchmark for Large Language Models in Petroleum Engineering :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Xiang; Zhang, Tingting; Wang, Sen; Wu, Ying; Meng, Heng; Zhou, Peng; Li, Peng
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.28032

_version_	1866918526650941440
author	Wang, Xiang; Zhang, Tingting; Wang, Sen; Wu, Ying; Meng, Heng; Zhou, Peng; Li, Peng
author_facet	Wang, Xiang; Zhang, Tingting; Wang, Sen; Wu, Ying; Meng, Heng; Zhou, Peng; Li, Peng
contents	Large Language Models are increasingly applied in the petroleum industry, highlighting the need for a domain-specific evaluation framework. This study develops a benchmark for LLMs in petroleum engineering, including a three-stage process of data preprocessing, quality filtering, and multi-model validation. Using expert review, a standardized question bank with strong domain relevance and discriminative capability was constructed. The benchmark covers production, reservoir, and drilling engineering, with 1,200 questions across multiple-choice, true or false, term definition, and short-answer formats. Eight mainstream LLMs were evaluated under a unified API environment. Results show that models performed better on subjective than objective questions, indicating weaknesses in factual knowledge discrimination. The highest accuracies for multiple-choice and true or false questions were 65.3% and 74.3%, respectively. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking achieved the best overall scores of 72%-74%. Models performed best in production engineering and weakest in reservoir engineering. Chinese models showed advantages in multiple-choice questions, while international models performed slightly better in short-answer questions. The benchmark provides a reproducible and practical reference for evaluating and deploying LLMs in petroleum engineering.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_28032
institution	arXiv
publishDate	2026
record_format	arxiv
title	PetroBench: A Benchmark for Large Language Models in Petroleum Engineering
topic	Artificial Intelligence
url	https://arxiv.org/abs/2605.28032

Similar Items