Bibliographic Details
Main Authors: Wang, Xiang; Zhang, Tingting; Wang, Sen; Wu, Ying; Meng, Heng; Zhou, Peng; Li, Peng
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2605.28032
contents Large language models (LLMs) are increasingly applied in the petroleum industry, which highlights the need for a domain-specific evaluation framework. This study develops a benchmark for LLMs in petroleum engineering, built through a three-stage process of data preprocessing, quality filtering, and multi-model validation. Through expert review, a standardized question bank with strong domain relevance and discriminative capability was constructed. The benchmark covers production, reservoir, and drilling engineering, with 1,200 questions in multiple-choice, true-or-false, term-definition, and short-answer formats. Eight mainstream LLMs were evaluated under a unified API environment. The models performed better on subjective questions than on objective ones, indicating weaknesses in factual knowledge discrimination. The highest accuracies on multiple-choice and true-or-false questions were 65.3% and 74.3%, respectively. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking achieved the best overall scores, at 72%-74%. Models performed best in production engineering and weakest in reservoir engineering. Chinese models showed advantages on multiple-choice questions, while international models performed slightly better on short-answer questions. The benchmark provides a reproducible and practical reference for evaluating and deploying LLMs in petroleum engineering.
format Preprint
id arxiv_https___arxiv_org_abs_2605_28032
institution arXiv
publishDate 2026
record_format arxiv
title PetroBench: A Benchmark for Large Language Models in Petroleum Engineering
topic Artificial Intelligence
url https://arxiv.org/abs/2605.28032