Bibliographic Details
Main Authors: Aghazadeh-Chakherlou, Robab; Guo, Qing; Khastgir, Siddartha; Popov, Peter; Zhang, Xiaoge; Zhao, Xingyu
Format: Preprint
Published: 2025
Subjects: Software Engineering; Artificial Intelligence
Online Access: https://arxiv.org/abs/2511.00527
author Aghazadeh-Chakherlou, Robab
Guo, Qing
Khastgir, Siddartha
Popov, Peter
Zhang, Xiaoge
Zhao, Xingyu
contents Large Language Models (LLMs) are increasingly deployed across diverse domains, raising the need for rigorous reliability assessment methods. Existing benchmark-based evaluations primarily offer descriptive statistics of model accuracy over datasets, providing limited insight into the probabilistic behavior of LLMs under real operational conditions. This paper introduces HIP-LLM, a Hierarchical Imprecise Probability framework for modeling and inferring LLM reliability. Building upon the foundations of software reliability engineering, HIP-LLM defines LLM reliability as the probability of failure-free operation over a specified number of future tasks under a given Operational Profile (OP). HIP-LLM represents dependencies across (sub-)domains hierarchically, enabling multi-level inference from subdomain to system-level reliability. HIP-LLM embeds imprecise priors to capture epistemic uncertainty and incorporates OPs to reflect usage contexts. It derives posterior reliability envelopes that quantify uncertainty across priors and data. Experiments on multiple benchmark datasets demonstrate that HIP-LLM offers a more accurate and standardized reliability characterization than existing benchmark and state-of-the-art approaches. A publicly accessible repository of HIP-LLM is provided.
format Preprint
id arxiv_https___arxiv_org_abs_2511_00527
institution arXiv
publishDate 2025
record_format arxiv
title A Hierarchical Imprecise Probability Approach to Reliability Assessment of Large Language Models
topic Software Engineering
Artificial Intelligence
url https://arxiv.org/abs/2511.00527
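
Note on the abstract: the reliability definition it gives (probability of failure-free operation over a specified number of future tasks under an Operational Profile, bounded across a set of priors) can be illustrated with a minimal Python sketch. This is not the HIP-LLM implementation. It assumes an independent Beta-Binomial model per subdomain, a prior set given by a prior mean ranging over an interval with a fixed prior strength, and an OP expressed as subdomain weights; all function names, parameters, and numbers are hypothetical, and the hierarchical dependencies the record mentions are not modelled.

```python
from math import prod


def failure_free_prob(a, b, successes, failures, n_future):
    """Posterior predictive probability that the next n_future tasks all succeed,
    given a Beta(a, b) prior on the per-task success probability (assumed model)."""
    a_post = a + successes
    total = a + b + successes + failures
    return prod((a_post + i) / (total + i) for i in range(n_future))


def reliability_envelope(successes, failures, n_future, mean_lo, mean_hi, strength):
    """Lower and upper posterior reliability over the assumed prior set:
    Beta priors with mean in [mean_lo, mean_hi] and fixed strength a + b.
    With the strength fixed, the result increases with the prior mean,
    so the bounds are attained at the two endpoints."""
    lower = failure_free_prob(strength * mean_lo, strength * (1 - mean_lo),
                              successes, failures, n_future)
    upper = failure_free_prob(strength * mean_hi, strength * (1 - mean_hi),
                              successes, failures, n_future)
    return lower, upper


def system_envelope_single_task(data, op, mean_lo, mean_hi, strength):
    """OP-weighted bounds on the probability that one future task succeeds.
    data maps subdomain -> (successes, failures); op maps subdomain -> weight.
    Subdomains are treated as independent here, which is a simplification."""
    lower = upper = 0.0
    for name, weight in op.items():
        s, f = data[name]
        lo, hi = reliability_envelope(s, f, 1, mean_lo, mean_hi, strength)
        lower += weight * lo
        upper += weight * hi
    return lower, upper


if __name__ == "__main__":
    # Hypothetical benchmark counts and operational profile.
    data = {"math": (8, 2), "code": (18, 2)}
    op = {"math": 0.5, "code": 0.5}
    print(reliability_envelope(8, 2, 2, 0.0, 1.0, 2.0))
    print(system_envelope_single_task(data, op, 0.0, 1.0, 2.0))
```

The gap between the lower and upper value reflects how much the conclusion depends on the choice of prior; it narrows as more test outcomes are observed in a subdomain.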