Staff View: A Hierarchical Imprecise Probability Approach to Reliability Assessment of Large Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Aghazadeh-Chakherlou, Robab; Guo, Qing; Khastgir, Siddartha; Popov, Peter; Zhang, Xiaoge; Zhao, Xingyu
Format:	Preprint
Published:	2025
Subjects:	Software Engineering; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2511.00527

_version_	1866911404902055936
author	Aghazadeh-Chakherlou, Robab; Guo, Qing; Khastgir, Siddartha; Popov, Peter; Zhang, Xiaoge; Zhao, Xingyu
author_facet	Aghazadeh-Chakherlou, Robab; Guo, Qing; Khastgir, Siddartha; Popov, Peter; Zhang, Xiaoge; Zhao, Xingyu
contents	Large Language Models (LLMs) are increasingly deployed across diverse domains, raising the need for rigorous reliability assessment methods. Existing benchmark-based evaluations primarily offer descriptive statistics of model accuracy over datasets, providing limited insight into the probabilistic behavior of LLMs under real operational conditions. This paper introduces HIP-LLM, a Hierarchical Imprecise Probability framework for modeling and inferring LLM reliability. Building upon the foundations of software reliability engineering, HIP-LLM defines LLM reliability as the probability of failure-free operation over a specified number of future tasks under a given Operational Profile (OP). HIP-LLM represents dependencies across (sub-)domains hierarchically, enabling multi-level inference from subdomain to system-level reliability. HIP-LLM embeds imprecise priors to capture epistemic uncertainty and incorporates OPs to reflect usage contexts. It derives posterior reliability envelopes that quantify uncertainty across priors and data. Experiments on multiple benchmark datasets demonstrate that HIP-LLM offers a more accurate and standardized reliability characterization than existing benchmark and state-of-the-art approaches. A publicly accessible repository of HIP-LLM is provided.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_00527
institution	arXiv
publishDate	2025
record_format	arxiv
title	A Hierarchical Imprecise Probability Approach to Reliability Assessment of Large Language Models
topic	Software Engineering; Artificial Intelligence
url	https://arxiv.org/abs/2511.00527

Similar Items