Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Author:	Marín, Javier
Format:	Preprint
Published:	2025
Subjects:	Machine Learning Artificial Intelligence
Online Access:	https://arxiv.org/abs/2511.11596
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866918202275004416
author	Marín, Javier
author_facet	Marín, Javier
contents	Loss Given Default (LGD) modeling faces a fundamental data quality constraint: 90% of available training data consists of proxy estimates based on pre-distress balance sheets rather than actual recovery outcomes from completed bankruptcy proceedings. We demonstrate that this mixture-contaminated training structure causes systematic failure of recursive partitioning methods, with Random Forest achieving negative r-squared (-0.664, worse than predicting the mean) on held-out test data. Information-theoretic approaches based on Shannon entropy and mutual information provide superior generalization, achieving r-squared of 0.191 and RMSE of 0.284 on 1,218 corporate bankruptcies (1980-2023). Analysis reveals that leverage-based features contain 1.510 bits of mutual information while size effects contribute only 0.086 bits, contradicting regulatory assumptions about scale-dependent recovery. These results establish practical guidance for financial institutions deploying LGD models under Basel III requirements when representative outcome data is unavailable at sufficient scale. The findings generalize to medical outcomes research, climate forecasting, and technology reliability-domains where extended observation periods create unavoidable mixture structure in training data.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_11596
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Loss Given Default Prediction Under Measurement-Induced Mixture Distributions: An Information-Theoretic Approach Marín, Javier Machine Learning Artificial Intelligence Loss Given Default (LGD) modeling faces a fundamental data quality constraint: 90% of available training data consists of proxy estimates based on pre-distress balance sheets rather than actual recovery outcomes from completed bankruptcy proceedings. We demonstrate that this mixture-contaminated training structure causes systematic failure of recursive partitioning methods, with Random Forest achieving negative r-squared (-0.664, worse than predicting the mean) on held-out test data. Information-theoretic approaches based on Shannon entropy and mutual information provide superior generalization, achieving r-squared of 0.191 and RMSE of 0.284 on 1,218 corporate bankruptcies (1980-2023). Analysis reveals that leverage-based features contain 1.510 bits of mutual information while size effects contribute only 0.086 bits, contradicting regulatory assumptions about scale-dependent recovery. These results establish practical guidance for financial institutions deploying LGD models under Basel III requirements when representative outcome data is unavailable at sufficient scale. The findings generalize to medical outcomes research, climate forecasting, and technology reliability-domains where extended observation periods create unavoidable mixture structure in training data.
title	Loss Given Default Prediction Under Measurement-Induced Mixture Distributions: An Information-Theoretic Approach
topic	Machine Learning Artificial Intelligence
url	https://arxiv.org/abs/2511.11596

Similar Items