Saved in:
Bibliographic Details
Main Author: Marín, Javier
Format: Preprint
Published: 2025
Subjects:
Online Access:https://arxiv.org/abs/2511.11596
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866918202275004416
author Marín, Javier
author_facet Marín, Javier
contents Loss Given Default (LGD) modeling faces a fundamental data quality constraint: 90% of available training data consists of proxy estimates based on pre-distress balance sheets rather than actual recovery outcomes from completed bankruptcy proceedings. We demonstrate that this mixture-contaminated training structure causes systematic failure of recursive partitioning methods, with Random Forest achieving negative r-squared (-0.664, worse than predicting the mean) on held-out test data. Information-theoretic approaches based on Shannon entropy and mutual information provide superior generalization, achieving r-squared of 0.191 and RMSE of 0.284 on 1,218 corporate bankruptcies (1980-2023). Analysis reveals that leverage-based features contain 1.510 bits of mutual information while size effects contribute only 0.086 bits, contradicting regulatory assumptions about scale-dependent recovery. These results establish practical guidance for financial institutions deploying LGD models under Basel III requirements when representative outcome data is unavailable at sufficient scale. The findings generalize to medical outcomes research, climate forecasting, and technology reliability-domains where extended observation periods create unavoidable mixture structure in training data.
format Preprint
id arxiv_https___arxiv_org_abs_2511_11596
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Loss Given Default Prediction Under Measurement-Induced Mixture Distributions: An Information-Theoretic Approach
Marín, Javier
Machine Learning
Artificial Intelligence
Loss Given Default (LGD) modeling faces a fundamental data quality constraint: 90% of available training data consists of proxy estimates based on pre-distress balance sheets rather than actual recovery outcomes from completed bankruptcy proceedings. We demonstrate that this mixture-contaminated training structure causes systematic failure of recursive partitioning methods, with Random Forest achieving negative r-squared (-0.664, worse than predicting the mean) on held-out test data. Information-theoretic approaches based on Shannon entropy and mutual information provide superior generalization, achieving r-squared of 0.191 and RMSE of 0.284 on 1,218 corporate bankruptcies (1980-2023). Analysis reveals that leverage-based features contain 1.510 bits of mutual information while size effects contribute only 0.086 bits, contradicting regulatory assumptions about scale-dependent recovery. These results establish practical guidance for financial institutions deploying LGD models under Basel III requirements when representative outcome data is unavailable at sufficient scale. The findings generalize to medical outcomes research, climate forecasting, and technology reliability-domains where extended observation periods create unavoidable mixture structure in training data.
title Loss Given Default Prediction Under Measurement-Induced Mixture Distributions: An Information-Theoretic Approach
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2511.11596