MARC21: :: Library Catalog

Bibliographic Details
Main Authors:	Gorriz, Juan M; Ramirez, J.; Segovia, F.; Martinez-Murcia, F. J.; Jiménez-Mesa, C.; Suckling, J.
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Statistics Theory; Computation
Online Access:	https://arxiv.org/abs/2402.15213

_version_	1866912359245676544
author	Gorriz, Juan M; Ramirez, J.; Segovia, F.; Martinez-Murcia, F. J.; Jiménez-Mesa, C.; Suckling, J.
author_facet	Gorriz, Juan M; Ramirez, J.; Segovia, F.; Martinez-Murcia, F. J.; Jiménez-Mesa, C.; Suckling, J.
contents	Regression analysis is a central topic in statistical modeling, aimed at estimating the relationships between a dependent variable, commonly referred to as the response variable, and one or more independent variables, i.e., explanatory variables. Linear regression is by far the most popular method for performing this task in various fields of research, such as data integration and predictive modeling when combining information from multiple sources. Classical methods for solving linear regression problems, such as Ordinary Least Squares (OLS), Ridge, or Lasso regressions, often form the foundation for more advanced machine learning (ML) techniques, which have been successfully applied, though without a formal definition of statistical significance. At most, permutation or analyses based on empirical measures (e.g., residuals or accuracy) have been conducted, leveraging the greater sensitivity of ML estimations for detection. In this paper, we introduce Statistical Agnostic Regression (SAR) for evaluating the statistical significance of ML-based linear regression models. This is achieved by analyzing concentration inequalities of the actual risk (expected loss) and considering the worst-case scenario. To this end, we define a threshold that ensures there is sufficient evidence, with a probability of at least $1-η$, to conclude the existence of a linear relationship in the population between the explanatory (feature) and the response (label) variables. Simulations demonstrate the ability of the proposed agnostic (non-parametric) test to provide an analysis of variance similar to the classical multivariate $F$-test for the slope parameter, without relying on the underlying assumptions of classical methods. Moreover, the residuals computed from this method represent a trade-off between those obtained from ML approaches and the classical OLS.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_15213
institution	arXiv
publishDate	2024
record_format	arxiv
title	Statistical Agnostic Regression: a machine learning method to validate regression models
topic	Machine Learning; Statistics Theory; Computation
url	https://arxiv.org/abs/2402.15213
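
The abstract in the contents field describes a test that bounds the actual risk (expected loss) with a concentration inequality and decides on the worst case, with probability at least $1-η$. The record does not give the bound itself, so the following is only a minimal sketch of the general idea, assuming a one-sided Hoeffding bound on a held-out split and a clipped absolute loss. It is not the SAR threshold from the preprint. The names fit_line, hoeffding_margin and worst_case_test, and the parameters eta, loss_max and seed, are hypothetical and chosen for illustration.

```python
import math
import random


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a + b * x (single feature)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def hoeffding_margin(n, eta, loss_max):
    """One-sided Hoeffding margin for n i.i.d. losses bounded in [0, loss_max]."""
    return loss_max * math.sqrt(math.log(1.0 / eta) / (2.0 * n))


def worst_case_test(xs, ys, eta=0.05, loss_max=1.0, seed=0):
    """Return True if the linear fit beats the mean-only model in the worst case.

    Assumed procedure (not taken from the preprint): fit on one half of the
    data, then compare an upper bound on the linear model's risk with a lower
    bound on the mean-only model's risk, both computed on the other half.
    """
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]

    a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
    y_bar = sum(ys[i] for i in train) / len(train)

    def clipped(err):
        # Clipping keeps the loss bounded, which Hoeffding's inequality requires.
        return min(abs(err), loss_max)

    model_risk = sum(clipped(ys[i] - (a + b * xs[i])) for i in test) / len(test)
    null_risk = sum(clipped(ys[i] - y_bar) for i in test) / len(test)

    # Union bound: each one-sided statement fails with probability <= eta / 2,
    # so both hold together with probability >= 1 - eta.
    margin = hoeffding_margin(len(test), eta / 2.0, loss_max)
    return model_risk + margin < null_risk - margin
```

Under these assumptions, a True result means that, with probability at least $1-η$ over the held-out sample, the fitted line has lower expected clipped loss than a constant predictor. The held-out split is a design choice: it keeps the fitted model fixed with respect to the evaluation data, so the plain Hoeffding bound applies without a complexity term.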

Similar Items