Bibliographic Details
Main Authors: Atanasov, Alexander, Zavatone-Veth, Jacob A., Pehlevan, Cengiz
Format: Preprint
Published: 2024
Subjects: Machine Learning; Disordered Systems and Neural Networks
Online Access: https://arxiv.org/abs/2408.04607
author Atanasov, Alexander
Zavatone-Veth, Jacob A.
Pehlevan, Cengiz
contents Recent years have seen substantial advances in our understanding of high-dimensional ridge regression, but existing theories assume that training examples are independent. By leveraging techniques from random matrix theory and free probability, we provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations. We demonstrate that in this setting, the generalized cross validation estimator (GCV) fails to correctly predict the out-of-sample risk. However, in the case where the noise residuals have the same correlations as the data points, one can modify the GCV to yield an efficiently computable unbiased estimator that concentrates in the high-dimensional limit, which we dub CorrGCV. We further extend our asymptotic analysis to the case where the test point has nontrivial correlations with the training set, a setting often encountered in time series forecasting. Assuming knowledge of the correlation structure of the time series, this again yields an extension of the GCV estimator, and sharply characterizes the degree to which such test points yield an overly optimistic prediction of long-time risk. We validate the predictions of our theory across a variety of high-dimensional data.
format Preprint
id arxiv_https___arxiv_org_abs_2408_04607
institution arXiv
publishDate 2024
record_format arxiv
title Risk and cross validation in ridge regression with correlated samples
topic Machine Learning
Disordered Systems and Neural Networks
url https://arxiv.org/abs/2408.04607
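note The abstract refers to the generalized cross validation (GCV) estimator for ridge regression. As a point of reference only, the following is a minimal sketch of the classical GCV formula, which assumes independent training samples. It is not the CorrGCV correction proposed in the preprint, whose formula is not given in this record. The function name `ridge_gcv`, the objective scaling (1/n)·||y − Xw||² + lam·||w||², and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def ridge_gcv(X, y, lam):
    """Classical GCV estimate of out-of-sample risk for ridge regression.

    Assumes independent samples and the objective
    (1/n) * ||y - X w||^2 + lam * ||w||^2 (an illustrative convention).
    """
    n, d = X.shape
    # Ridge solution: w = (X^T X + n*lam*I)^{-1} X^T y
    gram = X.T @ X + n * lam * np.eye(d)
    w = np.linalg.solve(gram, X.T @ y)

    # In-sample (training) mean squared error
    train_risk = np.mean((y - X @ w) ** 2)

    # Effective degrees of freedom: trace of the hat matrix
    # S = X (X^T X + n*lam*I)^{-1} X^T, computed as tr(gram^{-1} X^T X)
    df = np.trace(np.linalg.solve(gram, X.T @ X))

    # GCV rescales the training error by (1 - df/n)^{-2}
    gcv = train_risk / (1.0 - df / n) ** 2
    return gcv, train_risk, w


# Synthetic example with independent Gaussian rows (hypothetical data)
rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d) / np.sqrt(d)
y = X @ w_true + 0.5 * rng.standard_normal(n)

gcv, train_risk, w_hat = ridge_gcv(X, y, lam=0.1)
print(f"training risk: {train_risk:.4f}, GCV estimate: {gcv:.4f}")
```

Because the degrees of freedom are strictly between 0 and n here, the GCV estimate is always larger than the training risk. According to the abstract, this uncorrected estimator is the one that fails to predict out-of-sample risk once the samples are correlated.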