Staff View: Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective :: Library Catalog

Bibliographic Details
Main Authors:	Osooli, Hamid; Batool, Kareema; Gentry, Rick; Roy, Tiasa Singha; Gupta, Ashwin; Ramesh, Anirudha
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.25077

_version_	1866914512987226112
author	Osooli, Hamid; Batool, Kareema; Gentry, Rick; Roy, Tiasa Singha; Gupta, Ashwin; Ramesh, Anirudha
author_facet	Osooli, Hamid; Batool, Kareema; Gentry, Rick; Roy, Tiasa Singha; Gupta, Ashwin; Ramesh, Anirudha
contents	Weak-to-strong alignment offers a promising route to scalable supervision, but it can fail when a strong model becomes confidently wrong on examples that lie in the weak teacher's blind spots. Understanding such failures requires going beyond aggregate accuracy, since weak-to-strong errors depend not only on whether the strong model disagrees with its teacher, but also on how confidence and uncertainty are distributed across examples. In this work, we analyze weak-to-strong alignment through a bias-variance-covariance lens that connects misfit theory to practical post-training pipelines. We derive a misfit-based upper bound on weak-to-strong population risk and study its empirical components using continuous confidence scores. We evaluate four weak-to-strong pipelines spanning supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and reinforcement learning from AI feedback (RLAIF) on the PKU-SafeRLHF and HH-RLHF datasets. Using a blind-spot deception metric that isolates cases where the strong model is confidently wrong while the weak model is uncertain, we find that strong-model variance is the strongest empirical predictor of deception across our settings. Covariance provides additional but weaker information, indicating that weak-strong dependence matters, but does not by itself explain the observed failures. These results suggest that strong-model variance can serve as an early-warning signal for weak-to-strong deception, while blind-spot evaluation helps distinguish whether failures are inherited from weak supervision or arise in regions of weak-model uncertainty.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_25077
institution	arXiv
publishDate	2026
record_format	arxiv
title	Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective
topic	Artificial Intelligence
url	https://arxiv.org/abs/2604.25077

Similar Items