Bibliographic Details
Main Authors:	Zhang, Erica; Sagan, Naomi; Tse, Danny; Zhang, Fangzhao; Pilanci, Mert; Blanchet, Jose
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2601.21410

Abstract:

Large language models (LLMs) encode rich semantic knowledge that can be useful for supervised learning, but their outputs are unreliable as statistical priors: they may be noisy, misspecified, or hallucinated. Existing LLM-informed learning methods either trust such signals directly, leaving predictions vulnerable to unreliable LLM guidance, or restrict semantic integration to a single model class. We introduce Statsformer, a validated framework for learning when to trust LLM-derived semantic priors in supervised statistical learning. Statsformer maps LLM-derived feature scores into a family of learner-specific prior-injection mechanisms across a heterogeneous library of linear and nonlinear predictors. It then uses out-of-fold validation to adaptively calibrate the influence of each prior-informed learner, allowing useful semantic information to improve prediction while attenuating weak, misspecified, or adversarial priors. This yields a guardrailed statistical learning system with an oracle-style guarantee: up to statistical error, the final predictor performs no worse than the best convex combination of its in-library candidates, including prior-free learners. Across diverse prediction tasks, informative LLM priors improve performance, while unreliable priors are automatically downweighted. These results position Statsformer as a reliability-oriented approach to LLM-informed statistical learning: rather than trusting LLM knowledge directly, it validates semantic priors against data before allowing them to influence the final predictor.
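
The validation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the per-feature ridge penalty `lam / score` is only one assumed way to inject feature scores into a linear learner, the Frank-Wolfe solver for the convex weights is a stand-in choice, and the data, the scores, and all function names are hypothetical. The sketch builds a small library containing a prior-free learner, a learner with an informative prior, and a learner with a misleading prior, then fits convex combination weights on out-of-fold predictions.

```python
import numpy as np

def fit_weighted_ridge(X, y, penalty):
    """Ridge with per-feature penalties: argmin ||y - Xb||^2 + sum_j penalty_j * b_j^2."""
    return np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ y)

def out_of_fold_predictions(X, y, penalties, n_folds=5, seed=0):
    """Return an (n_samples, n_candidates) matrix of held-out predictions."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    oof = np.zeros((n, len(penalties)))
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        for k, penalty in enumerate(penalties):
            beta = fit_weighted_ridge(X[train], y[train], penalty)
            oof[held_out, k] = X[held_out] @ beta
    return oof

def convex_weights(oof, y, n_iter=200):
    """Frank-Wolfe with exact line search for min_w ||y - oof @ w||^2 on the simplex."""
    k = oof.shape[1]
    w = np.zeros(k)
    # Start at the best single candidate so the combination can only match or improve on it.
    w[np.argmin(((oof - y[:, None]) ** 2).mean(axis=0))] = 1.0
    for _ in range(n_iter):
        residual = oof @ w - y
        grad = oof.T @ residual              # gradient of 0.5 * ||oof @ w - y||^2
        vertex = np.zeros(k)
        vertex[np.argmin(grad)] = 1.0        # best simplex vertex for the linearized loss
        step_dir = oof @ (vertex - w)
        denom = step_dir @ step_dir
        if denom < 1e-12:                    # already at the chosen vertex
            break
        gamma = np.clip(-(residual @ step_dir) / denom, 0.0, 1.0)
        w = (1.0 - gamma) * w + gamma * vertex
    return w

# Synthetic data (assumption): only the first 5 of 20 features carry signal.
rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0
y = X @ beta_true + rng.normal(size=n)

# Hypothetical feature scores in (0, 1]: one informative set, one misleading set.
informative = np.arange(p) < 5
scores_good = np.where(informative, 0.9, 0.1)
scores_bad = np.where(informative, 0.1, 0.9)

lam = 10.0
penalties = [
    lam * np.ones(p),      # prior-free ridge
    lam / scores_good,     # high-scored features are penalized less
    lam / scores_bad,      # misleading prior shrinks the true signal hardest
]

oof = out_of_fold_predictions(X, y, penalties)
weights = convex_weights(oof, y)
print("convex weights (prior-free, good prior, bad prior):", np.round(weights, 3))
```

Because the weights are chosen on held-out predictions and the solver starts from the best single candidate, the combined out-of-fold error in this sketch cannot exceed that of any individual learner in the library, including the prior-free one. This mirrors, in a toy setting only, the guardrail property the abstract describes.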

Similar Items