Saved in:
Bibliographic Details
Main Authors: Zhang, Erica, Sagan, Naomi, Tse, Danny, Zhang, Fangzhao, Pilanci, Mert, Blanchet, Jose
Format: Preprint
Published: 2026
Subjects:
Online Access:https://arxiv.org/abs/2601.21410
Tags: Add Tag
No Tags, Be the first to tag this record!
Table of Contents:
  • Large language models (LLMs) encode rich semantic knowledge that can be useful for supervised learning, but their outputs are unreliable as statistical priors: they may be noisy, misspecified, or hallucinated. Existing LLM-informed learning methods either trust such signals directly, leaving predictions vulnerable to unreliable LLM guidance, or restrict semantic integration to a single model class. We introduce Statsformer, a validated framework for learning when to trust LLM-derived semantic priors in supervised statistical learning. Statsformer maps LLM-derived feature scores into a family of learner-specific prior-injection mechanisms across a heterogeneous library of linear and nonlinear predictors. It then uses out-of-fold validation to adaptively calibrate the influence of each prior-informed learner, allowing useful semantic information to improve prediction while attenuating weak, misspecified, or adversarial priors. This yields a guardrailed statistical learning system with an oracle-style guarantee: up to statistical error, the final predictor performs no worse than the best convex combination of its in-library candidates, including prior-free learners. Across diverse prediction tasks, informative LLM priors improve performance, while unreliable priors are automatically downweighted. These results position Statsformer as a reliability-oriented approach to LLM-informed statistical learning: rather than trusting LLM knowledge directly, it validates semantic priors against data before allowing them to influence the final predictor.