Bibliographic Details
Main Authors: Le, Benjamin, Lu, Xueying, Stern, Nick, Liu, Wenqiong, Lapchuk, Igor, Li, Xiang, Zheng, Baofen, Rosenberg, Kevin, Huang, Jiewen, Zhang, Zhe, Cabangbang, Abraham, Wagle, Satej Milind, Shen, Jianqiang, Muthuregunathan, Raghavan, Gupta, Abhinav, Teoh, Mathew, Kirk, Andrew, Kwan, Thomas, Wu, Jingwei, Zhang, Wenjing
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2602.07840
author Le, Benjamin
Lu, Xueying
Stern, Nick
Liu, Wenqiong
Lapchuk, Igor
Li, Xiang
Zheng, Baofen
Rosenberg, Kevin
Huang, Jiewen
Zhang, Zhe
Cabangbang, Abraham
Wagle, Satej Milind
Shen, Jianqiang
Muthuregunathan, Raghavan
Gupta, Abhinav
Teoh, Mathew
Kirk, Andrew
Kwan, Thomas
Wu, Jingwei
Zhang, Wenjing
contents Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-constrained human oversight and the high-throughput requirements of production systems. While traditional approaches rely on engagement proxies or sparse manual review, these methods often fail to capture the full scope of high-impact relevance failures. We present SAGE (Scalable AI Governance & Evaluation), a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. At the core of SAGE is a bidirectional calibration loop where natural-language Policy, curated Precedent, and an LLM Surrogate Judge co-evolve. SAGE systematically resolves semantic ambiguities and misalignments, transforming subjective relevance judgment into an executable, multi-dimensional rubric with near human-level agreement. To bridge the gap between frontier model reasoning and industrial-scale inference, we apply teacher-student distillation to transfer high-fidelity judgments into compact student surrogates at 92× lower cost. Deployed within LinkedIn Search ecosystems, SAGE guided model iteration through simulation-driven development, distilling policy-aligned models for online serving and enabling rapid offline evaluation. In production, it powered policy oversight that measured ramped model variants and detected regressions invisible to engagement metrics. Collectively, these drove a 0.25% lift in LinkedIn daily active users.
format Preprint
id arxiv_https___arxiv_org_abs_2602_07840
institution arXiv
publishDate 2026
record_format arxiv
title SAGE: Scalable AI Governance & Evaluation
topic Information Retrieval
Artificial Intelligence
url https://arxiv.org/abs/2602.07840