Bibliographic Details
Main Authors: Huang, Tzu-Heng; Cao, Catherine; Schoenberg, Spencer; Vishwakarma, Harit; Roberts, Nicholas; Sala, Frederic
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2502.12366
author Huang, Tzu-Heng
Cao, Catherine
Schoenberg, Spencer
Vishwakarma, Harit
Roberts, Nicholas
Sala, Frederic
contents Weak supervision is a popular framework for overcoming the labeled data bottleneck: the need to obtain labels for training data. In weak supervision, multiple noisy-but-cheap sources are used to provide guesses of the label and are aggregated to produce high-quality pseudolabels. These sources are often expressed as small programs written by domain experts -- and so are expensive to obtain. Instead, we argue for using code-generation models to act as coding assistants for crafting weak supervision sources. We study prompting strategies to maximize the quality of the generated sources, settling on a multi-tier strategy that incorporates multiple types of information. We explore how to best combine hand-written and generated sources. Using these insights, we introduce ScriptoriumWS, a weak supervision system that, when compared to hand-crafted sources, maintains accuracy and greatly improves coverage.
format Preprint
id arxiv_https___arxiv_org_abs_2502_12366
institution arXiv
publishDate 2025
record_format arxiv
title ScriptoriumWS: A Code Generation Assistant for Weak Supervision
topic Machine Learning
url https://arxiv.org/abs/2502.12366
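note The abstract describes weak supervision sources as small programs whose noisy votes are aggregated into pseudolabels, and it reports results in terms of accuracy and coverage. The sketch below illustrates that general idea only. It is not ScriptoriumWS or the authors' code: the spam/ham task, the three labeling functions, and the use of a plain majority vote as the aggregator are all assumptions made for illustration, and a real system would typically use a learned label model instead.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical label scheme for an illustrative spam-detection task.
ABSTAIN, HAM, SPAM = -1, 0, 1

LabelingFunction = Callable[[str], int]


# Each source is a small program that votes on a label or abstains.
def lf_contains_link(text: str) -> int:
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_asks_to_subscribe(text: str) -> int:
    return SPAM if "subscribe" in text.lower() else ABSTAIN


def lf_short_message(text: str) -> int:
    return HAM if len(text.split()) <= 3 else ABSTAIN


def apply_lfs(lfs: List[LabelingFunction], texts: List[str]) -> List[List[int]]:
    """Return one row of votes per text, one column per labeling function."""
    return [[lf(t) for lf in lfs] for t in texts]


def majority_vote(votes: List[int]) -> int:
    """Aggregate non-abstaining votes; abstain on no votes or on a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def coverage(vote_matrix: List[List[int]]) -> float:
    """Fraction of examples that receive at least one non-abstain vote."""
    if not vote_matrix:
        return 0.0
    covered = sum(1 for votes in vote_matrix if any(v != ABSTAIN for v in votes))
    return covered / len(vote_matrix)


LFS = [lf_contains_link, lf_asks_to_subscribe, lf_short_message]
TEXTS = [
    "Subscribe now at http://example.com",
    "Great video",
    "I really enjoyed the second half of this song",
    "subscribe please",
]

VOTES = apply_lfs(LFS, TEXTS)
PSEUDOLABELS = [majority_vote(row) for row in VOTES]
```

In this toy setup, coverage counts an example as covered when any source votes on it, so adding more sources (hand-written or generated) can only keep coverage the same or raise it, while the accuracy of the aggregated pseudolabels depends on how noisy the added sources are.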