Staff View: ScriptoriumWS: A Code Generation Assistant for Weak Supervision :: Library Catalog

Bibliographic Details
Main Authors:	Huang, Tzu-Heng; Cao, Catherine; Schoenberg, Spencer; Vishwakarma, Harit; Roberts, Nicholas; Sala, Frederic
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2502.12366

_version_	1866913695547785216
author	Huang, Tzu-Heng; Cao, Catherine; Schoenberg, Spencer; Vishwakarma, Harit; Roberts, Nicholas; Sala, Frederic
author_facet	Huang, Tzu-Heng; Cao, Catherine; Schoenberg, Spencer; Vishwakarma, Harit; Roberts, Nicholas; Sala, Frederic
contents	Weak supervision is a popular framework for overcoming the labeled data bottleneck: the need to obtain labels for training data. In weak supervision, multiple noisy-but-cheap sources are used to provide guesses of the label and are aggregated to produce high-quality pseudolabels. These sources are often expressed as small programs written by domain experts -- and so are expensive to obtain. Instead, we argue for using code-generation models to act as coding assistants for crafting weak supervision sources. We study prompting strategies to maximize the quality of the generated sources, settling on a multi-tier strategy that incorporates multiple types of information. We explore how to best combine hand-written and generated sources. Using these insights, we introduce ScriptoriumWS, a weak supervision system that, when compared to hand-crafted sources, maintains accuracy and greatly improves coverage.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_12366
institution	arXiv
publishDate	2025
record_format	arxiv
title	ScriptoriumWS: A Code Generation Assistant for Weak Supervision
topic	Machine Learning
url	https://arxiv.org/abs/2502.12366
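
The abstract in the contents field describes weak supervision as aggregating several noisy, cheap sources into pseudolabels, and reports results in terms of accuracy and coverage. The sketch below illustrates that general idea only. The spam/ham task, the three labeling functions, and all names are hypothetical, and plain majority voting stands in for whatever aggregation ScriptoriumWS actually uses, which the record does not specify.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical label values for an illustrative spam-detection task.
ABSTAIN, HAM, SPAM = -1, 0, 1

LabelingFunction = Callable[[str], int]


def lf_contains_link(text: str) -> int:
    """Noisy source: messages with links are guessed to be spam."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text: str) -> int:
    """Noisy source: messages mentioning a prize are guessed to be spam."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text: str) -> int:
    """Noisy source: very short messages are guessed to be ham."""
    return HAM if len(text.split()) <= 4 else ABSTAIN


def majority_vote(votes: Sequence[int]) -> int:
    """Aggregate votes, ignoring abstentions; abstain on ties or no votes."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def pseudolabel(texts: Sequence[str], lfs: Sequence[LabelingFunction]) -> List[int]:
    """Apply every labeling function to every text and aggregate per text."""
    return [majority_vote([lf(t) for lf in lfs]) for t in texts]


def coverage(labels: Sequence[int]) -> float:
    """Fraction of examples that received a non-abstain pseudolabel."""
    return sum(l != ABSTAIN for l in labels) / len(labels) if labels else 0.0


if __name__ == "__main__":
    texts = [
        "Claim your prize at http://example.com now please",
        "see you soon",
        "The quarterly report is attached for your review today",
    ]
    lfs = [lf_contains_link, lf_mentions_prize, lf_short_message]
    labels = pseudolabel(texts, lfs)
    print(labels, coverage(labels))
```

In this toy setup, adding more labeling functions (hand-written or generated) can only reduce the number of examples on which every source abstains, which is the sense in which extra sources can raise coverage.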
