| Main Authors: | Huang, Tzu-Heng; Cao, Catherine; Schoenberg, Spencer; Vishwakarma, Harit; Roberts, Nicholas; Sala, Frederic |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning |
| Online Access: | https://arxiv.org/abs/2502.12366 |
| _version_ | 1866913695547785216 |
|---|---|
| author | Huang, Tzu-Heng; Cao, Catherine; Schoenberg, Spencer; Vishwakarma, Harit; Roberts, Nicholas; Sala, Frederic |
| author_facet | Huang, Tzu-Heng; Cao, Catherine; Schoenberg, Spencer; Vishwakarma, Harit; Roberts, Nicholas; Sala, Frederic |
| contents | Weak supervision is a popular framework for overcoming the labeled data bottleneck: the need to obtain labels for training data. In weak supervision, multiple noisy-but-cheap sources are used to provide guesses of the label and are aggregated to produce high-quality pseudolabels. These sources are often expressed as small programs written by domain experts -- and so are expensive to obtain. Instead, we argue for using code-generation models to act as coding assistants for crafting weak supervision sources. We study prompting strategies to maximize the quality of the generated sources, settling on a multi-tier strategy that incorporates multiple types of information. We explore how to best combine hand-written and generated sources. Using these insights, we introduce ScriptoriumWS, a weak supervision system that, when compared to hand-crafted sources, maintains accuracy and greatly improves coverage. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2502_12366 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | ScriptoriumWS: A Code Generation Assistant for Weak Supervision |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2502.12366 |