Staff View: Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data :: Library Catalog

Bibliographic Details
Main Authors:	Schelb, Julian; Ulloa, Roberto; Spitz, Andreas
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2407.16516

_version_	1866913442525347840
author	Schelb, Julian; Ulloa, Roberto; Spitz, Andreas
author_facet	Schelb, Julian; Ulloa, Roberto; Spitz, Andreas
contents	Researchers in the political and social sciences often rely on classification models to analyze trends in information consumption by examining browsing histories of millions of webpages. Automated scalable methods are necessary due to the impracticality of manual labeling. In this paper, we model the detection of topic-related content as a binary classification task and compare the accuracy of fine-tuned pre-trained encoder models against in-context learning strategies. Using only a few hundred annotated data points per topic, we detect content related to three German policies in a database of scraped webpages. We compare multilingual and monolingual models, as well as zero and few-shot approaches, and investigate the impact of negative sampling strategies and the combination of URL & content-based features. Our results show that a small sample of annotated data is sufficient to train an effective classifier. Fine-tuning encoder-based models yields better results than in-context learning. Classifiers using both URL & content-based features perform best, while using URLs alone provides adequate results when content is unavailable.
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_16516
institution	arXiv
publishDate	2024
record_format	arxiv
title	Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data
topic	Computation and Language
url	https://arxiv.org/abs/2407.16516

Similar Items