Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Li, Xianming, Li, Jing
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2401.05883

_version_	1866909333352087552
author	Li, Xianming Li, Jing
author_facet	Li, Xianming Li, Jing
contents	Social media data is noisy and therefore highly redundant, which increases training time and introduces model bias. To address this issue, we propose a novel Generative Deduplication framework that selects social media data by removing semantically duplicate samples. While related work performs data selection within task-specific training, our model acts as an efficient pre-processing method that can be applied across social media NLP pipelines. Specifically, we train a generative model via self-supervised learning to predict a keyword that captures the semantics of noisy social media text, and use it for deduplication. Meanwhile, time-dimensional Gaussian noise is added to increase training complexity and prevent the model from learning trivial features. Extensive experiments suggest that our model reduces training samples more effectively than baselines while also improving performance. The results show our model's potential to broadly advance social media language understanding in both effectiveness and efficiency.
format	Preprint
id	arxiv_https___arxiv_org_abs_2401_05883
institution	arXiv
publishDate	2024
record_format	arxiv
title	Generative Deduplication For Social Media Data Selection
topic	Computation and Language
url	https://arxiv.org/abs/2401.05883

Similar Items