Bibliographic Details
Main Authors: Bai, Yatong; Garg, Utsav; Shanker, Apaar; Zhang, Haoming; Parajuli, Samyak; Bas, Erhan; Filipovic, Isidora; Chu, Amelia N.; Fomitcheva, Eugenia D; Branson, Elliot; Kim, Aerin; Sojoudi, Somayeh; Cho, Kyunghyun
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access: https://arxiv.org/abs/2401.04575
_version_ 1866929265459593216
author Bai, Yatong
Garg, Utsav
Shanker, Apaar
Zhang, Haoming
Parajuli, Samyak
Bas, Erhan
Filipovic, Isidora
Chu, Amelia N.
Fomitcheva, Eugenia D
Branson, Elliot
Kim, Aerin
Sojoudi, Somayeh
Cho, Kyunghyun
contents Vision and vision-language applications of neural networks, such as image classification and captioning, rely on large-scale annotated datasets that require non-trivial data-collecting processes. This time-consuming endeavor hinders the emergence of large-scale datasets, limiting researchers and practitioners to a small number of choices. Therefore, we seek more efficient ways to collect and annotate images. Previous initiatives have gathered captions from HTML alt-texts and crawled social media postings, but these data sources suffer from noise, sparsity, or subjectivity. For this reason, we turn to commercial shopping websites whose data meet three criteria: cleanliness, informativeness, and fluency. We introduce the Let's Go Shopping (LGS) dataset, a large-scale public dataset with 15 million image-caption pairs from publicly available e-commerce websites. When compared with existing general-domain datasets, the LGS images focus on the foreground object and have less complex backgrounds. Our experiments on LGS show that the classifiers trained on existing benchmark datasets do not readily generalize to e-commerce data, while specific self-supervised visual feature extractors can better generalize. Furthermore, LGS's high-quality e-commerce-focused images and bimodal nature make it advantageous for vision-language tasks: LGS enables image-captioning models to generate richer captions and helps text-to-image generation models achieve e-commerce style transfer.
format Preprint
id arxiv_https___arxiv_org_abs_2401_04575
institution arXiv
publishDate 2024
record_format arxiv
title Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2401.04575