Bibliographic Details
Main Authors: Tóth, Sándor; Wilson, Stephen; Tsoukara, Alexia; Moreu, Enric; Masalovich, Anton; Roemheld, Lars
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2403.11593
author Tóth, Sándor
Wilson, Stephen
Tsoukara, Alexia
Moreu, Enric
Masalovich, Anton
Roemheld, Lars
contents Product matching, the task of identifying different representations of the same product for better discoverability, curation, and pricing, is a key capability for online marketplace and e-commerce companies. We present a robust multi-modal product matching system in an industry setting, where large datasets, data distribution shifts, and unseen domains pose challenges. We compare different approaches and conclude that a relatively straightforward projection of pretrained image and text encoders, trained through contrastive learning, yields state-of-the-art results while balancing cost and performance. Our solution outperforms single-modality matching systems and large pretrained models such as CLIP. Furthermore, we show how a human-in-the-loop process can be combined with model-based predictions to achieve near-perfect precision in a production system.
format Preprint
id arxiv_https___arxiv_org_abs_2403_11593
institution arXiv
publishDate 2024
record_format arxiv
title End-to-end multi-modal product matching in fashion e-commerce
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2403.11593
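
The abstract describes a projection of pretrained image and text encoder outputs trained through contrastive learning. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes the encoder features are already computed, joins them by concatenation, applies a single linear projection, and scores a batch with a symmetric InfoNCE-style loss. The function names (`project`, `info_nce`, `symmetric_contrastive_loss`), the concatenation step, and the temperature value are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def project(image_feat, text_feat, weights):
    """Hypothetical projection head: concatenate the (frozen) image and text
    features, apply a linear map, and L2-normalize the result.

    weights is a list of output rows, each as long as the concatenated input.
    """
    joint = list(image_feat) + list(text_feat)
    return l2_normalize([sum(w * x for w, x in zip(row, joint)) for row in weights])


def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy in which anchors[i] should match positives[i]
    and every other row of positives acts as an in-batch negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [sum(a * p for a, p in zip(anchor, pos)) / temperature
                  for pos in positives]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def symmetric_contrastive_loss(view_a, view_b, temperature=0.1):
    """Average the loss over both matching directions (a to b and b to a)."""
    return 0.5 * (info_nce(view_a, view_b, temperature)
                  + info_nce(view_b, view_a, temperature))


if __name__ == "__main__":
    # Toy batch: row i of each view stands for the same product.
    identity = [[1.0, 0.0], [0.0, 1.0]]
    view_a = [project([1.0], [0.0], identity), project([0.0], [1.0], identity)]
    view_b = [project([0.9], [0.1], identity), project([0.2], [0.8], identity)]
    print("matched pairs loss:   ", symmetric_contrastive_loss(view_a, view_b))
    print("mismatched pairs loss:", symmetric_contrastive_loss(view_a, view_b[::-1]))
```

In this sketch the loss is low when row i of one view is most similar to row i of the other, and high when the pairing is scrambled. In a real system the projection weights would be learned by gradient descent with an autodiff framework; that training loop is omitted here.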