| Main Authors: | Tóth, Sándor; Wilson, Stephen; Tsoukara, Alexia; Moreu, Enric; Masalovich, Anton; Roemheld, Lars |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2403.11593 |
| _version_ | 1866913269725265920 |
|---|---|
| author | Tóth, Sándor; Wilson, Stephen; Tsoukara, Alexia; Moreu, Enric; Masalovich, Anton; Roemheld, Lars |
| author_facet | Tóth, Sándor; Wilson, Stephen; Tsoukara, Alexia; Moreu, Enric; Masalovich, Anton; Roemheld, Lars |
| contents | Product matching, the task of identifying different representations of the same product for better discoverability, curation, and pricing, is a key capability for online marketplace and e-commerce companies. We present a robust multi-modal product matching system in an industry setting, where large datasets, data distribution shifts, and unseen domains pose challenges. We compare different approaches and conclude that a relatively straightforward projection of pretrained image and text encoders, trained through contrastive learning, yields state-of-the-art results while balancing cost and performance. Our solution outperforms single-modality matching systems and large pretrained models, such as CLIP. Furthermore, we show how a human-in-the-loop process can be combined with model-based predictions to achieve near-perfect precision in a production system. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2403_11593 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | End-to-end multi-modal product matching in fashion e-commerce |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2403.11593 |
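
The abstract describes projecting pretrained image and text encoders into a shared space and training the projection with contrastive learning. The record gives no implementation details, so the following is only a minimal sketch of that general idea under stated assumptions: the linear projections, the fusion by summation, the InfoNCE-style loss, the temperature value, and all function names (`project`, `embed`, `info_nce`) are hypothetical choices for illustration, not the authors' method. Random vectors stand in for the outputs of frozen encoders.

```python
import math
import random


def project(features, weights):
    """Linear projection: features (n x d_in) times weights (d_in x d_out)."""
    columns = list(zip(*weights))
    return [[sum(f * w for f, w in zip(row, col)) for col in columns]
            for row in features]


def normalize(rows):
    """Scale every row to unit L2 norm so dot products are cosine similarities."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out


def embed(image_feats, text_feats, w_image, w_text):
    """Project both modalities into one space and fuse by summation (assumption)."""
    img = project(image_feats, w_image)
    txt = project(text_feats, w_text)
    fused = [[a + b for a, b in zip(i, t)] for i, t in zip(img, txt)]
    return normalize(fused)


def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy where positives[i] is the only match for anchors[i];
    every other row in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(anchor, p)) / temperature
                  for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


# Stand-ins for frozen encoder outputs of 4 product offers (assumed sizes).
image_feats, text_feats = rand_matrix(4, 8), rand_matrix(4, 6)
# Trainable projection matrices; here only randomly initialised.
w_image, w_text = rand_matrix(8, 5), rand_matrix(6, 5)

offers = embed(image_feats, text_feats, w_image, w_text)
loss_matched = info_nce(offers, offers)           # each offer paired with itself
loss_mismatched = info_nce(offers, offers[::-1])  # pairs deliberately misaligned
```

In a real training loop the two inputs to `info_nce` would be embeddings of two different offers of the same product, and the projection weights would be updated by gradient descent to lower the loss. The comparison of correctly aligned and misaligned pairs only illustrates that the loss rewards matching pairs.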