Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Hu, Xiaowan; Chen, Yiyi; Li, Yan; Wang, Minquan; Wang, Haoqian; Chen, Quan; Li, Han; Jiang, Peng
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Multimedia
Online Access:	https://arxiv.org/abs/2407.16248

_version_	1866913457362698240
author	Hu, Xiaowan; Chen, Yiyi; Li, Yan; Wang, Minquan; Wang, Haoqian; Chen, Quan; Li, Han; Jiang, Peng
author_facet	Hu, Xiaowan; Chen, Yiyi; Li, Yan; Wang, Minquan; Wang, Haoqian; Chen, Quan; Li, Han; Jiang, Peng
contents	With the rapid expansion of e-commerce, more consumers have become accustomed to making purchases via livestreaming. Accurately identifying the products being sold by salespeople, i.e., livestreaming product retrieval (LPR), poses a fundamental and daunting challenge. The LPR task encompasses three primary dilemmas in real-world scenarios: 1) the recognition of intended products from distractor products present in the background; 2) the video-image heterogeneity that the appearance of products showcased in live streams often deviates substantially from standardized product images in stores; 3) there are numerous confusing products with subtle visual nuances in the shop. To tackle these challenges, we propose the Spatiotemporal Graphing Multi-modal Network (SGMN). First, we employ a text-guided attention mechanism that leverages the spoken content of salespeople to guide the model to focus toward intended products, emphasizing their salience over cluttered background products. Second, a long-range spatiotemporal graph network is further designed to achieve both instance-level interaction and frame-level matching, solving the misalignment caused by video-image heterogeneity. Third, we propose a multi-modal hard example mining, assisting the model in distinguishing highly similar products with fine-grained features across the video-image-text domain. Through extensive quantitative and qualitative experiments, we demonstrate the superior performance of our proposed SGMN model, surpassing the state-of-the-art methods by a substantial margin. The code is available at https://github.com/Huxiaowan/SGMN.
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_16248
institution	arXiv
publishDate	2024
record_format	arxiv
title	Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval
topic	Computer Vision and Pattern Recognition; Multimedia
url	https://arxiv.org/abs/2407.16248
