Staff View: EffiMiniVLM: A Compact Dual-Encoder Regression Framework :: Library Catalog

Bibliographic Details
Main Authors:	Khor, Yin-Loon; Wong, Yi-Jie; Hum, Yan Chai
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.03172

_version_	1866918426765688832
author	Khor, Yin-Loon; Wong, Yi-Jie; Hum, Yan Chai
author_facet	Khor, Yin-Loon; Wong, Yi-Jie; Hum, Yan Chai
contents	Predicting product quality from multimodal item information is critical in cold-start scenarios, where user interaction history is unavailable and predictions must rely on images and textual metadata. However, existing vision-language models typically depend on large architectures and/or extensive external datasets, resulting in high computational cost. To address this, we propose EffiMiniVLM, a compact dual-encoder vision-language regression framework that integrates an EfficientNet-B0 image encoder and a MiniLM-based text encoder with a lightweight regression head. To improve training sample efficiency, we introduce a weighted Huber loss that leverages rating counts to emphasize more reliable samples, yielding consistent performance gains. Trained using only 20% of the Amazon Reviews 2023 dataset, the proposed model contains 27.7M parameters and requires 6.8 GFLOPs, yet achieves a CES score of 0.40 with the lowest resource cost in the benchmark. Despite its small size, it remains competitive with significantly larger models, achieving comparable performance while being approximately 4x to 8x more resource-efficient than other top-5 methods and being the only approach that does not use external datasets. Further analysis shows that scaling the data to 40% alone allows our model to overtake other methods, which use larger models and datasets, highlighting strong scalability despite the model's compact design.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_03172
institution	arXiv
publishDate	2026
record_format	arxiv
title	EffiMiniVLM: A Compact Dual-Encoder Regression Framework
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2604.03172
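
The abstract in the contents field mentions a weighted Huber loss that uses rating counts to emphasize more reliable samples, but the record does not give the weighting formula. The following is a minimal sketch of one plausible form, not the authors' implementation. The weight function log(1 + rating_count), the normalization by the weight sum, the default delta of 1.0, and all function names are assumptions made for illustration.

```python
import math


def huber(residual, delta=1.0):
    """Standard Huber loss for one residual: quadratic near zero, linear beyond delta."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def weighted_huber_loss(preds, targets, rating_counts, delta=1.0):
    """Weighted mean of per-sample Huber losses.

    Assumption (not from the record): each sample's weight is
    log(1 + rating_count), so items with more ratings count for more,
    with diminishing returns. Weights are normalized by their sum.
    """
    weights = [math.log1p(c) for c in rating_counts]
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one sample needs a positive rating count")
    weighted = sum(
        w * huber(p - t, delta) for w, p, t in zip(weights, preds, targets)
    )
    return weighted / total


if __name__ == "__main__":
    # Two items: the second has far more ratings, so its error dominates.
    print(weighted_huber_loss([4.0, 3.0], [4.5, 5.0], [2, 500]))
```

Under this assumed scheme, a sample with zero ratings contributes nothing to the loss, and a sample with many ratings contributes more without dominating linearly.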

Similar Items