Bibliographic Details
Main Authors: Gondal, Moazzam Umer, Qudous, Hamad Ul, Siddiqui, Daniya, Farhan, Asma Ahmad
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
Online Access: https://arxiv.org/abs/2511.19149
author Gondal, Moazzam Umer
Qudous, Hamad Ul
Siddiqui, Daniya
Farhan, Asma Ahmad
contents This paper introduces a retrieval-augmented framework for automatic fashion caption and hashtag generation that combines multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically interesting text for fashion imagery, overcoming the limitations of end-to-end captioners, which struggle with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant color extraction, and a CLIP-FAISS retrieval module that infers fabric and gender attributes from a structured product index. These attributes, together with retrieved style examples, form a factual evidence pack that guides an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model serves as a supervised baseline for comparison. Experimental results show that the YOLO detector attains a mean Average Precision (mAP@0.5) of 0.71 across nine garment categories. The RAG-LLM pipeline generates expressive, attribute-aligned captions and achieves a mean attribute coverage of 0.80, with full coverage at the 50% threshold in hashtag generation, whereas BLIP yields higher lexical overlap but lower generalization. The retrieval-augmented approach exhibits better factual grounding, less hallucination, and strong potential for scalable deployment across clothing domains. These results demonstrate that retrieval-augmented generation is an effective and interpretable paradigm for automated, visually grounded fashion content generation.
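The abstract names k-means clustering as the method for dominant color extraction. A minimal sketch of that idea follows, under assumptions: this record does not state the number of clusters, the color space, or the library the authors used, so the function name `dominant_color` and the parameters `k`, `iters`, and `seed` are hypothetical, and plain NumPy on RGB pixels stands in for whatever the paper actually does.

```python
import numpy as np


def dominant_color(pixels, k=3, iters=20, seed=0):
    """Return (centroid, share) of the largest k-means cluster.

    pixels: array-like of RGB values for one garment crop, any shape
    that reshapes to (N, 3). All parameters are illustrative assumptions.
    """
    pixels = np.asarray(pixels, dtype=np.float64).reshape(-1, 3)
    rng = np.random.default_rng(seed)

    # Initialise centroids from distinct pixel values so no two start equal.
    unique = np.unique(pixels, axis=0)
    k = min(k, len(unique))
    centroids = unique[rng.choice(len(unique), size=k, replace=False)]

    def assign(points, centers):
        # Nearest centroid by squared Euclidean distance.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    for _ in range(iters):
        labels = assign(pixels, centroids)
        # Update step: move each centroid to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = pixels[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final assignment against the final centroids, then pick the largest cluster.
    labels = assign(pixels, centroids)
    counts = np.bincount(labels, minlength=k)
    best = counts.argmax()
    return centroids[best], counts[best] / len(pixels)
```

In a pipeline like the one described, the input would presumably be the pixels inside a detected garment box, and the returned centroid would then be mapped to a color name; that mapping step is not described in this record and is left out here.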
format Preprint
id arxiv_https___arxiv_org_abs_2511_19149
institution arXiv
publishDate 2025
record_format arxiv
title From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation
topic Computer Vision and Pattern Recognition
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2511.19149