Staff View: SAViL-Det: Semantic-Aware Vision-Language Model for Multi-Script Text Detection :: Library Catalog

Bibliographic Details
Main Authors:	Zighem, Mohammed-En-Nadhir; Hadid, Abdenour
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2507.20188

_version_	1866912504386420736
author	Zighem, Mohammed-En-Nadhir; Hadid, Abdenour
author_facet	Zighem, Mohammed-En-Nadhir; Hadid, Abdenour
contents	Detecting text in natural scenes remains challenging, particularly for diverse scripts and arbitrarily shaped instances where visual cues alone are often insufficient. Existing methods do not fully leverage semantic context. This paper introduces SAViL-Det, a novel semantic-aware vision-language model that enhances multi-script text detection by effectively integrating textual prompts with visual features. SAViL-Det utilizes a pre-trained CLIP model combined with an Asymptotic Feature Pyramid Network (AFPN) for multi-scale visual feature fusion. The core of the proposed framework is a novel language-vision decoder that adaptively propagates fine-grained semantic information from text prompts to visual features via cross-modal attention. Furthermore, a text-to-pixel contrastive learning mechanism explicitly aligns textual and corresponding visual pixel features. Extensive experiments on challenging benchmarks demonstrate the effectiveness of the proposed approach, achieving state-of-the-art performance with F-scores of 84.8% on the benchmark multi-lingual MLT-2019 dataset and 90.2% on the curved-text CTW1500 dataset.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_20188
institution	arXiv
publishDate	2025
record_format	arxiv
title	SAViL-Det: Semantic-Aware Vision-Language Model for Multi-Script Text Detection
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2507.20188
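
The contents field above mentions a text-to-pixel contrastive learning mechanism that aligns textual features with the corresponding visual pixel features, but the record does not give the loss itself. The following is a minimal sketch of one common formulation, assumed here for illustration only and not confirmed as the authors' method: each pixel feature is scored against a single text embedding by a dot product, passed through a sigmoid, and penalized with binary cross-entropy against a text/non-text mask. The names `text_to_pixel_loss`, `text_vec`, `pixel_feats`, and `text_mask` are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def text_to_pixel_loss(text_vec, pixel_feats, text_mask, eps=1e-7):
    """Mean binary cross-entropy between sigmoid(text . pixel) and the mask.

    ASSUMPTION: this sigmoid/BCE form is one common text-to-pixel
    contrastive loss; the catalog record does not state the exact loss.
    text_mask[i] is 1 if pixel i belongs to a text region, else 0.
    """
    total = 0.0
    for feat, label in zip(pixel_feats, text_mask):
        p = sigmoid(dot(text_vec, feat))
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -math.log(p) if label == 1 else -math.log(1.0 - p)
    return total / len(pixel_feats)


# Toy data: three "pixels" with 2-D features and one text embedding.
text_vec = [1.0, 0.0]
pixel_feats = [[4.0, 0.0], [-4.0, 0.0], [3.0, 1.0]]

aligned = text_to_pixel_loss(text_vec, pixel_feats, [1, 0, 1])
misaligned = text_to_pixel_loss(text_vec, pixel_feats, [0, 1, 0])
print(aligned < misaligned)
```

In this toy setup the loss is small when the mask marks exactly the pixels whose features point along the text embedding, and large when the mask is inverted, which is the behavior an alignment objective of this kind is meant to reward.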

Similar Items