Bibliographic Details
Main Authors: Zighem, Mohammed-En-Nadhir, Hadid, Abdenour
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2507.20188
_version_ 1866912504386420736
author Zighem, Mohammed-En-Nadhir
Hadid, Abdenour
author_facet Zighem, Mohammed-En-Nadhir
Hadid, Abdenour
contents Detecting text in natural scenes remains challenging, particularly for diverse scripts and arbitrarily shaped instances where visual cues alone are often insufficient. Existing methods do not fully leverage semantic context. This paper introduces SAViL-Det, a novel semantic-aware vision-language model that enhances multi-script text detection by effectively integrating textual prompts with visual features. SAViL-Det utilizes a pre-trained CLIP model combined with an Asymptotic Feature Pyramid Network (AFPN) for multi-scale visual feature fusion. The core of the proposed framework is a novel language-vision decoder that adaptively propagates fine-grained semantic information from text prompts to visual features via cross-modal attention. Furthermore, a text-to-pixel contrastive learning mechanism explicitly aligns textual and corresponding visual pixel features. Extensive experiments on challenging benchmarks demonstrate the effectiveness of the proposed approach, achieving state-of-the-art performance with F-scores of 84.8% on the benchmark multi-lingual MLT-2019 dataset and 90.2% on the curved-text CTW1500 dataset.
format Preprint
id arxiv_https___arxiv_org_abs_2507_20188
institution arXiv
publishDate 2025
record_format arxiv
title SAViL-Det: Semantic-Aware Vision-Language Model for Multi-Script Text Detection
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2507.20188