Bibliographic Details
Main Authors: Chen, Qinghui, Zhang, Zekai, Zhang, Zaigui, Zhang, Kai, Li, Dagang, Wang, Wenmin, Zhang, Jinglin, Liu, Cong
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access: https://arxiv.org/abs/2603.26735
_version_ 1866912985460506624
author Chen, Qinghui
Zhang, Zekai
Zhang, Zaigui
Zhang, Kai
Li, Dagang
Wang, Wenmin
Zhang, Jinglin
Liu, Cong
author_facet Chen, Qinghui
Zhang, Zekai
Zhang, Zaigui
Zhang, Kai
Li, Dagang
Wang, Wenmin
Zhang, Jinglin
Liu, Cong
contents High inter-class similarity, extreme scale variation, and limited computational budgets hinder reliable visual recognition across diverse real-world data. Existing vision-centric and cross-modal approaches often rely on rigid fusion mechanisms and heavy annotation pipelines, leading to sub-optimal generalization. We propose the Distilled Large Language Model (LLM)-Driven Sparse Mixture-of-Experts (DS-MoE) framework, which integrates text-guided dynamic routing and lightweight multi-scale comprehension. The DS-MoE framework dynamically aligns textual semantics with defect-specific visual patterns through a sparse MoE architecture, where task-relevant experts are adaptively activated based on semantic relevance, resolving inter-class ambiguity. A lightweight MobileSAM encoder enables real-time inference while preserving multi-scale defect details. Extensive experiments on PCB, aluminum foil, and mold defect datasets demonstrate that our framework achieves superior performance compared to existing pure vision models. DS-MoE surpasses YOLOv8/YOLOX with gains of +13.9, +1.4, and +2.0 pp mAP@0.5:0.95 on BBMP, aluminum, and PCB, respectively, while also improving precision and recall.
format Preprint
id arxiv_https___arxiv_org_abs_2603_26735
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Distilled Large Language Model-Driven Dynamic Sparse Expert Activation Mechanism
Chen, Qinghui
Zhang, Zekai
Zhang, Zaigui
Zhang, Kai
Li, Dagang
Wang, Wenmin
Zhang, Jinglin
Liu, Cong
Computer Vision and Pattern Recognition
Artificial Intelligence
title Distilled Large Language Model-Driven Dynamic Sparse Expert Activation Mechanism
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2603.26735