| Main Authors: | Wang, Zhicheng; Liang, Wensheng; Zhuang, Ruiyan; Li, Shuai; Tan, Jianwei; Ma, Xiaoguang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2403.08420 |
| _version_ | 1866908508857827328 |
|---|---|
| author | Wang, Zhicheng; Liang, Wensheng; Zhuang, Ruiyan; Li, Shuai; Tan, Jianwei; Ma, Xiaoguang |
| author_facet | Wang, Zhicheng; Liang, Wensheng; Zhuang, Ruiyan; Li, Shuai; Tan, Jianwei; Ma, Xiaoguang |
| contents | Action recognition (AR) in industrial environments -- particularly for identifying actions and operational gestures -- faces persistent challenges due to high deployment costs, poor cross-scenario generalization, and limited real-time performance. To address these issues, we propose a low-cost real-time framework for industrial action recognition using foundation models, denoted as LRIAR, to enhance recognition accuracy and transferability while minimizing human annotation and computational overhead. The proposed framework constructs an automatically labeled dataset by coupling Grounding DINO with the pretrained BLIP-2 image encoder, enabling efficient and scalable action labeling. Leveraging the constructed dataset, we train YOLOv5 for real-time action detection and develop a Vision Transformer (ViT) classifier via LoRA-based fine-tuning for action classification. Extensive experiments conducted in real-world industrial settings validate the effectiveness of LRIAR, demonstrating consistent improvements over state-of-the-art methods in recognition accuracy, scenario generalization, and deployment efficiency. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2403_08420 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | A Low-Cost Real-Time Framework for Industrial Action Recognition Using Foundation Models |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2403.08420 |
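
The abstract mentions LoRA-based fine-tuning of a ViT classifier. As a rough illustration of the general LoRA idea only (not the authors' implementation), the sketch below keeps a pretrained weight matrix frozen and learns a low-rank update `B @ A` scaled by `alpha / rank`. The class name `LoRALinear`, its parameters, and the toy dimensions are all hypothetical, and plain Python lists stand in for tensors.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Toy linear layer with a frozen weight and a trainable low-rank update.

    Hypothetical illustration: effective weight = W + (alpha / rank) * B @ A.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.weight = weight  # frozen pretrained weight, shape (d_out, d_in)
        d_out, d_in = len(weight), len(weight[0])
        self.scale = alpha / rank
        # A gets a small random init; B starts at zero so the layer
        # initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def effective_weight(self):
        delta = matmul(self.B, self.A)  # low-rank update, shape (d_out, d_in)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to a single input vector x of length d_in."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Count only the adapter parameters (A and B); W stays frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
print(layer.forward([1.0, 1.0]), layer.num_trainable())
```

Only `A` and `B` would be updated during fine-tuning, so the number of trainable parameters grows with the chosen rank rather than with the full weight size. This is the general reason LoRA is associated with low computational overhead.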