Bibliographic Details
Main Authors: Wang, Zhicheng, Liang, Wensheng, Zhuang, Ruiyan, Li, Shuai, Tan, Jianwei, Ma, Xiaoguang
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2403.08420
_version_ 1866908508857827328
author Wang, Zhicheng
Liang, Wensheng
Zhuang, Ruiyan
Li, Shuai
Tan, Jianwei
Ma, Xiaoguang
author_facet Wang, Zhicheng
Liang, Wensheng
Zhuang, Ruiyan
Li, Shuai
Tan, Jianwei
Ma, Xiaoguang
contents Action recognition (AR) in industrial environments -- particularly for identifying actions and operational gestures -- faces persistent challenges due to high deployment costs, poor cross-scenario generalization, and limited real-time performance. To address these issues, we propose a low-cost real-time framework for industrial action recognition using foundation models, denoted as LRIAR, to enhance recognition accuracy and transferability while minimizing human annotation and computational overhead. The proposed framework constructs an automatically labeled dataset by coupling Grounding DINO with the pretrained BLIP-2 image encoder, enabling efficient and scalable action labeling. Leveraging the constructed dataset, we train YOLOv5 for real-time action detection, and a Vision Transformer (ViT) classifier is developed via LoRA-based fine-tuning for action classification. Extensive experiments conducted in real-world industrial settings validate the effectiveness of LRIAR, demonstrating consistent improvements over state-of-the-art methods in recognition accuracy, scenario generalization, and deployment efficiency.
format Preprint
id arxiv_https___arxiv_org_abs_2403_08420
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle A Low-Cost Real-Time Framework for Industrial Action Recognition Using Foundation Models
Wang, Zhicheng
Liang, Wensheng
Zhuang, Ruiyan
Li, Shuai
Tan, Jianwei
Ma, Xiaoguang
Computer Vision and Pattern Recognition
Action recognition (AR) in industrial environments -- particularly for identifying actions and operational gestures -- faces persistent challenges due to high deployment costs, poor cross-scenario generalization, and limited real-time performance. To address these issues, we propose a low-cost real-time framework for industrial action recognition using foundation models, denoted as LRIAR, to enhance recognition accuracy and transferability while minimizing human annotation and computational overhead. The proposed framework constructs an automatically labeled dataset by coupling Grounding DINO with the pretrained BLIP-2 image encoder, enabling efficient and scalable action labeling. Leveraging the constructed dataset, we train YOLOv5 for real-time action detection, and a Vision Transformer (ViT) classifier is developed via LoRA-based fine-tuning for action classification. Extensive experiments conducted in real-world industrial settings validate the effectiveness of LRIAR, demonstrating consistent improvements over state-of-the-art methods in recognition accuracy, scenario generalization, and deployment efficiency.
title A Low-Cost Real-Time Framework for Industrial Action Recognition Using Foundation Models
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2403.08420