Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Zhicheng; Liang, Wensheng; Zhuang, Ruiyan; Li, Shuai; Tan, Jianwei; Ma, Xiaoguang
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2403.08420

_version_	1866908508857827328
author	Wang, Zhicheng; Liang, Wensheng; Zhuang, Ruiyan; Li, Shuai; Tan, Jianwei; Ma, Xiaoguang
author_facet	Wang, Zhicheng; Liang, Wensheng; Zhuang, Ruiyan; Li, Shuai; Tan, Jianwei; Ma, Xiaoguang
contents	Action recognition (AR) in industrial environments -- particularly for identifying actions and operational gestures -- faces persistent challenges due to high deployment costs, poor cross-scenario generalization, and limited real-time performance. To address these issues, we propose a low-cost real-time framework for industrial action recognition using foundation models, denoted as LRIAR, to enhance recognition accuracy and transferability while minimizing human annotation and computational overhead. The proposed framework constructs an automatically labeled dataset by coupling Grounding DINO with the pretrained BLIP-2 image encoder, enabling efficient and scalable action labeling. Leveraging the constructed dataset, we train YOLOv5 for real-time action detection, and a Vision Transformer (ViT) classifier is developed via LoRA-based fine-tuning for action classification. Extensive experiments conducted in real-world industrial settings validate the effectiveness of LRIAR, demonstrating consistent improvements over state-of-the-art methods in recognition accuracy, scenario generalization, and deployment efficiency.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_08420
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	A Low-Cost Real-Time Framework for Industrial Action Recognition Using Foundation Models Wang, Zhicheng Liang, Wensheng Zhuang, Ruiyan Li, Shuai Tan, Jianwei Ma, Xiaoguang Computer Vision and Pattern Recognition Action recognition (AR) in industrial environments -- particularly for identifying actions and operational gestures -- faces persistent challenges due to high deployment costs, poor cross-scenario generalization, and limited real-time performance. To address these issues, we propose a low-cost real-time framework for industrial action recognition using foundation models, denoted as LRIAR, to enhance recognition accuracy and transferability while minimizing human annotation and computational overhead. The proposed framework constructs an automatically labeled dataset by coupling Grounding DINO with the pretrained BLIP-2 image encoder, enabling efficient and scalable action labeling. Leveraging the constructed dataset, we train YOLOv5 for real-time action detection, and a Vision Transformer (ViT) classifier is developed via LoRA-based fine-tuning for action classification. Extensive experiments conducted in real-world industrial settings validate the effectiveness of LRIAR, demonstrating consistent improvements over state-of-the-art methods in recognition accuracy, scenario generalization, and deployment efficiency.
title	A Low-Cost Real-Time Framework for Industrial Action Recognition Using Foundation Models
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2403.08420

Similar Items