Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Cui, Zhe; Li, Yuli; Tran, Le-Nam
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2504.20178

_version_	1866918003774324736
author	Cui, Zhe; Li, Yuli; Tran, Le-Nam
author_facet	Cui, Zhe; Li, Yuli; Tran, Le-Nam
contents	Current crowd-counting models often rely on single-modal inputs, such as visual images or wireless signal data, which can result in significant information loss and suboptimal recognition performance. To address these shortcomings, we propose TransFusion, a novel multimodal fusion-based crowd-counting model that integrates Channel State Information (CSI) with image data. By leveraging the powerful capabilities of Transformer networks, TransFusion effectively combines these two distinct data modalities, enabling the capture of comprehensive global contextual information that is critical for accurate crowd estimation. However, while transformers are well capable of capturing global features, they potentially fail to identify finer-grained, local details essential for precise crowd counting. To mitigate this, we incorporate Convolutional Neural Networks (CNNs) into the model architecture, enhancing its ability to extract detailed local features that complement the global context provided by the Transformer. Extensive experimental evaluations demonstrate that TransFusion achieves high accuracy with minimal counting errors while maintaining superior efficiency.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_20178
institution	arXiv
publishDate	2025
record_format	arxiv
title	A Transformer-based Multimodal Fusion Model for Efficient Crowd Counting Using Visual and Wireless Signals
topic	Computer Vision and Pattern Recognition; Machine Learning
url	https://arxiv.org/abs/2504.20178