Bibliographic Details
Main Authors: Wauyo, Peter; Bwiza, Dalia; Murara, Alain; Mugume, Edwin; Umuhoza, Eric
Format: Preprint
Published: 2025
Subjects: Software Engineering
Online Access: https://arxiv.org/abs/2510.02165
_version_ 1866916985698254848
author Wauyo, Peter
Bwiza, Dalia
Murara, Alain
Mugume, Edwin
Umuhoza, Eric
contents This research introduces a multimodal system that detects fraud and fare evasion in public transportation by analyzing closed-circuit television (CCTV) and audio data. The proposed solution uses the Vision Transformer for Video (ViViT) model for video feature extraction and the Audio Spectrogram Transformer (AST) for audio analysis. The system implements a Tensor Fusion Network (TFN) architecture that explicitly models unimodal and bimodal interactions through a 2-fold Cartesian product. This fusion technique captures complex cross-modal dynamics between visual behaviors (e.g., tailgating, unauthorized access) and audio cues (e.g., fare transaction sounds). The system was trained and tested on a custom dataset, achieving an accuracy of 89.5%, a precision of 87.2%, and a recall of 84.0% in detecting fraudulent activities. It significantly outperforms early fusion baselines and exceeds the 75% recall rates typically reported for state-of-the-art transportation fraud detection systems. Ablation studies show that the tensor fusion approach improves the F1 score by 7.0% and recall by 8.8% compared with traditional concatenation methods. The solution supports real-time detection, enabling public transport operators to reduce revenue loss, improve passenger safety, and ensure operational compliance.
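The fusion step described in the abstract can be illustrated with a minimal sketch. It assumes the system follows the usual Tensor Fusion Network formulation, in which each unimodal embedding is extended with a constant 1 before the outer product is taken, so that the fused tensor holds the video-only terms, the audio-only terms, and every pairwise video-audio interaction. The function name `tensor_fusion` and the toy vectors are hypothetical and are not taken from the preprint; real ViViT and AST embeddings would be far larger.

```python
from typing import List


def tensor_fusion(video_feat: List[float], audio_feat: List[float]) -> List[float]:
    """Fuse two unimodal embeddings with an outer product (TFN-style sketch)."""
    # Appending a constant 1 preserves each modality's own features in the
    # product, alongside the pairwise video-audio interaction terms.
    v = list(video_feat) + [1.0]
    a = list(audio_feat) + [1.0]
    # Outer product, flattened row by row into len(v) * len(a) values.
    return [vi * aj for vi in v for aj in a]


# Hypothetical toy embeddings standing in for ViViT and AST outputs.
video = [0.5, -1.0]
audio = [2.0, 0.0, 1.0]
fused = tensor_fusion(video, audio)

# (2 + 1) * (3 + 1) fused features.
print(len(fused))
```

By contrast, plain concatenation would yield only 2 + 3 = 5 features with no multiplicative cross-modal terms, which is the kind of baseline the abstract's ablation compares against. The flattened vector would typically feed a small classifier head; that stage is omitted here.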
format Preprint
id arxiv_https___arxiv_org_abs_2510_02165
institution arXiv
publishDate 2025
record_format arxiv
title Towards fairer public transit: Real-time tensor-based multimodal fare evasion and fraud detection
topic Software Engineering
url https://arxiv.org/abs/2510.02165