Bibliographic Details
Main Authors: Miguel, Carlos Jimeno; Orduna, Raul; Zola, Francesco
Format: Preprint
Published: 2026
Subjects: Machine Learning; Artificial Intelligence
Online Access: https://arxiv.org/abs/2604.09016
author Miguel, Carlos Jimeno
Orduna, Raul
Zola, Francesco
contents This study addresses the challenge of creating datasets for cybercrime analysis while complying with regulations such as the General Data Protection Regulation (GDPR) and Organic Law 10/1995 of the Penal Code. To this end, it proposes a system that collects text, audio, and images from the Telegram platform; applies speech-to-text transcription models with signal enhancement techniques; and evaluates several Named Entity Recognition (NER) solutions, including Microsoft Presidio and transformer-based AI models. Experimental results indicate that Parakeet achieves the best audio transcription performance, while the proposed NER solutions achieve the highest F1-scores in detecting sensitive information. In addition, anonymization metrics are presented for evaluating how well structural coherence is preserved in the data while personal information remains protected, supporting cybersecurity research within the current legal framework.
format Preprint
id arxiv_https___arxiv_org_abs_2604_09016
institution arXiv
publishDate 2026
record_format arxiv
title Identification and Anonymization of Named Entities in Unstructured Information Sources for Use in Social Engineering Detection
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2604.09016