Bibliographic Details
Main Authors: Ramadan, Tarek, Abdou, AbdelRahman, Mannan, Mohammad, Youssef, Amr
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2602.21826
author Ramadan, Tarek
Abdou, AbdelRahman
Mannan, Mohammad
Youssef, Amr
contents A large number of URLs are made public by various platforms for security analysis, archiving, and paste sharing -- such as VirusTotal, URLScan.io, Hybrid Analysis, the Wayback Machine, and RedHunt. These services may unintentionally expose links containing sensitive information, as reported in some news articles and blog posts. However, no large-scale measurement has quantified the extent of such exposures. We present an automated system that detects and analyzes potential sensitive information leaked through publicly accessible URLs. The system combines lexical URL filtering, dynamic rendering, OCR-based extraction, and content classification to identify potential leaks. We apply it to 6,094,475 URLs collected from public scanning platforms, paste sites, and web archives, identifying 12,331 potential exposures across authentication, financial, personal, and document-related domains. These findings show that sensitive information remains exposed, underscoring the importance of automated detection to identify accidental leaks.
format Preprint
id arxiv_https___arxiv_org_abs_2602_21826
institution arXiv
publishDate 2026
record_format arxiv
title The Silent Spill: Measuring Sensitive Data Leaks Across Public URL Repositories
topic Cryptography and Security
url https://arxiv.org/abs/2602.21826
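
The abstract names lexical URL filtering as the first stage of the detection system but gives no detail on the rules used. The following is a minimal sketch of what such a filter might look like, not the authors' implementation: the keyword lists, the function name `lexical_flags`, and the flag format are all hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical hint lists; the actual rules used in the paper are not
# stated in this record.
SENSITIVE_PARAM_HINTS = ("token", "password", "passwd", "secret",
                         "api_key", "apikey", "session", "auth")
SENSITIVE_PATH_HINTS = ("invoice", "reset", "receipt", "statement")


def lexical_flags(url: str) -> list[str]:
    """Return reasons why a URL looks sensitive, based only on its text."""
    parts = urlsplit(url)
    flags = []

    # Flag query parameters whose name contains a sensitive hint and
    # which carry a non-empty value.
    for name, value in parse_qsl(parts.query):
        lowered = name.lower()
        if value and any(hint in lowered for hint in SENSITIVE_PARAM_HINTS):
            flags.append(f"param:{name}")

    # Flag path segments that suggest documents or account workflows.
    path = parts.path.lower()
    for hint in SENSITIVE_PATH_HINTS:
        if hint in path:
            flags.append(f"path:{hint}")

    return flags


if __name__ == "__main__":
    print(lexical_flags("https://example.com/reset?token=abc123&lang=en"))
```

A substring filter of this kind is cheap enough to run over millions of URLs, but it over-matches (for example, `author` contains `auth`), which is presumably why the described system follows it with rendering, OCR, and content classification before counting a URL as a potential exposure.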