Bibliographic Details
Main Authors: Yakymovych, Andrey; Singh, Abhishek
Format: Preprint
Published: 2024
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2407.08888
author Yakymovych, Andrey
Singh, Abhishek
author_facet Yakymovych, Andrey
Singh, Abhishek
contents Recent threat reports highlight that email remains the top vector for delivering malware to endpoints. Despite these statistics, detection of malicious email attachments and URLs often neglects semantic cues, linguistic features, and contextual clues. Our study employs BERTopic unsupervised topic modeling to identify the common semantics and themes embedded in emails that deliver malicious attachments and call-to-action URLs. We preprocess emails by extracting and sanitizing their content, then use multilingual embedding models such as BGE-M3 to produce dense representations, which clustering algorithms (HDBSCAN and OPTICS) use to group emails by semantic similarity. Phi3-Mini-4K-Instruct facilitates semantic analysis, and hLDA aids thematic analysis, to understand threat actor patterns. Our research will evaluate and compare different clustering algorithms on topic quantity, coherence, and diversity metrics, concluding with insights into the semantics and topics commonly used by threat actors to deliver malicious attachments and URLs, a significant contribution to the field of threat detection.
format Preprint
id arxiv_https___arxiv_org_abs_2407_08888
institution arXiv
publishDate 2024
record_format arxiv
title Uncovering Semantics and Topics Utilized by Threat Actors to Deliver Malicious Attachments and URLs
topic Machine Learning
url https://arxiv.org/abs/2407.08888
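The abstract says the clustering algorithms are compared on topic quantity, coherence, and diversity. As a rough illustration of two of those metrics, the sketch below computes topic quantity (the number of non-noise clusters) and topic diversity (the share of unique words among the top words of all topics, a commonly used definition that the record itself does not confirm). The function names, the example topics, and the cluster labels are hypothetical and are not taken from the preprint; the only outside fact relied on is that common HDBSCAN and OPTICS implementations mark noise points with the label -1.

```python
from typing import Dict, List


def topic_quantity(labels: List[int], noise_label: int = -1) -> int:
    """Count distinct clusters, ignoring the noise label (assumed to be -1)."""
    return len(set(labels) - {noise_label})


def topic_diversity(topics: Dict[int, List[str]], top_k: int = 10) -> float:
    """Fraction of unique words among the top_k words of every topic.

    1.0 means no word is shared between topics; values near 0 mean the
    topics are largely redundant.
    """
    top_words = [word for words in topics.values() for word in words[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Hypothetical per-email cluster labels from two clustering runs.
hdbscan_labels = [0, 0, 1, -1, 1]
optics_labels = [0, 1, 2, 2, -1, 0]

# Hypothetical top words per topic for the same two runs.
hdbscan_topics = {
    0: ["invoice", "payment", "attached", "overdue"],
    1: ["password", "verify", "account", "login"],
}
optics_topics = {
    0: ["invoice", "payment", "attached", "account"],
    1: ["password", "verify", "account", "login"],
    2: ["invoice", "shipment", "tracking", "attached"],
}

for name, labels, topics in [
    ("HDBSCAN", hdbscan_labels, hdbscan_topics),
    ("OPTICS", optics_labels, optics_topics),
]:
    quantity = topic_quantity(labels)
    diversity = topic_diversity(topics, top_k=4)
    print(f"{name}: topics={quantity}, diversity={diversity:.2f}")
```

In this toy data the second run finds more topics but reuses words across them, so its diversity is lower. That trade-off between topic count and redundancy is the kind of comparison the abstract describes; coherence is left out here because it needs a reference corpus.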