Staff View: Semantic-Aware Advanced Persistent Threat Detection Using Autoencoders on LLM-Encoded System Logs :: Library Catalog

Bibliographic Details
Main Authors:	Mohammed, Waleed Khan; Anuar, Zahirul Arief Irfan Bin Shahrul; Mitani, Mousa Sufian Mousa; Karim, Hezerul Abdul; AlDahoul, Nouar
Format:	Preprint
Published:	2026
Subjects:	Cryptography and Security; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.00204

_version_	1866910006670000128
author	Mohammed, Waleed Khan; Anuar, Zahirul Arief Irfan Bin Shahrul; Mitani, Mousa Sufian Mousa; Karim, Hezerul Abdul; AlDahoul, Nouar
author_facet	Mohammed, Waleed Khan; Anuar, Zahirul Arief Irfan Bin Shahrul; Mitani, Mousa Sufian Mousa; Karim, Hezerul Abdul; AlDahoul, Nouar
contents	Advanced Persistent Threats (APTs) are among the most challenging cyberattacks to detect. They are carried out by highly skilled attackers who carefully study their targets and operate in a stealthy, long-term manner. Because APTs exhibit "low-and-slow" behavior, traditional statistical methods and shallow machine learning techniques often fail to detect them. Previous research on APT detection has explored machine learning approaches and provenance graph analysis. However, provenance-based methods often fail to capture the semantic intent behind system activities. This paper proposes a novel anomaly detection approach that leverages semantic embeddings generated by Large Language Models (LLMs). The method enhances APT detection by extracting meaningful semantic representations from unstructured system log data. First, raw system logs are transformed into high-dimensional semantic embeddings using a pre-trained transformer model. These embeddings are then analyzed using an Autoencoder (AE) to identify anomalous and potentially malicious patterns. The proposed method is evaluated using the DARPA Transparent Computing (TC) dataset, which contains realistic APT attack scenarios generated by red teams in live environments. Experimental results show that the AE trained on LLM-derived embeddings outperforms widely used unsupervised baseline methods, including Isolation Forest (IForest), One-Class Support Vector Machine (OC-SVM), and Principal Component Analysis (PCA). Performance is measured using the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), where the proposed approach consistently achieves superior results, even in complex threat scenarios. These findings highlight the importance of semantic understanding in detecting non-linear and stealthy attack behaviors that are often missed by conventional detection techniques.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_00204
institution	arXiv
publishDate	2026
record_format	arxiv
title	Semantic-Aware Advanced Persistent Threat Detection Using Autoencoders on LLM-Encoded System Logs
topic	Cryptography and Security; Artificial Intelligence
url	https://arxiv.org/abs/2602.00204
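
The abstract in this record describes scoring LLM-derived log embeddings with an autoencoder and comparing detectors by AUC-ROC. The sketch below illustrates only the scoring and evaluation step. It assumes mean squared reconstruction error as the anomaly score, which the abstract does not specify, and it uses hypothetical toy vectors in place of real embeddings and autoencoder outputs. The function names reconstruction_error and auc_roc are illustrative and are not taken from the paper.

```python
from typing import Sequence


def reconstruction_error(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Mean squared error between an embedding and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("embedding and reconstruction must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC-ROC as the probability that a random anomalous sample (label 1)
    scores higher than a random benign sample (label 0); ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 2-dimensional "embeddings" and the reconstructions
# an autoencoder might return. The last row stands in for an anomalous event.
embeddings = [[0.1, 0.2], [0.0, 0.1], [0.9, 0.8]]
reconstructions = [[0.1, 0.2], [0.0, 0.2], [0.4, 0.3]]
labels = [0, 0, 1]

errors = [reconstruction_error(x, r) for x, r in zip(embeddings, reconstructions)]
print(auc_roc(errors, labels))  # prints 1.0
```

The pairwise comparison is quadratic in the number of samples, which is fine for illustration; a rank-based computation would be the usual choice for a dataset of realistic size.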

Similar Items