Bibliographic Details
Main Authors: Tsoi, Ho Fung, Rankin, Dylan
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2601.11719
_version_ 1866914503170457600
author Tsoi, Ho Fung
Rankin, Dylan
author_facet Tsoi, Ho Fung
Rankin, Dylan
contents Self-supervised learning, in the context of foundation model training, is a powerful pre-training method for learning feature representations without labels, which often capture generic underlying semantics from the data and can later be fine-tuned for downstream tasks. In this work, we introduce jBOT, a pre-training method based on self-distillation for jet data from the CERN Large Hadron Collider, which combines local particle-level distillation with global jet-level distillation to learn jet representations that support downstream tasks such as anomaly detection and classification. We observe that pre-training on unlabeled jets leads to emergent semantic class clustering in the representation space. The clustering in the frozen embedding, when pre-trained on background jets only, enables anomaly detection via simple distance-based metrics, and the learned embedding can be fine-tuned for classification with improved performance compared to supervised models trained from scratch.
format Preprint
id arxiv_https___arxiv_org_abs_2601_11719
institution arXiv
publishDate 2026
record_format arxiv
title jBOT: Semantic Jet Representation Clustering Emerges from Self-Distillation
topic Machine Learning
High Energy Physics - Experiment
url https://arxiv.org/abs/2601.11719