| Main Authors: | Tsoi, Ho Fung; Rankin, Dylan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Machine Learning; High Energy Physics - Experiment |
| Online Access: | https://arxiv.org/abs/2601.11719 |
| _version_ | 1866914503170457600 |
|---|---|
| author | Tsoi, Ho Fung; Rankin, Dylan |
| author_facet | Tsoi, Ho Fung; Rankin, Dylan |
| contents | Self-supervised learning, in the context of foundation model training, is a powerful pre-training method for learning feature representations without labels, which often capture generic underlying semantics from the data and can later be fine-tuned for downstream tasks. In this work, we introduce jBOT, a pre-training method based on self-distillation for jet data from the CERN Large Hadron Collider, which combines local particle-level distillation with global jet-level distillation to learn jet representations that support downstream tasks such as anomaly detection and classification. We observe that pre-training on unlabeled jets leads to emergent semantic class clustering in the representation space. The clustering in the frozen embedding, when pre-trained on background jets only, enables anomaly detection via simple distance-based metrics, and the learned embedding can be fine-tuned for classification with improved performance compared to supervised models trained from scratch. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2601_11719 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | jBOT: Semantic Jet Representation Clustering Emerges from Self-Distillation |
| topic | Machine Learning; High Energy Physics - Experiment |
| url | https://arxiv.org/abs/2601.11719 |