March 3, 2026 | Open Access

Tokenization Strategies for Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider

Key Points

Anomaly detection flags events that deviate from the expected distributions of particle physics events.
Trained only on background events with masked-token prediction, the model learns to reconstruct expected Standard Model processes.
The tokenization strategies are tested on simulated LHC Run 2 proton-proton collision data that replicate ATLAS conditions.
Integrating machine learning with physics principles supports adaptive searches for new physics beyond the Standard Model.

Abstract

Advances in Machine Learning, particularly Large Language Models (LLMs), enable more efficient interaction with complex datasets through tokenization and next- or masked-token prediction, providing a novel framework for analysing high-energy physics datasets. We explore strategies for representing particle physics data as token sequences, enabling LLM-inspired models to learn event distributions and detect anomalies in proton-proton collisions at the Large Hadron Collider (LHC). By training solely on background events, the model reconstructs expected physics processes, learning properties of the given Standard Model (SM) processes. Deviations in reconstruction scores during inference flag anomalous events, providing a data-driven approach to identify rare signatures or physics beyond the Standard Model (BSM). The method is tested using simulated LHC Run 2 ($\sqrt{s} = 13$ TeV) proton-proton collision data from the Dark Machines Collaboration, replicating ATLAS conditions, focusing on SM and BSM four-top-quark final states. These tokenization strategies enable anomaly detection and suggest a path toward foundation models for the LHC and beyond, integrating state-of-the-art ML with physics principles to advance adaptive, data-driven searches for new physics.
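The idea in the abstract can be illustrated with a minimal Python sketch: tokenize events, learn from background only, and score events by how poorly a masked token is predicted. This is not the authors' implementation. The bin edges, token names, and helper functions (`bin_index`, `tokenize_event`, `fit_background`, `anomaly_score`) are hypothetical, and a simple per-position frequency table with Laplace smoothing stands in for the trained masked-token model, which the abstract does not specify.

```python
import math
from collections import Counter

# Hypothetical transverse-momentum bin edges in GeV (assumption, not from the paper).
PT_EDGES = [0.0, 50.0, 100.0, 200.0, 400.0]
# Hypothetical vocabulary size: 2 object kinds x 5 pT bins (4 regular + overflow).
VOCAB_SIZE = 10


def bin_index(value, edges):
    """Return the index of the bin containing value; overflow maps to the last index."""
    for i in range(len(edges) - 1):
        if value < edges[i + 1]:
            return i
    return len(edges) - 1


def tokenize_event(objects):
    """Turn a list of (kind, pt) physics objects into discrete tokens like 'jet_pt2'."""
    return [f"{kind}_pt{bin_index(pt, PT_EDGES)}" for kind, pt in objects]


def fit_background(token_sequences):
    """Count tokens per position over background events.

    Stand-in for training a masked-token model on background only.
    """
    counts = {}
    for seq in token_sequences:
        for pos, tok in enumerate(seq):
            counts.setdefault(pos, Counter())[tok] += 1
    return counts


def anomaly_score(seq, counts, vocab_size=VOCAB_SIZE):
    """Mean negative log-probability of each token when that token is the masked one."""
    if not seq:
        return 0.0
    total = 0.0
    for pos, tok in enumerate(seq):
        position_counts = counts.get(pos, Counter())
        n = sum(position_counts.values())
        # Laplace smoothing keeps unseen tokens at a small non-zero probability.
        p = (position_counts[tok] + 1) / (n + vocab_size)
        total -= math.log(p)
    return total / len(seq)


# Toy background: 50 identical two-jet events (illustrative only, not real data).
background = [tokenize_event([("jet", 60.0), ("jet", 40.0)]) for _ in range(50)]
model = fit_background(background)

typical = tokenize_event([("jet", 60.0), ("jet", 40.0)])
unusual = tokenize_event([("lepton", 450.0), ("jet", 40.0)])
print(anomaly_score(typical, model), anomaly_score(unusual, model))
```

In this toy setup the event containing a high-momentum lepton receives a higher score than the background-like event, because its first token was never seen at that position during fitting. A real analysis would replace the frequency table with a learned sequence model and choose a score threshold from the background score distribution.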

Authors

Ambre Visive

Roberto Ruiz de Austri

Polina Moskvitina
