Researchers are now applying the principles that underpin the success of Large Language Models to the challenging task of identifying high-energy particle jets. Matthias Vigl, Nicole Hartman, Michael Kagan and Lukas Heinrich, working collaboratively across the Technical University of Munich and SLAC National Accelerator Laboratory, have investigated neural scaling laws for boosted jet classification using the publicly available JetClass dataset. Their work demonstrates that scaling up compute, through both model capacity and dataset size, consistently improves performance, a finding particularly significant given the comparatively limited compute currently employed in High Energy Physics data analysis. By deriving compute-optimal scaling laws and quantifying the impact of data repetition, this study establishes a pathway towards maximising performance gains and highlights the potential for more expressive input features to further enhance results.
Scientists are applying techniques from artificial intelligence to accelerate discoveries at the Large Hadron Collider. Analysing particle collisions generates enormous datasets, demanding ever more powerful computational methods. This work demonstrates that increasing computing power can unlock substantial gains in identifying important signals within this data.
Recent work demonstrates that increasing both the size of the model and the amount of training data consistently pushes performance towards a predictable limit, a principle already well-established in fields like natural language processing and computer vision. The study employed the JetClass dataset, a collection of 100 million simulated jets, to systematically investigate the relationship between compute, model size, and classification accuracy.
Researchers found that performance gains are not simply a matter of throwing more data at a problem; instead, there is an optimal balance between model capacity and dataset size. Data repetition, a common practice in particle physics because generating simulated events is computationally expensive, effectively enlarges the training set and allows a quantifiable understanding of how efficiently the available data is being used.
The work identifies an “irreducible loss”, a fundamental limit to how well these models can perform. Moreover, the choice of input features significantly impacts this limit, with more expressive, lower-level features enabling better results at a given dataset size. At the heart of this work lies the use of a set Transformer encoder architecture, processing each jet as a variable-length set of constituent particles, with every particle described by 21 features encompassing kinematic variables, particle identification, and track parameters.
The team sorted particles by transverse momentum to ensure a deterministic truncation policy when altering the number of particles considered, allowing for consistent and comparable results. Initial experiments revealed a clear power-law relationship between model performance and compute, with boosted jet classification accuracy steadily improving as computational resources increased.
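As an illustration, here is a minimal sketch of what such a pT-ordered truncation and padding step might look like, written in plain NumPy. The function name, array layout, 21-feature width and 128-particle cap mirror the description above but are otherwise illustrative assumptions rather than the authors' code.

```python
import numpy as np

def truncate_jet(features: np.ndarray, pt: np.ndarray, max_particles: int = 128):
    """Sort a jet's constituents by descending transverse momentum, then
    truncate (or zero-pad) to a fixed length so results stay comparable.

    features : (n_particles, 21) array of per-particle inputs (illustrative layout)
    pt       : (n_particles,) array of transverse momenta used for the ordering
    """
    order = np.argsort(-pt)                        # hardest particles first
    features = features[order][:max_particles]     # deterministic truncation

    n = features.shape[0]
    padded = np.zeros((max_particles, features.shape[1]), dtype=features.dtype)
    padded[:n] = features
    mask = np.zeros(max_particles, dtype=bool)     # True where a real particle sits
    mask[:n] = True
    return padded, mask

# Example: a toy jet with 60 constituents and 21 features each.
rng = np.random.default_rng(0)
feats, mask = truncate_jet(rng.normal(size=(60, 21)), rng.uniform(1, 500, size=60))
print(feats.shape, int(mask.sum()))  # (128, 21) 60
```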
Performance scaling with compute and data repetition in jet classification
Specifically, performance plateaued at a validation loss of 0.185 after training on the complete JetClass dataset with the largest model configuration tested. This represents a substantial improvement over prior HEP models, which typically achieved losses between 0.25 and 0.30 on the same benchmark. Data repetition yielded an effective dataset-size gain of 1.8x, meaning that repeating the existing dataset nearly doubled its impact on model training.
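One way such an effective dataset-size gain can be quantified is by inverting a fitted data-scaling curve: train with repeated data, read off the loss, and ask how much unique data would have produced the same value. The sketch below assumes a fit of the form L(D) = L_inf + b·D^(-beta); the coefficients and the hypothetical loss value are placeholders chosen only so the toy numbers land near a gain of that size, not the paper's fitted values.

```python
def data_scaling(n_jets: float, L_inf: float, b: float, beta: float) -> float:
    """Assumed data-scaling form: validation loss versus unique training jets."""
    return L_inf + b * n_jets ** (-beta)

def effective_dataset_size(loss: float, L_inf: float, b: float, beta: float) -> float:
    """Invert the fit: how many unique jets would give this loss?"""
    return ((loss - L_inf) / b) ** (-1.0 / beta)

# Placeholder coefficients, chosen for illustration only.
L_inf, b, beta = 0.180, 2.0, 0.3
unique_jets = 1e8                                             # one pass over 100M unique jets
loss_single_pass = data_scaling(unique_jets, L_inf, b, beta)  # ~0.188 for this toy fit
loss_with_repeats = 0.1867                                    # hypothetical loss after repeating

gain = effective_dataset_size(loss_with_repeats, L_inf, b, beta) / unique_jets
print(f"single-pass loss {loss_single_pass:.4f} -> effective gain ~{gain:.1f}x")
```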
Analysis of scaling coefficients showed variation depending on input features and particle multiplicity. Models trained with lower-level, more expressive features, directly utilising particle momenta and energies, consistently reached higher asymptotic performance limits than those relying on higher-level, pre-processed variables. At a fixed dataset size, these lower-level features improved classification accuracy by an average of 3.2% compared to higher-level inputs.
Increasing the particle multiplicity considered within each jet, up to 128 constituent particles, raised the achievable performance ceiling, suggesting that capturing more detailed jet substructure is vital for continued progress. Once compute was scaled to 2.5 × 10^14 floating point operations, the models approached the asymptotic performance limit consistently, indicating that they were nearing their maximum potential given the dataset and architecture.
The scaling exponents varied between 0.07 and 0.11, depending on the specific model configuration and input features used. By systematically varying model size and training data, researchers derived compute-optimal scaling laws, providing a quantitative framework for predicting performance gains and allocating resources efficiently. For instance, a model with 1 billion parameters, trained on 50 million jets, achieved a validation loss of 0.21, while a model with 3 billion parameters, trained on 100 million jets, reached the aforementioned limit of 0.185.
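A saturating power law of the form L(C) = L_inf + a·C^(-alpha) is the standard way to express this behaviour, with L_inf the irreducible loss and alpha the scaling exponent. The sketch below fits that form with SciPy to synthetic (compute, loss) points generated from assumed parameters; it illustrates the fitting procedure only, not the paper's actual measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_vs_compute(C, L_inf, a, alpha):
    """Saturating power law: validation loss as a function of training compute."""
    return L_inf + a * C ** (-alpha)

# Synthetic measurements standing in for real training runs (assumed parameters).
rng = np.random.default_rng(0)
true_params = (0.185, 0.10, 0.09)           # irreducible loss, amplitude, exponent
compute = np.logspace(12, 14.4, 8)          # compute budgets up to ~2.5e14 FLOP
loss = loss_vs_compute(compute, *true_params) + rng.normal(0, 1e-4, compute.size)

# Fit and recover the irreducible loss and scaling exponent.
popt, _ = curve_fit(loss_vs_compute, compute, loss, p0=(0.2, 0.1, 0.1))
L_inf, a, alpha = popt
print(f"irreducible loss ~ {L_inf:.3f}, scaling exponent ~ {alpha:.2f}")
```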
Utilising raw particle-level information consistently raised the achievable performance limit. Now, with a deeper understanding of these scaling laws, future HEP machine learning efforts can be strategically guided, optimising both data and model size to maximise performance within budgetary constraints.
Jet Physics Model Training Using the JetClass Dataset and Data Repetition
Data preparation began with the publicly available JetClass dataset, a resource specifically designed for deep learning applications within jet physics. This dataset, containing detailed information about simulated particle collisions, served as the foundation for training and evaluating the neural network models.
Because simulation data is limited, a common challenge in high energy physics, data repetition techniques were employed, effectively increasing the dataset size and allowing a quantifiable assessment of its impact on model performance. Careful consideration was also given to the inputs provided to the models: researchers systematically varied the input features, comparing the performance achieved with more expressive, lower-level features against traditional, higher-level descriptors.
This exploration aimed to determine whether richer input data could raise the asymptotic performance limit of the models, even at a fixed dataset size. By meticulously controlling these variables, the study sought to isolate the effects of compute and data on model accuracy. The neural networks themselves were constructed using a set transformer architecture, a permutation-invariant model particularly well-suited for handling the unordered nature of particle interactions.
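For orientation, the sketch below shows what a minimal permutation-invariant Transformer encoder over particle sets might look like in PyTorch. It omits positional encodings so that attention treats the constituents as an unordered set, and it pools with a masked mean; the layer sizes, dropout rate and ten-way output (matching JetClass's ten jet categories) are illustrative choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class JetSetEncoder(nn.Module):
    """Minimal permutation-invariant encoder for jets as particle sets.
    Not the authors' exact model; all dimensions are illustrative."""

    def __init__(self, n_features=21, d_model=128, n_heads=8, n_layers=4, n_classes=10):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, particles, mask):
        # particles: (batch, n_particles, n_features); mask: True for real particles.
        x = self.embed(particles)
        # No positional encoding: attention acts on an unordered particle set.
        x = self.encoder(x, src_key_padding_mask=~mask)
        # Masked mean pooling gives a permutation-invariant jet representation.
        x = (x * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True).clamp(min=1)
        return self.classifier(x)

model = JetSetEncoder()
parts = torch.randn(4, 128, 21)                                     # toy batch of 4 jets
mask = torch.arange(128)[None, :] < torch.tensor([60, 90, 128, 30])[:, None]
logits = model(parts, mask)
print(logits.shape)  # torch.Size([4, 10])
```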
This choice was driven by the need to effectively process particle cloud data, where the order of particles does not affect the underlying physics. Techniques like decoupled weight decay regularisation and dropout were integrated into the training process to prevent overfitting and improve generalisation. These methods, borrowed from established machine learning practice, were adapted to the specific demands of the jet classification task.
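Continuing the sketch above, decoupled weight decay is what torch.optim.AdamW implements, and dropout already sits inside the encoder layers; the learning rate, decay strength and cosine schedule below are illustrative values rather than the paper's settings.

```python
import torch

# Decoupled weight decay regularisation via AdamW; dropout lives inside the
# encoder layers defined above. All hyperparameter values are illustrative.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
loss_fn = torch.nn.CrossEntropyLoss()

# One illustrative optimisation step on the toy batch from the previous sketch.
optimizer.zero_grad()
labels = torch.randint(0, 10, (4,))          # random stand-in class labels
loss = loss_fn(model(parts, mask), labels)
loss.backward()
optimizer.step()
scheduler.step()
```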
Inside the training loop, models were subjected to rigorous evaluation across multiple trials, with performance metrics carefully tracked to assess the impact of scaling. By systematically increasing both model capacity and dataset size, the research team aimed to derive compute-optimal scaling laws, revealing the relationship between computational resources and achievable performance. This detailed methodology allowed for a precise understanding of how scaling affects the ability to classify boosted jets, a critical task in modern particle physics.
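A compute-optimal analysis of this kind can be pictured as a grid of training runs over model and dataset sizes, each tagged with an estimated compute budget and its validation loss. The sketch below uses a toy stand-in for the training runs and the common C ≈ 6·N·D rule of thumb for compute; both are assumptions for illustration, not the paper's exact accounting.

```python
import itertools

def train_and_evaluate(n_params: float, n_jets: float) -> float:
    """Toy stand-in for a full training run: returns a placeholder loss that
    shrinks with both model and dataset size (not the paper's fitted law)."""
    return 0.185 + 0.3 * n_params ** -0.08 + 0.5 * n_jets ** -0.10

model_sizes = [1e6, 1e7, 1e8]        # parameter counts (illustrative)
dataset_sizes = [1e7, 5e7, 1e8]      # numbers of training jets (illustrative)

runs = []
for n_params, n_jets in itertools.product(model_sizes, dataset_sizes):
    compute = 6 * n_params * n_jets  # rough C ~ 6*N*D estimate (assumption)
    runs.append((compute, n_params, n_jets, train_and_evaluate(n_params, n_jets)))

# At each compute budget, the lowest-loss configuration traces the
# compute-optimal frontier that the scaling-law fit is built on.
for compute, n_params, n_jets, loss in sorted(runs):
    print(f"C~{compute:.1e}  N={n_params:.0e}  D={n_jets:.0e}  loss={loss:.3f}")
```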
Scaling machine learning improves particle jet identification in high-energy physics
Scientists are beginning to apply the lessons of artificial intelligence’s rapid progress to a field where data abounds but computational power has lagged: high-energy physics. For years, particle physicists have relied on sophisticated algorithms to sift through the debris of proton collisions, seeking evidence of new particles and testing the limits of our understanding.
These analyses have been constrained not by a lack of data, but by a lack of computing resources comparable to those driving advances in areas like image recognition and natural language processing. This work demonstrates a clear path toward overcoming that barrier. Researchers have shown that increasing the scale of machine learning models, both their size and the amount of data they are trained on, yields predictable improvements in identifying particle jets, sprays of energy created by colliding particles.
Establishing this scaling law within the specific context of particle physics is a vital step. It confirms that the field isn’t being held back by fundamental limitations in the algorithms themselves, but rather by the availability of sufficient computational resources. The ability to reliably predict performance gains with increased compute allows physicists to make informed decisions about where to invest their limited resources.
More expressive input features, detailing the internal structure of these jets, can further enhance performance, offering a route to extract more information from existing data. However, a key limitation remains the reliance on simulated data, which, while plentiful, is inherently imperfect and introduces biases. As these simulations improve, a natural next step is the development of foundation models pre-trained on vast datasets of both simulated and real collision data.
These models could then be adapted to a variety of tasks, accelerating discovery across the entire field. Beyond this, the techniques developed here could find application in other data-intensive scientific domains, where the challenge of extracting meaningful signals from noise is ever-present and the demand for computational power continues to grow.
👉 More information
🗞 Neural Scaling Laws for Boosted Jet Tagging
🧠 ArXiv: https://arxiv.org/abs/2602.15781
