Neural Networks Demonstrate Bayesian Uncertainty Tracking Via Implicit EM with Distances

Neural networks routinely exhibit behaviours reminiscent of probabilistic inference, such as grouping data, specialising in particular patterns, and quantifying uncertainty, yet the underlying reasons for these abilities have remained unclear. Alan Oursland demonstrates a fundamental connection between the way these networks learn and the well-established process of expectation-maximization, a powerful technique for building probabilistic models. This work reveals that when a neural network minimises errors based on distances or energies, it simultaneously performs an implicit form of probabilistic inference, calculating ‘responsibilities’ not as separate steps, but directly through the optimisation process itself. This discovery unifies diverse learning approaches, from unsupervised data clustering to supervised classification, and explains the inherent Bayesian structure observed in advanced models like transformers, suggesting this isn’t an accidental outcome but a direct result of the network’s learning objective.

The interpretation follows from a direct derivation. For any objective with log-sum-exp structure over distances or energies, the gradient with respect to each distance is exactly the negative posterior responsibility of the corresponding component, expressed as ∂L/∂dj = −rj. This is an algebraic identity, not an approximation, and it has significant implications for optimisation strategies. Gradient descent on such objectives implicitly performs expectation-maximization, where responsibilities are not auxiliary variables requiring separate computation, but gradients applied directly. This eliminates the need for an explicit inference algorithm: inference is embedded within the optimisation process itself. The result unifies three regimes of learning under a single mechanism: unsupervised mixture models, query-conditioned attention, and cross-entropy classification.
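For readers who want to see where the identity comes from, the one-step derivation is sketched below in the notation above; it is nothing more than the chain rule applied to the stated objective, with rj written out as the softmax of the negative distances.

```latex
L = \log \sum_{j} e^{-d_j}
\quad\Longrightarrow\quad
\frac{\partial L}{\partial d_j}
  = \frac{-\,e^{-d_j}}{\sum_{k} e^{-d_k}}
  = -\,r_j ,
\qquad
r_j \;\equiv\; \frac{e^{-d_j}}{\sum_{k} e^{-d_k}} .
```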

Neural Networks Embody Implicit Expectation-Maximization

The central claim is that many neural networks, particularly those trained with certain loss functions, are implicitly performing Expectation-Maximization (EM). EM is a statistical algorithm used to find maximum likelihood estimates in the presence of latent variables. The key is that the network is trained using a loss function based on distances between data points and learned components, with the loss function taking the form L = log Σj exp(−dj), where dj is the distance to component j. The authors prove a crucial mathematical identity, ∂L/∂dj = −rj: the gradient of the loss function with respect to the distance dj is equal to the negative responsibility rj.
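A minimal numerical check of this identity is sketched below, assuming PyTorch; the component count and the distance values are illustrative choices, not anything taken from the paper.

```python
import torch

# Three illustrative distances d_j with gradient tracking enabled.
d = torch.tensor([0.3, 1.7, 0.9], requires_grad=True)

# L = log sum_j exp(-d_j), the log-sum-exp objective described above.
L = torch.logsumexp(-d, dim=0)
L.backward()

# Posterior responsibilities r_j = exp(-d_j) / sum_k exp(-d_k).
r = torch.softmax(-d.detach(), dim=0)

print(d.grad)  # gradient of L with respect to each distance
print(-r)      # negative responsibilities: identical to d.grad
```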

In EM, the responsibility of a component for a given data point represents how much that component explains the data. Here, the network doesn’t explicitly calculate responsibilities; they are inherent in the gradients used to update the network’s weights. Because of this, the forward pass acts like the Expectation (E) step in EM, and the backward pass acts like the Maximization (M) step, simultaneously inferring latent structure and optimising parameters. The network learns by adjusting its internal representations to minimize distances to learned prototypes, and the way it adjusts those representations is the process of inferring which prototypes are most relevant to each data point.
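A hedged sketch of that picture follows, assuming squared Euclidean distances to a few learnable prototypes; the toy data, prototype count, and learning rate are illustrative choices rather than anything specified in the paper.

```python
import torch

torch.manual_seed(0)
x = torch.randn(64, 2)                       # toy data points
mu = torch.randn(3, 2, requires_grad=True)   # three learnable prototypes

opt = torch.optim.SGD([mu], lr=0.1)
for step in range(200):
    # Squared Euclidean distance from every point to every prototype: shape (64, 3).
    d = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(dim=-1)
    # Negative log-mixture-likelihood with the log-sum-exp structure described above.
    loss = -torch.logsumexp(-d, dim=1).mean()
    opt.zero_grad()
    loss.backward()   # gradients w.r.t. the distances carry the responsibilities (implicit E-step)
    opt.step()        # prototype update weighted by those responsibilities (implicit M-step)
```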

This framework unifies several areas of machine learning: networks trained with the right distance-based loss achieve the same result as traditional Gaussian Mixture Models (GMMs), but without explicitly coding the EM algorithm. Attention mechanisms, crucial in modern NLP, can be seen as a weighted mixture of experts, with attention weights emerging naturally from the optimisation process, as illustrated in the sketch below. Even standard cross-entropy classification can be viewed as a special case of implicit EM, where the network learns to assign full responsibility to a single category. This research offers a new perspective on loss functions, demonstrating that they are not arbitrary choices but geometric priors encoding assumptions about data relationships.
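The attention reading can be made concrete with a standard scaled dot-product layer; the shapes below are arbitrary, and the only point illustrated is the article's framing that the softmax weights play the role of query-conditioned responsibilities over the values.

```python
import torch

q = torch.randn(1, 8)    # one query
K = torch.randn(5, 8)    # five keys, one per "component"
V = torch.randn(5, 8)    # the corresponding values

# Similarities act as negative distances / energies in the log-sum-exp picture.
scores = (q @ K.T) / 8 ** 0.5
weights = torch.softmax(scores, dim=-1)   # attention weights = responsibilities
out = weights @ V                         # responsibility-weighted mixture of values
```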

The assignment of data points to components is directly visible in the gradients during training, offering improved interpretability. Log-sum-exp (LSE) is essential for inducing competition between components, leading to responsibility weighting and inference-like behaviour. The authors identify areas for future research, including addressing volume control issues, extending the framework to handle noisy or incomplete labels, and developing objectives that allow the network to reject inputs that don’t fit known categories. Developing diagnostic tools to measure implicit EM behaviour and investigating the connection to generalization, scaling laws, and emergent abilities remain open questions. In essence, this paper proposes a fundamental shift in how we think about neural network training, suggesting that many networks are not just learning to predict, but implicitly performing a form of probabilistic inference.

Gradient Descent as Expectation-Maximization in Networks

Scientists have demonstrated a fundamental connection between standard neural network training and probabilistic inference, revealing that gradient descent implicitly performs expectation-maximization. The work establishes a direct algebraic identity, showing that for any objective function with a specific log-sum-exp structure, the gradient with respect to each distance is precisely the negative posterior responsibility of the corresponding component. This finding is not an approximation, but a mathematical certainty, holding true across diverse neural network architectures including classification heads and energy-based models. The analysis shows that this relationship unifies several learning regimes under a single framework, encompassing unsupervised mixture modeling, query-conditioned learning, and cross-entropy classification.
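For the classification case, the familiar gradient of cross-entropy makes the same point: treating logits as negative energies, the gradient with respect to each logit is the model's responsibility minus a one-hot target that assigns full responsibility to the label. A small check, assuming PyTorch; the class count and logit values are arbitrary.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, requires_grad=True)   # logits = negative energies for 4 classes
target = torch.tensor(2)                      # the observed label

loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
loss.backward()

p = torch.softmax(logits.detach(), dim=0)           # model responsibilities
one_hot = F.one_hot(target, num_classes=4).float()  # full responsibility on the label
print(logits.grad)   # equals p - one_hot
print(p - one_hot)
```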

This work delivers a new understanding of how neural networks learn, demonstrating that the computation of responsibilities is embedded within the optimisation process itself. The research proves mathematically that responsibilities are not auxiliary variables needing separate computation, but are represented directly by the gradients during training. Specifically, the team derived that the gradient of the log-sum-exp objective with respect to a distance equals the negative responsibility, an exact algebraic result rather than an approximation. It follows that during training the forward pass implicitly determines responsibilities, mirroring the E-step, while backpropagation propagates those responsibilities as the learning signal, mirroring the M-step of expectation-maximization. This removes the need for a separate, explicit E-step, streamlining the learning process and demonstrating a fundamental equivalence between optimisation and probabilistic inference. The team's work establishes that the semantics assigned to network outputs, interpreting them as energies or distances, are crucial for understanding this inherent Bayesian structure.

Gradient Descent Is Expectation-Maximization

This research demonstrates a fundamental connection between the training of neural networks and the process of probabilistic inference, specifically expectation-maximization. Scientists have established that for objectives formulated as log-sum-exp functions over distances, the gradient calculation with respect to each distance precisely corresponds to the negative posterior responsibility of that component. This finding is an algebraic identity, meaning it holds true without approximation and reveals that gradient descent implicitly performs expectation-maximization, embedding inference directly within the optimisation process. The implications of this work are significant, as it unifies several seemingly disparate areas of machine learning under a single theoretical framework.

Unsupervised mixture learning, attention mechanisms in transformers, and cross-entropy classification are now understood as different manifestations of the same underlying dynamics, differing only in the nature of observed and latent variables. The observed Bayesian structure in transformer networks is not an unexpected outcome, but rather a necessary consequence of the training objectives employed, suggesting optimisation and inference are fundamentally the same process at different scales. The authors acknowledge that further diagnostic tools are needed to verify this mechanism in practice and to identify potential failures or limitations, and they propose measuring extracted responsibilities from gradients to track specialisation during training. Future work could focus on applying these tools to a wider range of models and datasets to assess the generality of these findings.
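One way such a diagnostic could look in practice is sketched below: recover per-example responsibilities from the gradient of the negative log-likelihood with respect to the distances, then summarise them with an entropy score. The entropy summary is an illustrative choice here, not a metric defined in the paper.

```python
import torch

def responsibilities_from_gradients(distances: torch.Tensor) -> torch.Tensor:
    """Read responsibilities off d(NLL)/dd, using the identity discussed above."""
    d = distances.detach().clone().requires_grad_(True)
    nll = -torch.logsumexp(-d, dim=1).sum()   # negative log-mixture-likelihood
    nll.backward()
    # The NLL flips the sign of dL/dd = -r, so the gradient equals +r = softmax(-d) row-wise.
    return d.grad.clone()

def mean_responsibility_entropy(r: torch.Tensor) -> torch.Tensor:
    """Lower entropy means components are specialising more sharply."""
    return -(r * (r + 1e-12).log()).sum(dim=1).mean()

# Example: a batch of 32 points against 4 components, with distances chosen at random.
d = torch.rand(32, 4) * 5.0
r = responsibilities_from_gradients(d)
print(mean_responsibility_entropy(r))
```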

👉 More information
🗞 Gradient Descent as Implicit EM in Distance-Based Neural Models
🧠 ArXiv: https://arxiv.org/abs/2512.24780

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
