Scientists are increasingly interested in understanding how neural networks learn to perform complex, structured computations over sequences. Giovanni Luca Marchetti from KTH Royal Institute of Technology, Daniel Kunin from UC Berkeley, and Adele Myers from UC Santa Barbara, together with co-authors including Acosta and Miolane, have introduced a novel task, sequential group composition, to investigate this phenomenon. The researchers tasked networks with predicting the cumulative product of elements from a finite group, revealing how architectures learn order-sensitive, nonlinear relationships. The work is significant because it demonstrates that deeper models can exploit the task's inherent mathematical structure, specifically associativity, to achieve substantially better scaling than shallower networks, offering a tractable means to examine the mechanics of deep learning.
Learning arithmetic with sequential data reveals benefits of network depth for generalization and systematic compositionality
Scientists have uncovered how neural networks learn to perform structured operations like arithmetic and algorithmic computation through a novel investigation of sequential data processing. This work introduces the sequential group composition task, a method for examining how networks learn to map a sequence of group elements to their cumulative product.
The research demonstrates that networks can learn this task, but the efficiency of learning is heavily influenced by both the network’s architecture and the underlying structure of the group itself. Crucially, the study reveals that deeper networks significantly outperform shallower ones by exploiting the associative properties of the task, dramatically improving scaling efficiency.
The sequential group composition task requires a network to predict the cumulative product of a sequence of elements drawn from a finite group, each element encoded as a real vector. Researchers proved that two-layer networks learn this task by progressively identifying irreducible representations of the group, guided by the Fourier statistics of the encoding.
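To make the setup concrete, here is a minimal sketch of how such training data could be generated. It assumes the cyclic group Z_n with one-hot encodings purely for illustration; the task itself is defined for any finite group and any real encoding vector, and the helper name make_batch is not from the paper.

```python
import numpy as np

def make_batch(n, k, batch_size, rng):
    """Sample sequences of k elements from the cyclic group Z_n and return
    (flattened one-hot encodings, one-hot cumulative products).

    Assumes Z_n with one-hot encodings for illustration only; the paper's task
    is defined for any finite group and any real encoding vector."""
    # Each row is a sequence g = (g_1, ..., g_k) of group elements.
    g = rng.integers(0, n, size=(batch_size, k))
    # The cumulative "product" in Z_n is addition mod n.
    target = g.sum(axis=1) % n
    # One-hot encode each element, then flatten to a (k * n)-dimensional input.
    eye = np.eye(n)
    x = eye[g].reshape(batch_size, k * n)
    y = eye[target]
    return x, y

rng = np.random.default_rng(0)
x, y = make_batch(n=5, k=3, batch_size=4, rng=rng)
print(x.shape, y.shape)  # (4, 15) (4, 5)
```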
While these networks can achieve perfect performance, they require a hidden width that increases exponentially with the sequence length. This limitation highlights a fundamental challenge in scaling neural networks to handle longer sequences and more complex computations. In contrast, deeper architectures offer a solution to this scaling problem.
Recurrent neural networks compose elements sequentially, completing the task in a number of steps proportional to the sequence length. Multilayer networks, however, achieve even greater efficiency by composing adjacent pairs of elements in parallel, so the number of layers required grows only logarithmically with the sequence length.
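The gap between the two strategies can be illustrated directly. The sketch below contrasts an O(k)-step sequential fold with a log-depth pairwise reduction; it operates on raw Z_n elements rather than learned hidden representations, so it shows only the circuit structure that depth makes available, not the learned networks themselves.

```python
def compose_sequential(seq, n):
    """RNN-style composition: fold in one element per step, O(k) steps."""
    acc = 0  # identity element of Z_n
    for g in seq:
        acc = (acc + g) % n
    return acc

def compose_parallel(seq, n):
    """Tree-style composition: combine adjacent pairs in parallel.
    Associativity guarantees the same result in O(log k) layers."""
    layer = list(seq)
    while len(layer) > 1:
        if len(layer) % 2 == 1:   # odd length: pad with the identity element
            layer.append(0)
        layer = [(layer[i] + layer[i + 1]) % n
                 for i in range(0, len(layer), 2)]
    return layer[0]

seq = [3, 4, 2, 1, 4]
assert compose_sequential(seq, 5) == compose_parallel(seq, 5)
```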
This parallel processing capability demonstrates how architectural design can dramatically impact computational efficiency. The study establishes that the sequential group composition task is both order-sensitive and nonlinear, necessitating a complex architecture for successful learning. Analysis of the task reveals a group-specific Fourier decomposition, allowing for a precise understanding of how features are learned during training. These findings position sequential group composition as a valuable tool for developing a mathematical theory of deep learning, offering insights into how networks learn from sequential data and paving the way for more efficient and powerful architectures.
Network architecture and learning dynamics for sequential group composition are critical for robust generalization
A two-layer quadratic multilayer perceptron (MLP) serves as the primary tool for investigating how neural networks learn sequential group composition. The network receives a sequence of elements from a finite group, encoded as a real vector x_g of dimension k|G|, and predicts their cumulative product. The output of the network is computed as f(x_g; Θ) = W_out σ(W_in x_g), where W_in, of size H × k|G|, embeds the input sequence into a hidden representation, σ is an element-wise monic polynomial of degree k, and W_out, of size |G| × H, unembeds the hidden representation.
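A minimal sketch of this forward pass, taking σ(z) = z^k as the simplest monic degree-k polynomial (the quadratic case when k = 2) and reusing the hypothetical make_batch helper from the earlier sketch:

```python
import numpy as np

def forward(x, W_in, W_out, k):
    """Two-layer network f(x; Theta) = W_out sigma(W_in x), with sigma taken
    as the element-wise monic polynomial z -> z**k (z**2 in the quadratic case).

    x     : (batch, k*|G|) flattened sequence encodings
    W_in  : (H, k*|G|)     embedding weights
    W_out : (|G|, H)       unembedding weights
    """
    h = x @ W_in.T          # (batch, H) hidden pre-activations
    a = h ** k              # element-wise monic polynomial of degree k
    return a @ W_out.T      # (batch, |G|) output

# alpha plays the role of the small initialization scale N(0, alpha**2).
G_size, k, H, alpha = 5, 2, 64, 1e-3
rng = np.random.default_rng(0)
W_in = alpha * rng.standard_normal((H, k * G_size))
W_out = alpha * rng.standard_normal((G_size, H))
x, _ = make_batch(n=G_size, k=k, batch_size=4, rng=rng)  # from the earlier sketch
print(forward(x, W_in, W_out, k).shape)  # (4, 5)
```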
This computation can also be expressed as a sum over the H hidden neurons, each contributing f_i(x_g; θ_i) = w_i σ(Σ_{j=1}^{k} ⟨u^i_j, x_{g_j}⟩), where u^i and w_i denote the input and output weights of the i-th neuron, respectively. Parameters are initialized from a normal distribution N(0, α²) and evolve under a time-rescaled gradient flow, dθ_i/dt = −η_{θ_i} ∇_{θ_i} L(Θ), with a neuron-dependent learning rate η_{θ_i} = ∥θ_i∥^{1−k} log(1/α).
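A discretized sketch of one step of this dynamic for the quadratic case (k = 2), reading the text's rescaling as η_{θ_i} = ∥θ_i∥^{1−k} log(1/α) and using a mean-squared-error loss; the Euler discretization and step size dt are illustrative choices, not part of the paper's continuous-time analysis.

```python
import numpy as np

def gradient_step(x, y, W_in, W_out, alpha, dt=1e-2):
    """One Euler step of the time-rescaled gradient flow for the quadratic
    case k = 2, with per-neuron rate eta_i = ||theta_i||**(1 - k) * log(1/alpha).
    Loss is L = (1 / 2B) * sum ||f(x; Theta) - y||**2. Illustrative only."""
    B = x.shape[0]
    h = x @ W_in.T                              # (B, H) hidden pre-activations
    a = h ** 2                                  # quadratic activation
    f = a @ W_out.T                             # (B, |G|) predictions
    e = f - y                                   # residuals
    g_out = e.T @ a / B                         # dL/dW_out, shape (|G|, H)
    g_in = (2 * h * (e @ W_out)).T @ x / B      # dL/dW_in,  shape (H, k|G|)
    k = 2
    for i in range(W_in.shape[0]):
        # theta_i concatenates the i-th row of W_in and the i-th column of W_out.
        theta_norm = np.sqrt(np.sum(W_in[i] ** 2) + np.sum(W_out[:, i] ** 2))
        eta = theta_norm ** (1 - k) * np.log(1 / alpha)
        W_in[i] -= dt * eta * g_in[i]
        W_out[:, i] -= dt * eta * g_out[:, i]
    return W_in, W_out
```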
This vanishing initialization regime allows for the application of Alternating Gradient Flows (AGF), a framework describing gradient dynamics in two-layer networks. AGF posits that hidden neurons exist in either a dormant state, with negligible influence on the output, or an active state, directly shaping the output.
The learning process unfolds in two alternating phases. During utility maximization, dormant neurons compete to align with informative directions in the data, maximizing U(θ_i) = (1/|G|^k) Σ_{g∈G^k} ⟨f(x_g; θ_i), x_{g_{1:k}} − f(x_g; Θ_A)⟩, where x_{g_{1:k}} denotes the encoding of the cumulative product g_1⋯g_k, subject to the constraint ∥θ_i∥ = 1. Subsequently, cost minimization occurs as active neurons collaborate to minimize the loss L(Θ_A) while maintaining a norm ∥Θ_A∥ ≥ 0.
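A sketch of how this utility could be evaluated by brute force for a single dormant neuron, again assuming the illustrative Z_n one-hot setup and σ(z) = z^k; the callable active_predict standing in for f(·; Θ_A) is a hypothetical placeholder, and enumerating all |G|^k sequences is only feasible for small groups and short sequences.

```python
import itertools
import numpy as np

def utility(u_i, w_i, active_predict, n, k):
    """Average alignment between one dormant neuron's output and the residual
    target left unexplained by the active network, enumerated over all |G|^k
    sequences of the cyclic group Z_n (feasible only for small n and k).

    u_i            : (k*n,) input weights of the neuron
    w_i            : (n,)   output weights of the neuron
    active_predict : callable mapping a flattened encoding x_g to the active
                     network's prediction f(x_g; Theta_A) -- hypothetical stub
    """
    eye = np.eye(n)
    total = 0.0
    for g in itertools.product(range(n), repeat=k):
        x = eye[list(g)].reshape(-1)      # flattened one-hot encoding of g
        target = eye[sum(g) % n]          # encoding of the cumulative product
        f_i = w_i * (u_i @ x) ** k        # per-neuron output with sigma(z) = z**k
        total += f_i @ (target - active_predict(x))
    return total / n ** k
```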
This iterative process generates characteristic staircase-like loss curves, with plateaus indicating utility maximization and drops signifying cost minimization. Analysis reveals that irreducible representations (irreps) of the group are learned sequentially, in an order dictated by the Fourier statistics of the input encoding vector x, as demonstrated in Figure 3 of the paper.
The study assumes the input is mean-centered, x̂[ρ_triv] = ⟨x, 1⟩ = 0, and that for every irrep ρ, x̂[ρ] is either invertible or zero, ensuring non-degeneracy and separation of the Fourier coefficients. The per-neuron function is decomposed into two terms, f(x_g; θ_i)^(×) and f(x_g; θ_i)^(+), to facilitate analysis of interactions among inputs.
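For the cyclic group Z_n, the irreps are one-dimensional characters and the group Fourier transform of the encoding reduces to an ordinary DFT, so these conditions can be checked directly; for non-abelian groups the coefficients x̂[ρ] are matrices and the invertible-or-zero condition applies to those matrices. A minimal sketch under that cyclic-group assumption:

```python
import numpy as np

def group_fourier_coeffs(encoding):
    """Group Fourier coefficients of an encoding of Z_n.
    Row m of `encoding` is the real vector assigned to group element m;
    coefficient j corresponds to the character rho_j(m) = exp(-2*pi*1j*j*m/n)."""
    n = encoding.shape[0]
    return np.fft.fft(encoding, axis=0) / n

rng = np.random.default_rng(0)
enc = rng.standard_normal((5, 5))
enc -= enc.mean(axis=0)                   # mean-center: enforces <x, 1> = 0
coeffs = group_fourier_coeffs(enc)
print(np.allclose(coeffs[0], 0))          # True: the trivial-irrep coefficient vanishes
```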
Exponential scaling of hidden width limits two-layer network compositional capacity, despite sufficient overall parameters
Two-layer networks learning the sequential group composition task require a hidden width scaling exponentially with sequence length, specifically O(exp k). Recurrent neural networks, in contrast, compose elements sequentially in only O(k) steps. Multilayer networks efficiently combine elements in parallel, achieving composition in O(log k) layers.
This research demonstrates universality in feature-learning dynamics alongside diversity in how efficiently different architectures exploit the associativity of the task. The study establishes that the sequential group composition task is order-sensitive and nonlinear, precluding solutions by deep linear networks because nonlinear interactions between inputs are necessary.
A group-specific Fourier decomposition enables precise analysis of learning within two-layer networks, revealing how the group Fourier statistics of the encoding vector dictate learned features and their order of acquisition. This decomposition provides a tractable framework for understanding how neural networks learn from sequences.
Experiments reveal that networks progressively decompose the task into irreducible representations of the group, learning these components in a greedy order determined by the encoding vector. Different architectures realize this process distinctly, with two-layer networks attempting to compose all k elements simultaneously.
The work demonstrates that deep networks identify efficient solutions by exploiting associativity to compose intermediate representations, even as the number of possible inputs grows exponentially with sequence length. These results position sequential group composition as a principled lens for developing a mathematical theory of learning in sequential data.
The findings extend insights from studies of Fourier features in modular addition and binary group composition to sequential group composition, deriving the order of feature acquisition from the training dynamics rather than from empirical inspection alone. The analysis builds on the Alternating Gradient Flow framework, showing that networks acquire group Fourier features in a greedy order determined by their importance.
Learning structured computation through sequential group composition offers a tractable window into the mechanics of deep learning
Researchers investigated how neural networks learn to perform structured computations like arithmetic and algorithmic operations by introducing the sequential group composition task. This task requires a network to predict the cumulative product of a sequence of elements drawn from a finite group, demanding order-sensitivity and nonlinear processing.
Analysis reveals that learning is shaped by the group's structure, the statistics of the input encoding, and the length of the sequence. Specifically, two-layer networks learn this task by sequentially acquiring irreducible representations of the group, guided by the Fourier statistics of the encoding. While these networks can solve the task perfectly, the hidden width they require scales exponentially with sequence length.
Deeper networks, however, overcome this limitation by leveraging the associative property of the task, composing elements either sequentially with recurrent architectures or in parallel with multilayer designs. This work offers a simplified model for understanding the mechanics of learning in neural networks.
The findings demonstrate a clear relationship between network depth and the ability to efficiently learn compositional operations. The sequential group composition task provides a tractable framework for isolating and studying the factors influencing learning. Limitations acknowledged by the authors include the specific assumptions made regarding the input data and the focus on a particular task structure. Future research could explore the generalizability of these findings to more complex tasks and investigate the role of different architectural choices in facilitating efficient learning of sequential operations.
👉 More information
🗞 Sequential Group Composition: A Window into the Mechanics of Deep Learning
🧠 ArXiv: https://arxiv.org/abs/2602.03655
