Researchers are addressing the computational limitations of scattering transforms, powerful tools for perceptual quality assessment in audio and computer vision. Christopher Mitcheltree and Emmanouil Benetos of the Centre for Digital Music, Queen Mary University of London, UK, together with Vincent Lostanlen and Mathieu Lagrange of Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, France, present SCRAPL (Scattering Transform with Random Paths), a novel approach developed in collaboration between the two institutions. This stochastic optimisation scheme significantly accelerates the evaluation of multivariable scattering transforms, making them practical inside neural network training loops. By implementing SCRAPL for the joint time-frequency scattering transform and demonstrating its effectiveness in differentiable digital signal processing and sound-matching tasks, the team makes a valuable contribution to differentiable audio processing and opens avenues for improved performance in machine learning applications requiring robust perceptual characterisation.
The core challenge addressed by this work is the computational expense of scattering transforms, mathematical tools that effectively assess perceptual quality but demand significant processing power when used inside training algorithms. SCRAPL approximates multivariable scattering transforms by randomly sampling a subset of their constituent ‘paths’, the low-resolution coefficients that define the transform’s output, enabling efficient evaluation. This approach reduces computational demands during stochastic gradient descent by approximating the gradient of the scattering transform loss. The research focuses on the joint time-frequency scattering transform (JTFS), a technique adept at characterising complex auditory textures by analysing spectrotemporal patterns across multiple scales; implementing SCRAPL with JTFS allows a finer analysis of intermittent sounds. To further refine this stochastic approximation, the team designed a modified Adam optimizer, designated P-Adam, which maintains separate moment estimates for each path and adapts the smoothing time constant based on the recency of the last draw of that path. This contrasts with standard Adam, where the smoothing is ineffective because path-wise gradients are not identically distributed. The exponents within P-Adam are adjusted to reflect the number of paths, ensuring that evolving gradient information is captured accurately. Complementing P-Adam, the P-SAGA algorithm was developed as an accelerated stochastic average gradient method, leveraging a memory of previous path-wise updates for a more informed gradient estimate. Unlike conventional SAG and SAGA algorithms, the memory footprint of P-SAGA scales with the number of paths rather than the dataset size, making it practical for neural network training.
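The per-path bookkeeping described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name, the use of scalar parameters, and the choice to stand in for the recency adjustment with a per-path update counter are all assumptions made here for illustration; the paper's actual correction of the smoothing exponents may differ.

```python
import numpy as np

class PAdamSketch:
    """Hypothetical sketch of a P-Adam-style update: Adam with separate
    moment estimates per scattering path, with bias correction driven by
    how many times each path has actually been drawn."""

    def __init__(self, n_paths, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(n_paths)         # first-moment estimate, one per path
        self.v = np.zeros(n_paths)         # second-moment estimate, one per path
        self.t = np.zeros(n_paths, int)    # per-path draw counter

    def step(self, path, grad, param):
        """Update a scalar `param` using the gradient of one drawn path."""
        self.t[path] += 1
        t = self.t[path]
        # Standard Adam recursions, but kept separately for each path so that
        # non-identically distributed path-wise gradients do not mix.
        self.m[path] = self.beta1 * self.m[path] + (1 - self.beta1) * grad
        self.v[path] = self.beta2 * self.v[path] + (1 - self.beta2) * grad**2
        # Bias correction uses the per-path counter, a simplified stand-in for
        # the recency-aware exponent adjustment described in the paper.
        m_hat = self.m[path] / (1 - self.beta1**t)
        v_hat = self.v[path] / (1 - self.beta2**t)
        return param - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```

The key design point is that a path drawn only rarely still receives a well-calibrated bias correction, because its counter advances only when it is actually sampled.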
Researchers demonstrated SCRAPL’s effectiveness through applications in differentiable digital signal processing (DDSP), training a neural network to match sounds produced by a granular synthesizer and the Roland TR-808 drum machine. This unsupervised sound-matching task highlights SCRAPL’s potential to learn complex audio representations without labelled data. An initialisation heuristic based on importance sampling was introduced, tailoring the SCRAPL algorithm to the specific characteristics of the dataset and improving both learning speed and final performance. The code and audio samples generated during the study are publicly available as a Python tool. Initial results from the granular synthesizer sound-matching task show that SCRAPL achieves a θ-synth L1 error of 65.7 ± 4.2, alongside a validation total variation of 3.27 ± 0.12 and convergence after 6014 ± 642 optimisation steps. These figures mark a substantial improvement over prior methods, notably outperforming MSS Linear (370 ± 0.52), MSS Log + Linear (259 ± 1.7), and MS-CLAP (166 ± 8.2). SCRAPL’s performance approaches that of the full joint time-frequency scattering transform (JTFS), which yielded a θ-synth L1 error of 42.4, while significantly reducing computational demands. Incorporating P-Adam, P-SAGA, and θ-importance sampling (θ-IS) produced a monotonic improvement in accuracy and convergence time, with θ-IS proving particularly beneficial for the balanced convergence of synthesizer parameters. Specifically, θ-IS yielded an overall θ-synth accuracy exceeding that achieved with uniform sampling, despite a slight decrease in θ-density performance. The study also examined performance on a chirplet synthesizer, where θ-IS improved the prediction of θAM by 25.55% and θFM by 14.80%, while reducing the time to convergence by 23.50%.
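The importance-sampling idea can be sketched in a few lines. This is a generic illustration, not the paper's heuristic: the placeholder scores and the `draw_path` helper are assumptions made here, and the paper derives its per-path importance from the dataset itself, which is not reproduced below.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_path(scores):
    """Draw one scattering-path index with probability proportional to an
    importance score, returning the 1/(N * p_i) correction factor that keeps
    the resulting gradient estimate unbiased."""
    p = scores / scores.sum()
    i = rng.choice(len(p), p=p)
    correction = 1.0 / (len(p) * p[i])  # reweights the drawn path's gradient
    return i, correction

# Placeholder scores: paths whose coefficients matter more for a given
# dataset are drawn proportionally more often.
scores = np.array([0.1, 0.5, 1.0, 2.0])
idx, weight = draw_path(scores)
```

Under uniform sampling every correction factor is 1; with non-uniform scores, frequently drawn paths are down-weighted and rare ones up-weighted, so the expected gradient is unchanged while its variance drops for the paths that dominate the loss.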
These results indicate that SCRAPL maintains manageable gradient variance during deep neural network training, offering a favourable balance between computational speed and convergence rate compared to full-tree scattering. The persistent challenge of enabling machines to perceive sound as meaningful texture and structure has long been hampered by computational bottlenecks. SCRAPL offers a clever sidestep, demonstrating that intelligent sampling of complex audio features can dramatically reduce the processing burden without sacrificing perceptual accuracy. This is not merely an incremental improvement in signal processing; it unlocks the potential for more sophisticated, real-time audio analysis in applications ranging from music production and sound design to environmental monitoring and medical diagnostics. However, the reliance on importance sampling introduces a potential bias, and its generalisability to vastly different soundscapes remains an open question. Furthermore, the method still depends on carefully tuned hyperparameters, and the optimal balance between computational efficiency and perceptual fidelity requires further investigation. The next step will likely involve exploring adaptive sampling strategies, where the algorithm learns to select the most informative paths on the fly, and scaling this approach to even more complex and nuanced audio environments.
👉 More information
🗞 SCRAPL: Scattering Transform with Random Paths for Machine Learning
🧠 ArXiv: https://arxiv.org/abs/2602.11145
