Researchers are tackling the inherent limitations of both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in image analysis. Gijs Joppe Moens, Regina Beets-Tan, and Eduardo H. P. Pooch, from the Netherlands Cancer Institute and University of Amsterdam, present SONIC (Spectral Oriented Neural Invariant Convolutions), an approach that combines the strengths of both architectures by modelling convolutional operators with a continuous spectral parameterisation. The method creates global receptive fields and adaptable filters from a small set of shared components, offering improved robustness to distortions, noise, and resolution changes and, crucially, matching or exceeding the performance of existing methods with significantly fewer parameters. Their findings, demonstrated across diverse datasets including 3D medical imaging, suggest a principled and scalable alternative to traditional spatial and spectral operators.
The core contribution is a continuous spectral parameterisation that models convolutional operators using a small set of shared, orientation-selective components. This moves beyond the fixed-size kernels of CNNs, which struggle with global context, and the patch-based tokenisation of ViTs, which lacks inherent spatial inductive bias. SONIC bridges these gaps with a representation that is both structured and globally aware, promising more robust and efficient image processing.
The team achieved this breakthrough by focusing on the frequency domain, defining smooth responses across the full spectrum and yielding global receptive fields that adapt naturally to different resolutions. This spectral parameterisation allows the network to capture long-range dependencies without the need for excessively deep architectures, a common requirement in traditional CNNs. Crucially, SONIC utilises orientation-selective components, enabling it to recognise patterns regardless of their angle or position within an image. Experiments demonstrate that this approach provides a principled and scalable alternative to conventional spatial and spectral operators, offering a significant leap forward in computer vision technology.
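To make the idea concrete, here is a minimal, illustrative sketch of a spectral convolution, not the authors' code: a filter is expressed as a weighted sum of a few smooth frequency-domain components and applied by pointwise multiplication after an FFT. The component shapes (isotropic Gaussians) and mixing weights below are our own assumptions, chosen only to demonstrate the global receptive field.

```python
import numpy as np

def spectral_conv(x, coeffs, basis):
    # Filter spectrum = weighted sum of shared smooth components,
    # applied by pointwise multiplication in the frequency domain
    # (equivalent to a circular convolution in the spatial domain).
    H = np.tensordot(coeffs, basis, axes=1)
    return np.real(np.fft.ifft2(np.fft.fft2(x) * H))

n = 32
fy = np.fft.fftfreq(n)[:, None]
fx = np.fft.fftfreq(n)[None, :]
# Hypothetical component bank: isotropic Gaussians of different bandwidths
# (stand-ins for the paper's learned, orientation-selective components).
basis = np.stack([np.exp(-(fx**2 + fy**2) / (2 * s**2)) for s in (0.02, 0.1, 0.3)])
coeffs = np.array([0.5, 0.3, 0.2])   # illustrative mixing weights

rng = np.random.default_rng(0)
x = rng.normal(size=(n, n))
y = spectral_conv(x, coeffs, basis)

# Global receptive field: a single-pixel change alters every output value,
# something a stack of small spatial kernels only achieves at great depth.
x2 = x.copy()
x2[0, 0] += 1.0
delta = np.abs(spectral_conv(x2, coeffs, basis) - y)
print(delta.min() > 0.0)
```

Because the filter lives in the frequency domain, its effective spatial support covers the whole image in a single layer, which is the property the paragraph above describes.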
Across a range of tests, including synthetic benchmarks, large-scale image classification, and analysis of 3D medical datasets, SONIC consistently showed improved robustness to geometric transformations, noise, and variations in resolution. Remarkably, it matched or exceeded the performance of existing convolutional, attention-based, and spectral architectures while utilising an order of magnitude fewer parameters. This reduction in parameters not only enhances computational efficiency but also contributes to better generalisation and reduced overfitting, particularly important when dealing with limited datasets. The study reveals that continuous, orientation-aware spectral parameterisations are a powerful tool for building more adaptable and efficient vision systems.
This work opens exciting possibilities for real-world applications, from improved medical image analysis and autonomous driving to more robust object recognition in challenging environments. By mimicking the efficiency and adaptability of human visual processing, SONIC offers a pathway towards computer vision models that are less sensitive to variations in image quality, orientation, and scale. Experiments employed synthetic benchmarks, large-scale image classification datasets, and 3D medical data to rigorously evaluate SONIC's performance.
Robustness was tested meticulously against geometric transformations, noise, and resolution shifts, with significant improvements over conventional methods. The spectral parameterisation yields a far more compact representation of convolutional filters, which is how the method matches or surpasses CNNs, attention-based networks, and prior spectral approaches with an order of magnitude fewer parameters. The framework is grounded in the mathematical foundations of multidimensional signals, providing inherent resolution invariance and full convolutional expressiveness.
To achieve this, the authors moved beyond traditional spatial-domain operators, which must expand kernel support or introduce gaps (dilation) to enlarge receptive fields. Instead, they harnessed the frequency domain, enabling global context integration within a single layer. This shows that long-range structure can be captured more efficiently through spectral representations than through local interactions alone, overcoming the limitations of the fixed sampling grids used in standard convolutions. The design also avoids the quadratic computational cost of self-attention in ViTs. By leveraging orientation-aware spectral parameterisations, the study achieved resolution-invariant perception, drawing inspiration from the adaptability and robustness of human visual processing. This delivers a principled and scalable alternative to conventional spatial and spectral operators, paving the way for more efficient and adaptable vision models.
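The scaling argument can be illustrated with simple back-of-the-envelope arithmetic (our own sketch, not figures from the paper): stacked 3×3 convolutions need depth proportional to the image width before their receptive field spans the image, self-attention grows quadratically in the number of pixels, while an FFT-based spectral layer grows only as N log N.

```python
import math

def conv_layers_to_span(image_size, kernel=3):
    # L stacked k x k convolutions give a receptive field of 1 + L*(k-1)
    # pixels, so spanning the image needs about (size - 1) / (k - 1) layers.
    return math.ceil((image_size - 1) / (kernel - 1))

def mixing_cost(h, w):
    # Rough per-layer operation counts for global mixing over N = h*w
    # pixels: self-attention is O(N^2), an FFT-based layer is O(N log N).
    n = h * w
    return {"attention": n * n, "spectral": n * math.log2(n)}

print(conv_layers_to_span(224))              # 112 layers of 3x3 convs
cost = mixing_cost(224, 224)
print(cost["attention"] / cost["spectral"])  # attention is ~3000x costlier here
```

These are asymptotic sketches, not benchmarks, but they show why a single spectral layer can replace either a very deep convolutional stack or a quadratically expensive attention layer for global context.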
At the heart of the method are orientation-selective components that define smooth responses across the full frequency domain, yielding global receptive fields and filters that adapt naturally across resolutions. CNNs, while effective at capturing local features, struggle with global context without deep architectures and are sensitive to geometric variations. Conversely, ViTs provide global connectivity but lack spatial inductive bias and are computationally expensive, scaling quadratically with image area.
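As a toy illustration of orientation selectivity (our own construction, not the paper's actual parameterisation), an oriented component can be written as an anisotropic Gaussian in the frequency plane, elongated along a chosen angle; a small shared bank of such components at several orientations could then be mixed into per-layer filters.

```python
import numpy as np

def oriented_component(n, theta, sigma_along=0.3, sigma_across=0.05):
    # One hypothetical orientation-selective spectral component: a Gaussian
    # ridge in the frequency plane, elongated along direction theta.
    fy = np.fft.fftfreq(n)[:, None]
    fx = np.fft.fftfreq(n)[None, :]
    u =  np.cos(theta) * fx + np.sin(theta) * fy   # rotate frequency coords
    v = -np.sin(theta) * fx + np.cos(theta) * fy
    return np.exp(-(u**2 / (2 * sigma_along**2) + v**2 / (2 * sigma_across**2)))

# A small shared bank covering four orientations; in a trained model the
# per-layer filters would be learned mixtures of such components.
angles = np.linspace(0, np.pi, 4, endpoint=False)
bank = np.stack([oriented_component(32, t) for t in angles])
print(bank.shape)  # (4, 32, 32)
```

Because each component is smooth over the whole spectrum, any mixture of them inherits the global, resolution-friendly behaviour described above, while the orientation of the ridge makes the filter respond preferentially to structure at a given angle.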
SONIC addresses these issues by providing a structured, global representation with a significantly reduced parameter count. The spectral framework naturally provides global receptive fields and full convolutional expressiveness, offering a lightweight foundation for scalable vision models. Tests show that the method integrates information over long spatial ranges, a challenge for standard convolutions, which are limited by local receptive fields. The authors also show that expanding receptive fields with large kernels or dilated convolutions ties filters to a particular image resolution, limiting generalisation.
SONIC circumvents this with its spectral approach, which is inherently resolution invariant. Results further show that performance does not degrade at higher resolutions; the method remains efficient in high-resolution domains where attention-based mechanisms struggle. The authors argue that this narrows the conceptual gap between current computer vision models and human visual processing, which exhibits remarkable robustness across varying orientations, scales, and resolutions.
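Resolution invariance follows from defining the filter as a continuous function of physical frequency rather than over a fixed pixel grid. A minimal sketch, with our own hypothetical filter and test signal, using frequencies in cycles per image so the parameters mean the same thing at any sampling rate:

```python
import numpy as np

def continuous_spectral_filter(x, sigma=4.0):
    # The filter is a continuous function of frequency measured in cycles
    # per image (not per pixel), so the same parameters apply to any grid.
    n = x.shape[0]
    k = np.fft.fftfreq(n, d=1.0 / n)   # integer freqs, resolution-independent
    H = np.exp(-(k[:, None]**2 + k[None, :]**2) / (2 * sigma**2))
    return np.real(np.fft.ifft2(np.fft.fft2(x) * H))

def sample_signal(n):
    # The same band-limited pattern sampled on an n x n grid over [0, 1)^2.
    t = np.arange(n) / n
    return np.sin(2 * np.pi * 3 * t)[:, None] + np.cos(2 * np.pi * 5 * t)[None, :]

y32 = continuous_spectral_filter(sample_signal(32))
y64 = continuous_spectral_filter(sample_signal(64))
# The same filter parameters produce consistent outputs at both resolutions:
print(np.allclose(y64[::2, ::2], y32))  # True
```

A spatial kernel of fixed pixel size would behave differently at the two resolutions; here the filter's physical bandwidth is fixed, so only the sampling changes, which is the resolution-invariance property the paragraph describes.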
👉 More information
🗞 SONIC: Spectral Oriented Neural Invariant Convolutions
🧠 ArXiv: https://arxiv.org/abs/2601.19884
