The challenge of automatically designing effective deep neural networks, a process known as neural architecture search (NAS), is gaining considerable momentum in artificial intelligence research. Shota Suzuki and Satoshi Ono, both from the Graduate School of Science and Engineering at Kagoshima University, alongside their colleagues, address a key limitation in this field, specifically for networks that integrate information from multiple sources, such as images and text. These multimodal deep neural networks, while powerful, demand complex structures and typically require vast amounts of labeled data for successful design, a constraint this team overcomes with a novel approach. Their research introduces a self-supervised learning method that enables the automated design of these networks using only unlabeled data, representing a significant step towards more accessible and efficient artificial intelligence development.
In recent years, Neural Architecture Search (NAS), which automates the design of Deep Neural Networks (DNNs), has attracted increasing attention. Multimodal DNNs, which combine information from different sources, benefit from NAS due to their structural complexity; however, constructing these architectures typically requires substantial labeled training data.
Self-Supervised Neural Architecture Search for Multimodal Data
This paper presents a new method for automatically designing effective neural network architectures for multimodal data, such as images and text. The key innovation is the use of self-supervised learning during the architecture search itself: the system learns from unlabeled data to guide the search, reducing the need for large amounts of labeled data. This approach enables the discovery of effective multimodal DNN architectures with limited labeled data, improving performance and reducing annotation costs.
The method uses gradient-based neural architecture search, optimizing the network structure by gradient descent. Self-supervised learning, specifically contrastive learning, is employed to learn useful representations from unlabeled data by encouraging the network to group similar inputs together and separate dissimilar ones. The search itself is formulated as a bilevel optimization problem that treats architecture search and weight training as interconnected, alternately solved subproblems.
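To make this concrete, here is a minimal, first-order sketch of DARTS-style differentiable architecture search, the family of methods this gradient-based search belongs to. The candidate operations, dimensions, and optimizers are illustrative assumptions, not the authors' exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Softmax(alpha)-weighted mixture over candidate operations."""
    def __init__(self, dim):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Linear(dim, dim),
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture parameters

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

model, head = MixedOp(dim=16), nn.Linear(16, 2)
weights = [p for name, p in model.named_parameters() if name != "alpha"]
w_opt = torch.optim.SGD(list(head.parameters()) + weights, lr=0.1)  # lower level
a_opt = torch.optim.Adam([model.alpha], lr=3e-3)                    # upper level

def step(x, y, opt):
    opt.zero_grad()
    F.cross_entropy(head(model(x)), y).backward()
    opt.step()

x_tr, y_tr = torch.randn(8, 16), torch.randint(0, 2, (8,))
x_va, y_va = torch.randn(8, 16), torch.randint(0, 2, (8,))
for _ in range(10):
    step(x_tr, y_tr, w_opt)  # train network weights on the training split
    step(x_va, y_va, a_opt)  # train architecture params on the validation split
```

Alternating the two updates is the first-order approximation of the bilevel objective; the full formulation differentiates the validation loss through the weight update.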
Experiments were conducted on the MM-IMDB dataset, a collection of movie posters and plot descriptions used for multi-label genre classification. Performance was evaluated with the weighted F1-score, which accounts for class imbalance in the dataset. The proposed method was compared against existing state-of-the-art techniques, BM-NAS and MFAS, using Maxout MLP and VGG Transfer as base models for feature extraction.
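For reference, the weighted F1-score averages per-class F1 values weighted by each class's support, which keeps rare genres from being drowned out. A minimal computation with scikit-learn, using toy multi-label indicator matrices rather than actual MM-IMDB labels:

```python
from sklearn.metrics import f1_score

# Rows are samples, columns are genres; 1 marks an assigned label.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
print(f1_score(y_true, y_pred, average="weighted"))  # support-weighted mean of per-class F1
```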
Results demonstrate that the proposed method achieved higher weighted F1-scores than BM-NAS when labeled data was limited, highlighting the benefit of self-supervision. The discovered architectures often included connections from earlier layers of the image and text backbones, suggesting effective fusion of information. With abundant labeled data, performance was comparable to that of architectures found by MFAS and BM-NAS, and the method discovered structures similar to those found by BM-NAS.
The paper concludes that this method is a promising approach for neural architecture search in multimodal learning, particularly when labeled data is scarce. Future work includes evaluating the method on other datasets and further investigating the characteristics of the discovered architectures.
Self-Supervised Multimodal Neural Network Design
Scientists have developed a novel method for automatically designing deep neural networks for processing multiple data types. This work addresses the challenge of constructing effective network structures for multimodal deep learning, which often requires large amounts of labeled training data. The team proposes a self-supervised learning approach that enables network design using unlabeled data, reducing the need for costly annotation.
The research builds upon existing neural architecture search techniques, adapting them to multimodal data by incorporating contrastive learning, a method that learns representations by pulling different views of the same sample together and pushing views of different samples apart (a loss of this kind is sketched below). Experiments demonstrate the successful design of deep network architectures solely from unlabeled training data, achieved by integrating self-supervised learning into both the architecture search and model pretraining processes.
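A minimal SimCLR-style NT-Xent loss illustrates the idea; this is the generic formulation of the contrastive objective, with batch size and embedding width chosen arbitrarily:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent: two views of the same sample are positives,
    every other sample in the batch is a negative."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2N, D), unit norm
    sim = z @ z.t() / temperature                               # pairwise cosine similarity
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                  # exclude self-similarity
    targets = (torch.arange(2 * n, device=z.device) + n) % (2 * n)
    return F.cross_entropy(sim, targets)                        # positive = the other view

# Projections of two augmented views of the same 8-sample batch.
loss = nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128))
```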
The method leverages a bilevel multimodal neural architecture search framework, building upon the BM-NAS approach and incorporating the SimCLR contrastive learning technique. This innovative approach explores the structure of the fusion model, which combines information from different data types, and identifies critical points for integrating these modalities. The team constructed a network architecture comprising backbone models tailored to each modality and a fusion model designed to interconnect them, employing a hierarchical framework to define the connections within the fusion model.
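The description above suggests a structure along the following lines; this is a hypothetical simplification with stand-in backbones and sizes, not the paper's actual search space:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionNode(nn.Module):
    """Softmax-weighted selection over candidate input features."""
    def __init__(self, n_inputs, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(n_inputs))  # learned connection weights
        self.proj = nn.Linear(dim, dim)

    def forward(self, inputs):  # inputs: list of (B, dim) tensors
        w = F.softmax(self.alpha, dim=0)
        return torch.relu(self.proj(sum(wi * x for wi, x in zip(w, inputs))))

class MultimodalNet(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # Stand-in backbones; in practice e.g. a VGG for posters and a text
        # encoder for descriptions, each exposing intermediate features.
        self.img_backbone = nn.ModuleList(nn.Linear(dim, dim) for _ in range(3))
        self.txt_backbone = nn.ModuleList(nn.Linear(dim, dim) for _ in range(3))
        # Hierarchical fusion: node1 picks among all intermediate features of
        # both modalities; node2 can reuse node1's output or any raw feature.
        self.node1 = FusionNode(n_inputs=6, dim=dim)
        self.node2 = FusionNode(n_inputs=7, dim=dim)

    def forward(self, img, txt):
        feats = []
        for layer in self.img_backbone:
            img = torch.relu(layer(img))
            feats.append(img)
        for layer in self.txt_backbone:
            txt = torch.relu(layer(txt))
            feats.append(txt)
        h = self.node1(feats)            # where to fuse is decided by alpha
        return self.node2([h] + feats)

out = MultimodalNet()(torch.randn(4, 64), torch.randn(4, 64))  # (4, 64) fused features
```

After search, the largest architecture weights indicate which layer-level connections to keep, mirroring the finding that the discovered architectures often favored connections from earlier backbone layers.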
Experiments confirm that the method constructs network architectures comparable to those discovered by conventional methods that rely on labeled data, opening new avenues for developing powerful multimodal systems with reduced data requirements. This advancement promises to accelerate progress in applications requiring the integration of diverse data sources.
Self-Supervised Neural Architecture Search for Multimodal Networks
This research presents a novel gradient-based neural architecture search method specifically designed for multimodal neural networks. The team successfully integrated self-supervised learning into both feature extraction and model pretraining, enabling the automated design of effective deep neural networks using only unlabeled data.
Experimental validation on a standard dataset demonstrated that the discovered network architectures achieve performance comparable to that of architectures created with existing supervised learning methods. The findings represent a significant step towards more efficient and accessible development of multimodal deep learning systems, potentially broadening their application in areas where labeled data is scarce or expensive to obtain.
While the method showed promising results, the authors acknowledge the need for further investigation across diverse datasets to fully establish its robustness and generalizability. Future work will focus on these expanded experimental validations, aiming to provide a more comprehensive understanding of the method’s effectiveness and limitations.
👉 More information
🗞 Self-Supervised Neural Architecture Search for Multimodal Deep Neural Networks
🧠 ArXiv: https://arxiv.org/abs/2512.24793
