Smarter AI Chips Boost Performance 8x with New Compression Technique

Researchers are tackling the significant challenge of deploying deep neural networks (DNNs) on resource-constrained RISC-V platforms. Theologos Anthimopoulos, Milad Kokhazadeh, and Vasilios Kelefouras, from Aristotle University of Thessaloniki and the University of Plymouth, alongside Benjamin Himpel and Georgios Keramidas, demonstrate a novel methodology for optimising fully connected layers within DNNs using Tensor Train Decomposition (TTD). This work is significant because it introduces an end-to-end design exploration process and a specialised tool that efficiently navigates the complex trade-offs between computational cost, memory usage, and accuracy inherent in low-rank factorisation. By pruning inefficient decomposition shapes and applying targeted compiler optimisations, the team achieves substantial performance gains, with TT-decomposed layers running on average three times faster than IREE and eight times faster than Pluto on the same compressed model, paving the way for effective DNN deployment on edge and embedded devices.

Addressing the substantial computational and memory demands of fully connected layers, which dominate resource consumption in applications like natural language processing and autonomous systems, the work introduces an end-to-end low-rank factorization (LRF) design exploration process.

This innovation enables efficient deployment of deep neural networks on resource-constrained edge and embedded devices. This targeted approach streamlines the optimisation process, moving beyond the complex trade-offs typically encountered when balancing FLOPs, memory size, inference time, and accuracy.
Following decomposition, compiler optimisations are applied to further enhance the performance of the custom T3F layers, minimising inference time and maximising computational efficiency. Evaluations demonstrate that the resulting Tensor Train-decomposed layers achieve an average speedup of 3x over IREE and 8x over Pluto on the same compressed model.

Figure 1 illustrates the significant contribution of fully connected layers to overall model parameters and floating-point operations (FLOPs), highlighting the impact of this optimisation. The methodology systematically prunes the low-rank factorization design space, first excluding decomposition shapes deemed inefficient and then removing solutions that exhibit poor inference performance when deployed on RISC-V architectures. The result is a practical path to deploying complex DNNs on platforms with limited resources.

This targeted pruning strategy optimizes both computational cost and speed. The study employs a specialized design tool to facilitate this optimization for fully connected layers on RISC-V processors. The tool leverages T3F to decompose tensors, breaking large tensors into a chain of smaller, interconnected tensor cores combined via the Kronecker product.

By reshaping matrices into tensors, the TTD method compresses data while maintaining computational efficiency. The work specifically targets fully connected layers, recognising their substantial contribution to overall computational demands and memory usage in deep neural networks, as demonstrated by analysis of parameter and FLOPs percentages across various DNN models.
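To make the reshaping concrete, the idea can be sketched in plain NumPy (the paper's tool builds on T3F/TensorFlow; the factor choices and TT-rank below are illustrative assumptions, not the authors' selected configuration). A 120×84 weight matrix is represented by two small TT cores and reconstructed by contracting them over the shared rank index:

```python
import numpy as np

# Toy TT-matrix for a 120x84 fully connected layer. The factorizations
# 120 = 12*10, 84 = 12*7 and the rank r = 4 are illustrative choices.
m = (12, 10)   # output-dimension factors: 12*10 = 120
n = (12, 7)    # input-dimension factors:  12*7  = 84
r = 4          # internal TT-rank (boundary ranks are 1)

rng = np.random.default_rng(0)
G1 = rng.standard_normal((1, m[0], n[0], r))   # core 1: (r0, m1, n1, r1)
G2 = rng.standard_normal((r, m[1], n[1], 1))   # core 2: (r1, m2, n2, r2)

# Reconstruct the full weight matrix by contracting the cores over the
# rank index, then merging (m1, m2) -> rows and (n1, n2) -> columns.
W = np.einsum('aijb,bklc->ikjl', G1, G2).reshape(120, 84)

dense_params = 120 * 84
tt_params = G1.size + G2.size
print(W.shape, dense_params, tt_params)
```

Here the two cores hold 856 parameters versus 10,080 for the dense matrix, illustrating how TTD trades a small rank-dependent reconstruction cost for a large reduction in stored weights.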

The core contribution is an end-to-end Low-Rank Factorization (LRF) design exploration methodology, paired with a specialized tool, for optimizing these layers on RISC-V processors.

This methodology effectively prunes the LRF design space by initially excluding inefficient decomposition shapes and subsequently removing solutions exhibiting poor inference performance on the target architecture. Even for a modest fully connected layer of dimensions 120×84, the Design Space (DS) of potential LRF configurations is large, and for larger layers it can contain up to 10³³ possible solutions.
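The combinatorial blow-up is easy to reproduce. The sketch below is a deliberate simplification (it considers only two-factor reshapings and a small rank range, whereas the full DS admits more general tensor shapes), but it already enumerates over a thousand candidate configurations for the 120×84 layer:

```python
from itertools import product

def factor_pairs(n):
    """All ordered ways to write n as a product of two integers > 1."""
    return [(a, n // a) for a in range(2, n) if n % a == 0 and n // a > 1]

out_shapes = factor_pairs(120)   # ways to reshape the 120 output units
in_shapes = factor_pairs(84)     # ways to reshape the 84 inputs
ranks = range(1, 9)              # candidate TT-ranks (illustrative bound)

design_space = [(m, n, r) for m, n, r in product(out_shapes, in_shapes, ranks)]
print(len(out_shapes), len(in_shapes), len(design_space))
```

Allowing more tensor dimensions and wider rank ranges multiplies these counts, which is what drives the DS toward the astronomical sizes reported for large layers and makes aggressive pruning essential.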

High-level DS reduction techniques were employed to narrow the search by excluding decomposition shapes that did not achieve low Floating-Point Operations (FLOPs). This was followed by low-level DS reduction, eliminating solutions with inefficient execution times, ensuring retention of only the most promising configurations.
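A minimal version of the high-level FLOPs filter might look as follows. The cost model and the 0.5× threshold here are assumptions for illustration; the paper's heuristics are more involved and are followed by the low-level pruning step based on measured execution time on RISC-V:

```python
def tt_flops(m, n, r):
    """Approximate operation count for a 2-core TT matrix-vector product:
    contract the reshaped input with core 2, then with core 1."""
    (m1, m2), (n1, n2) = m, n
    return 2 * (n1 * r * m2 * n2 + m1 * n1 * r * m2)

def tt_params(m, n, r):
    """Total parameters stored in the two TT cores."""
    (m1, m2), (n1, n2) = m, n
    return m1 * n1 * r + r * m2 * n2

dense_flops = 2 * 120 * 84   # dense matrix-vector product for a 120x84 layer

# High-level pruning: keep only decomposition shapes whose estimated
# FLOPs fall well below the dense layer's cost (threshold is illustrative).
candidates = [((12, 10), (12, 7), r) for r in range(1, 9)]
kept = [c for c in candidates if tt_flops(*c) < 0.5 * dense_flops]
print(len(candidates), len(kept))
```

Only the smallest ranks survive this filter for the example shape; the surviving configurations would then be benchmarked on the target RISC-V processor to discard those with inefficient execution times.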

Figure 2b demonstrates that LRF solutions with comparable memory footprints can exhibit significant differences in FLOPs and execution time, highlighting the need to evaluate configurations for both memory efficiency and computational performance. The proposed methodology reduces the design space by several orders of magnitude, enabling efficient deployment of Deep Neural Networks (DNNs) on resource-constrained platforms. Furthermore, the custom T3F layers generated through this process demonstrate substantial speedups, facilitating DNN deployment on edge and embedded devices powered by RISC-V architectures.

RISC-V Optimisation via Low-Rank Factorisation and Compiler Techniques delivers significant performance gains

An end-to-end low-rank factorization (LRF) design space exploration methodology and associated tool have been developed to optimise fully connected layers on RISC-V processors. Furthermore, heuristics tailored to RISC-V platforms prune LRF solutions that cannot achieve low inference latency, systematically eliminating candidates that fail to meet vectorisation and scalability constraints.

Critical compiler optimisations are then applied to custom layers, reducing inference latency and maximising computational efficiency. Experiments demonstrate that these Tensor Train decomposed layers execute on average three times faster than IREE and eight times faster than Pluto on the same compressed model, achieving a twelve-fold speedup compared to the original uncompressed model running on IREE.

The authors acknowledge that while low computational complexity or memory usage are desirable, they do not always translate to efficient inference times, necessitating the RISC-V specific heuristics employed. Future work could explore the broader applicability of the compiler techniques to other processor architectures, extending the impact of this research. These findings establish an efficient solution for deploying deep neural networks on edge and embedded devices powered by RISC-V architectures, paving the way for more powerful and efficient embedded machine learning applications.

👉 More information
🗞 Optimizing Tensor Train Decomposition in DNNs for RISC-V Architectures Using Design Space Exploration and Compiler Optimizations
🧠 ArXiv: https://arxiv.org/abs/2602.01996

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
