A reconfigurable Tensor Manipulation Unit (TMU), a near-memory hardware block, accelerates tensor data movement within AI systems-on-chip (SoCs). Fabricated in SMIC 40nm technology, the TMU achieves up to a 1413x reduction in operator-level latency relative to an ARM Cortex-A72 and a 34.6% reduction in overall inference latency when paired with a Tensor Processing Unit (TPU).
The efficient execution of artificial intelligence algorithms increasingly depends not only on raw computational power but also on the swift, flexible manipulation of the large multi-dimensional arrays, known as tensors, that underpin these systems. Current system-on-chip (SoC) designs often prioritise accelerating tensor computation while overlooking the substantial overhead of moving and reshaping these datasets. Researchers from the University of Macau, the Shenzhen Institutes of Advanced Technology, University College Dublin, and Nanyang Technological University address this imbalance with a novel hardware architecture. Weiyu Zhou, Zheng Wang, Chao Chen, Yike Li, Yongkui Yang, Zhuoyu Wu, and Anupam Chattopadhyay detail their work in “Tensor Manipulation Unit (TMU): Reconfigurable, Near-Memory Tensor Manipulation for High-Throughput AI SoC”. They present a reconfigurable unit that optimises data movement within an AI SoC and, when integrated with a Tensor Processing Unit (TPU), delivers significant reductions in both operator-level and end-to-end inference latency.
Artificial intelligence systems increasingly require efficient data handling, and the researchers developed the reconfigurable TMU to accelerate data-intensive tensor operations within SoCs for deep learning inference. A tensor is a multi-dimensional array, fundamental to representing data in machine learning, and manipulating these tensors, which largely amounts to moving data between memory locations, is often a significant performance bottleneck. AI SoC design has traditionally prioritised computational power while neglecting the equally demanding task of tensor manipulation.
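To make the distinction concrete, the short NumPy sketch below lists a few common tensor manipulation operators. These are illustrative examples of the operator class the TMU targets, not a list taken from the paper; the point is that each one only reorders or copies elements in memory rather than computing on them.

```python
# A minimal NumPy sketch of the kind of data-movement operators at issue
# (illustrative examples, not the TMU's documented operator set).
import numpy as np

x = np.arange(2 * 3 * 4, dtype=np.int32).reshape(2, 3, 4)  # a small 3-D tensor

permuted = np.transpose(x, (2, 0, 1))         # reorder dimensions (layout change)
flattened = x.reshape(6, 4)                   # reinterpret shape, same elements
stacked = np.concatenate([x, x], axis=0)      # join tensors along an axis
padded = np.pad(x, ((0, 0), (1, 1), (0, 0)))  # insert zero borders

# None of these operators performs arithmetic on the values themselves;
# each one only relocates or duplicates elements in memory, which is why
# they stress memory bandwidth rather than the compute units.
```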
Recognising that deep learning workloads suffer substantial data-movement overhead, the researchers designed the TMU to operate in a near-memory fashion, manipulating data streams directly between memory locations. The architecture employs a Reduced Instruction Set Computing (RISC)-inspired execution model that keeps instruction processing simple, together with a unified addressing scheme that lets a single instruction format describe a wide range of tensor transformations. Because each instruction is simple and regular, operations can be pipelined: a subsequent operation can begin before the current one completes, which raises throughput and reduces latency.
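As a rough illustration of how a RISC-style instruction with unified addressing might describe such a transfer, the Python sketch below models a single “copy” instruction whose source and destination are both addressed with the same base-plus-strides formula. The field names, encoding, and interpreter are assumptions made for illustration; they are not the TMU’s actual instruction set.

```python
# A hypothetical sketch of a RISC-style tensor-manipulation instruction with a
# unified (source/destination) addressing scheme. All names are illustrative.
from dataclasses import dataclass
from itertools import product

@dataclass
class TmInstr:
    opcode: str           # e.g. "COPY" -- one simple operation per instruction
    src_base: int         # base address of the source tensor in memory
    dst_base: int         # base address of the destination tensor
    shape: tuple          # logical extent of the transfer, e.g. (3, 4)
    src_strides: tuple    # element strides describing the source layout
    dst_strides: tuple    # element strides describing the destination layout

def execute(instr: TmInstr, memory: list) -> None:
    """Walk the logical index space once; the same base-plus-strides formula
    serves both source and destination (the essence of unified addressing)."""
    for idx in product(*(range(n) for n in instr.shape)):
        src = instr.src_base + sum(i * s for i, s in zip(idx, instr.src_strides))
        dst = instr.dst_base + sum(i * s for i, s in zip(idx, instr.dst_strides))
        memory[dst] = memory[src]

# Example: transpose a 3x4 row-major matrix into a 4x3 row-major matrix simply
# by changing the destination strides -- no arithmetic on the data, only movement.
mem = list(range(12)) + [0] * 12
execute(TmInstr("COPY", 0, 12, (3, 4), (4, 1), (1, 3)), mem)
print(mem[12:])  # the original matrix read out column by column
```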
The TMU’s design incorporates double buffering and output forwarding, two features that improve pipeline utilisation and processing efficiency. Double buffering uses a pair of memory buffers so data can stream continuously: while one buffer is being processed, the other is being filled. Output forwarding passes the result of an operation directly to the next one, minimising idle time. Together these techniques let the unit overlap data transfer with processing, accelerating overall performance. The researchers fabricated the TMU in SMIC 40nm technology, achieving a silicon footprint of only 0.019 mm², small enough for integration into resource-constrained edge devices.
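The timing sketch below, using made-up per-tile load and process costs rather than figures from the paper, shows why ping-pong buffering helps: once both buffers are in play, each step costs only the slower of the two stages rather than their sum.

```python
# A minimal timing model of double buffering, with illustrative per-tile costs.
def total_time(n_tiles, t_load, t_process, double_buffered):
    if not double_buffered:
        # Single buffer: each tile must be fully loaded before it is processed.
        return n_tiles * (t_load + t_process)
    # Ping-pong buffers: while tile k is processed out of one buffer, tile k+1
    # is loaded into the other, so the steady-state cost per tile is the
    # maximum of the two stage times, not their sum.
    return t_load + (n_tiles - 1) * max(t_load, t_process) + t_process

print(total_time(8, t_load=10, t_process=12, double_buffered=False))  # 176
print(total_time(8, t_load=10, t_process=12, double_buffered=True))   # 106
```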
Benchmarking shows substantial gains: on its own, the TMU reduces operator-level latency by up to 1413x relative to an ARM Cortex-A72 and by up to 8.54x relative to an NVIDIA Jetson TX2. These results highlight the unit’s intrinsic efficiency at data-intensive tasks and its ability to significantly outperform conventional processors. The researchers validated the TMU across more than ten representative tensor manipulation operators, confirming its versatility and broad applicability.
The true power of the TMU emerges when it is integrated with an in-house-developed Tensor Processing Unit (TPU), a specialised processor designed to accelerate tensor computations. The combined system achieves a 34.6% reduction in end-to-end inference latency, a synergistic effect in which faster data movement and faster computation reinforce one another.
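A back-of-the-envelope calculation helps relate the two headline numbers. The fractions in the sketch below are hypothetical, chosen only to show how a very large operator-level speedup translates into a more modest end-to-end saving when tensor manipulation is only part of the overall inference time.

```python
# Amdahl-style estimate with hypothetical fractions (not figures from the paper):
# only the tensor-manipulation share of inference time benefits from the TMU.
def end_to_end_reduction(manip_fraction, manip_speedup):
    """Return the fraction of total latency removed when the manipulation
    portion of the workload is accelerated by the given factor."""
    remaining = (1 - manip_fraction) + manip_fraction / manip_speedup
    return 1 - remaining

# If, say, ~35% of inference time were spent on tensor manipulation and the
# TMU made that portion roughly 50x faster, the end-to-end saving would be:
print(f"{end_to_end_reduction(0.35, 50):.1%}")  # ~34.3%
```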
Future work could explore scaling the TMU architecture to increasingly complex tensor operations and larger datasets, for example by partitioning and distributing the workload across multiple TMU instances (a sketch of one such scheme follows below). Dynamic reconfiguration, allowing the unit to adapt to varying workloads, is another promising direction, as it would let the TMU tune its behaviour to different classes of tensor operations. Applying the TMU beyond deep learning, in data-intensive domains such as image processing and scientific computing, could further broaden its impact and applicability.
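The following Python sketch is speculative and not taken from the paper: it merely illustrates one way a large strided transfer could be split along its outermost dimension so that several hypothetical TMU instances each handle a contiguous slice.

```python
# A speculative partitioning sketch (not from the paper): split the outermost
# dimension into contiguous slices and give each instance its own base offsets.
def partition(shape, src_strides, dst_strides, n_instances):
    """Yield (sub_shape, src_offset, dst_offset) work items, one per instance."""
    outer = shape[0]
    chunk = (outer + n_instances - 1) // n_instances  # ceiling division
    for k in range(n_instances):
        start = k * chunk
        rows = min(chunk, outer - start)
        if rows <= 0:
            break
        yield ((rows,) + shape[1:],
               start * src_strides[0],   # each instance starts deeper into src
               start * dst_strides[0])   # ...and correspondingly into dst

# Example: split a (1024, 256) transposing copy across 4 hypothetical instances.
for item in partition((1024, 256), (256, 1), (1, 1024), 4):
    print(item)
```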
The TMU’s compact design facilitates integration into a wide range of devices, and its efficient data handling capabilities make it well-suited for deployment in resource-constrained environments.
The development of the TMU marks a significant step forward in AI hardware, addressing a critical bottleneck in deep learning workloads. By treating data movement as a first-class concern and optimising the flow of tensors through the system, the TMU delivers substantial performance improvements and opens up new possibilities for deploying AI applications across a wide range of environments, with the scalability and cross-domain questions outlined above as natural next steps.
👉 More information
🗞 Tensor Manipulation Unit (TMU): Reconfigurable, Near-Memory Tensor Manipulation for High-Throughput AI SoC
🧠 DOI: https://doi.org/10.48550/arXiv.2506.14364
