Catalogue Search | MBRL

PULP-NN: accelerating quantized neural networks on parallel ultra-low-power RISC-V processors

by Garofalo, Angelo , Rusci, Manuele , Rossi, Davide

2020

We present PULP-NN, an optimized computing library for a parallel ultra-low-power tightly coupled cluster of RISC-V processors. The key innovation in PULP-NN is a set of kernels for quantized neural network inference, targeting byte and sub-byte data types, down to INT-1, tuned for the recent trend toward aggressive quantization in deep neural network inference. The proposed library exploits both the digital signal processing extensions available in the PULP RISC-V processors and the cluster’s parallelism, achieving up to 15.5 MACs/cycle on INT-8 and improving performance by up to 63 × with respect to a sequential implementation on a single RISC-V core implementing the baseline RV32IMC ISA. Using PULP-NN, a CIFAR-10 network on an octa-core cluster runs in 30 × and 19.6 × less clock cycles than the current state-of-the-art ARM CMSIS-NN library, running on STM32L4 and STM32H7 MCUs, respectively. The proposed library, when running on a GAP-8 processor, outperforms by 36.8 × and by 7.45 × the execution on energy efficient MCUs such as STM32L4 and high-end MCUs such as STM32H7 respectively, when operating at the maximum frequency. The energy efficiency on GAP-8 is 14.1 × higher than STM32L4 and 39.5 × higher than STM32H7, at the maximum efficiency operating point. This article is part of the theme issue ‘Harmonizing energy-autonomous computing and intelligence’.

Journal Article

Share this book

Add to My Shelf

A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference

by Kersting, Benedikt , Philip, Timothy , Francese, Pier Andrea in 639/166/987 , 639/705/258 , Accuracy

2023

Analogue in-memory computing (AIMC) with resistive memory devices could reduce the latency and energy consumption of deep neural network inference tasks by directly performing computations within memory. However, to achieve end-to-end improvements in latency and energy consumption, AIMC must be combined with on-chip digital operations and on-chip communication. Here we report a multicore AIMC chip designed and fabricated in 14 nm complementary metal–oxide–semiconductor technology with backend-integrated phase-change memory. The fully integrated chip features 64 AIMC cores interconnected via an on-chip communication network. It also implements the digital activation functions and additional processing involved in individual convolutional layers and long short-term memory units. With this approach, we demonstrate near-software-equivalent inference accuracy with ResNet and long short-term memory networks, while implementing all the computations associated with the weight layers and the activation functions on the chip. For 8-bit input/output matrix–vector multiplications, in the four-phase (high-precision) or one-phase (low-precision) operational read mode, the chip can achieve a maximum throughput of 16.1 or 63.1 tera-operations per second at an energy efficiency of 2.48 or 9.76 tera-operations per second per watt, respectively. A multicore analogue in-memory computing chip that is designed and fabricated in 14 nm complementary metal–oxide–semiconductor technology with backend-integrated phase-change memory can be used for deep neural network inference.

Journal Article

Share this book

Add to My Shelf

PULP-NN

by Garofalo, Angelo , Rusci, Manuele , Benini, Luca

2020

We present PULP-NN, an optimized computing library for a parallel ultra-low-power tightly coupled cluster of RISC-V processors. The key innovation in PULP-NN is a set of kernels for quantized neural network inference, targeting byte and subbyte data types, down to INT-1, tuned for the recent trend toward aggressive quantization in deep neural network inference. The proposed library exploits both the digital signal processing extensions available in the PULP RISC-V processors and the cluster’s parallelism, achieving up to 15.5 MACs/cycle on INT-8 and improving performance by up to 63× with respect to a sequential implementation on a single RISC-V core implementing the baseline RV32IMC ISA. Using PULP-NN, a CIFAR-10 network on an octa-core cluster runs in 30× and 19.6× less clock cycles than the current state-of-the-art ARM CMSISNN library, running on STM32L4 and STM32H7 MCUs, respectively. The proposed library, when running on a GAP-8 processor, outperforms by 36.8× and by 7.45× the execution on energy efficient MCUs such as STM32L4 and high-end MCUs such as STM32H7 respectively, when operating at the maximum frequency. The energy efficiency on GAP-8 is 14.1× higher than STM32L4 and 39.5× higher than STM32H7, at the maximum efficiency operating point. This article is part of the theme issue ‘Harmonizing energy-autonomous computing and intelligence’.

Journal Article

Share this book

Add to My Shelf

MATCHA: Efficient Deployment of Deep Neural Networks on Multi-Accelerator Heterogeneous Edge SoCs

by Garofalo, Angelo , Palesi, Maurizio , Russo, Enrico in Accelerators , Artificial neural networks , Heterogeneity

2026

Deploying DNNs on System-on-Chips (SoC) with multiple heterogeneous acceleration engines is challenging, and the majority of deployment frameworks cannot fully exploit heterogeneity. We present MATCHA, a unified DNN deployment framework that generates highly concurrent schedules for parallel, heterogeneous accelerators and uses constraint programming to optimize L3/L2 memory allocation and scheduling. Using pattern matching, tiling, and mapping across individual HW units enables parallel execution and high accelerator utilization. On the MLPerf Tiny benchmark, using a SoC with two heterogeneous accelerators, MATCHA improves accelerator utilization and reduces inference latency by up to 35% with respect to the the state-of-the-art MATCH compiler.

Paper

Share this book

Add to My Shelf

RedMulE-FT: A Reconfigurable Fault-Tolerant Matrix Multiplication Engine

by Bertaccini, Luca , Garofalo, Angelo , Item, Maurus in Error correction , Error correction & detection , Error detection codes

2025

As safety-critical applications increasingly rely on data-parallel floating-point computations, there is an increasing need for flexible and configurable fault tolerance in parallel floating-point accelerators such as tensor engines. While replication-based methods ensure reliability but incur high area and power costs, error correction codes lack the flexibility to trade off robustness against performance. This work presents RedMulE-FT, a runtime-configurable fault-tolerant extension of the RedMulE matrix multiplication accelerator, balancing fault tolerance, area overhead, and performance impacts. The fault tolerance mode is configured in a shadowed context register file before task execution. By combining replication with error-detecting codes to protect the data path, RedMulE-FT achieves an 11x uncorrected fault reduction with only 2.3% area overhead. Full protection extends to control signals, resulting in no functional errors after 1M injections during our extensive fault injection simulation campaign, with a total area overhead of 25.2% while maintaining a 500 MHz frequency in a 12 nm technology.

Paper

Share this book

Add to My Shelf

Quadrilatero: A RISC-V programmable matrix coprocessor for low-power edge applications

by Garofalo, Angelo , Perotti, Matteo , Bertuletti, Marco in Array processors , Coprocessors , Floating point arithmetic

2025

The rapid growth of AI-based Internet-of-Things applications increased the demand for high-performance edge processing engines on a low-power budget and tight area constraints. As a consequence, vector processor architectures, traditionally designed for high-performance computing (HPC), made their way into edge devices, promising high utilization of floating-point units (FPUs) and low power consumption. However, vector processors can only exploit a single dimension of parallelism, leading to expensive accesses to the vector register file (VRF) when performing matrix computations, which are pervasive in AI workloads. To overcome these limitations while guaranteeing programmability, many researchers and companies are developing dedicated instructions for a more efficient matrix multiplication (MatMul) execution. In this context, we propose Quadrilatero, an open-source RISC-V programmable systolic array coprocessor for low-power edge applications that implements a streamlined matrix ISA extension. We evaluate the post-synthesis power, performance, and area (PPA) metrics of Quadrilatero in a mature 65-nm technology node, showing that it requires only 0.65 mm^2 and that it can reach up to 99.4% of FPU utilization. Compared to a state-of-the-art open-source RISC-V vector processor and a hybrid vector-matrix processor optimized for embedded applications, Quadrilatero improves area efficiency and energy efficiency by up to 77% and 15%, respectively.

Paper

Share this book

Add to My Shelf

Regional Convergence: Evidence from a New State-by-State Capital Stock Series

by Yamarik, Steven , Garofalo, Gasper A. in Capital , Capital stock , Capital stocks

2002

This paper seeks to reconcile the growth empirics technique of Mankiw, Romer, and Weil (1992) with the empirical results of Barro and Sala-“i-Martin (1991) through the development of a new database covering the 1977-96 period. We create state-by-state capital stock and gross investment estimates by apportioning the national capital stock among the states. Using these estimates along with gross state product and employment data, we find evidence that the Solow growth model explains state-wide growth during this period. We consistently find a rate of convergence of around 2%. Our results, as a consequence, suggest that the empirical results of Barro and Sala-í-Martin are driven by the neoclassical growth process of Solow.

Journal Article

Share this book

Add to My Shelf

relOBI: A Reliable Low-latency Interconnect for Tightly-Coupled On-chip Communication

by Garofalo, Angelo , Benini, Luca , Rogenmoser, Michael in Error correction , Microprocessors , Soft errors

2025

On-chip communication is a critical element of modern systems-on-chip (SoCs), allowing processor cores to interact with memory and peripherals. Interconnects require special care in radiation-heavy environments, as any soft error within the SoC interconnect is likely to cause a functional failure of the whole SoC. This work proposes relOBI, an extension to the Open Bus Interface (OBI) combining triple modular redundancy (TMR) for critical handshake signals with error correction codes (ECC) protection on other signals. Implementing and testing the reliable crossbar shows improved reliability to injected single faults from a vulnerability of 34.85 % to zero compared to the irredundant baseline, with an area increase of 2.6 \\(\\). The area overhead is 1.8 \\(\\) lower than that reported in the literature for fine-grained triplication and voting.

Paper

Share this book

Add to My Shelf

A Gigabit, DMA-enhanced Open-Source Ethernet Controller for Mixed-Criticality Systems

by Garofalo, Angelo , Liang, Chaoqun , Benz, Thomas in Autonomous navigation , Controllers , Ethernet

2024

The ongoing revolution in application domains targeting autonomous navigation, first and foremost automotive \"zonalization\", has increased the importance of certain off-chip communication interfaces, particularly Ethernet. The latter will play an essential role in next-generation vehicle architectures as the backbone connecting simultaneously and instantaneously the zonal/domain controllers. There is thereby an incumbent need to introduce a performant Ethernet controller in the open-source HW community, to be used as a proxy for architectural explorations and prototyping of mixed-criticality systems (MCSs). Driven by this trend, in this work, we propose a fully open-source, DMA-enhanced, technology-agnostic Gigabit Ethernet architecture that overcomes the limitations of existing open-source architectures, such as Lowrisc's Ethernet, often tied to FPGA implementation, performance-bound by sub-optimal design choices such as large memory buffers, and in general not mature enough to bridge the gap between academia and industry. Besides the area advantage, the proposed design increases packet transmission speed up to almost 3x compared to Lowrisc's and is validated through implementation and FPGA prototyping into two open-source, heterogeneous MCSs.

Paper

Share this book

Add to My Shelf

Who Checks the Checker? Enhancing Component-level Architectural SEU Fault Tolerance for End-to-End SoC Protection

by Garofalo, Angelo , Wu, Chen , Benini, Luca in Fault tolerance , Single event upsets , System on chip

2026

Single-event upset (SEU) fault tolerance for systems-on-chip (SoCs) in radiation-heavy environments is often addressed by architectural fault-tolerance approaches protecting individual SoC components (e.g., cores, memories) in isolation. However, the protection of voting logic and interconnections among components is also critical, as these become single points of failure in the design. We investigate combining multiple fault-tolerance approaches targeting individual SoC components, including interconnect and voting logic to ensure end-to-end SoC-level architectural SEU fault tolerance, while minimizing implementation area overheads. Enforcing an overlap between the protection methods ensures hardening of the whole design without gaps, while curtailing overheads. We demonstrate our approach on a RISC-V microcontroller SoC. SEU fault-tolerance is assessed with simulation-based fault injection. Overheads are assessed with full physical implementation. Tolerance to over 99.9% of faults in both RTL and implemented netlist is demonstrated. Furthermore, the design exhibits 22% lower implementation overhead compared to a single global fault-tolerance method, such as fine-grained triplication.

Paper

Share this book

Add to My Shelf

Language Selector

MBRLGlobalSearch

Language Selector

Catalogue Search | MBRL

Search Results Heading

Explore the vast range of titles available.

MBRLSearchResults

MBRLHappinessMeter