Catalogue Search | MBRL

PULP-NN: accelerating quantized neural networks on parallel ultra-low-power RISC-V processors

by Garofalo, Angelo; Rusci, Manuele; Rossi, Davide

2020

We present PULP-NN, an optimized computing library for a parallel ultra-low-power tightly coupled cluster of RISC-V processors. The key innovation in PULP-NN is a set of kernels for quantized neural network inference, targeting byte and sub-byte data types, down to INT-1, tuned for the recent trend toward aggressive quantization in deep neural network inference. The proposed library exploits both the digital signal processing extensions available in the PULP RISC-V processors and the cluster’s parallelism, achieving up to 15.5 MACs/cycle on INT-8 and improving performance by up to 63× with respect to a sequential implementation on a single RISC-V core implementing the baseline RV32IMC ISA. Using PULP-NN, a CIFAR-10 network on an octa-core cluster runs in 30× and 19.6× fewer clock cycles than the current state-of-the-art ARM CMSIS-NN library, running on STM32L4 and STM32H7 MCUs, respectively. The proposed library, when running on a GAP-8 processor, outperforms by 36.8× and by 7.45× the execution on energy-efficient MCUs such as STM32L4 and high-end MCUs such as STM32H7, respectively, when operating at the maximum frequency. The energy efficiency on GAP-8 is 14.1× higher than STM32L4 and 39.5× higher than STM32H7, at the maximum efficiency operating point. This article is part of the theme issue ‘Harmonizing energy-autonomous computing and intelligence’.
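To make the idea of sub-byte data types concrete, the following is a minimal Python sketch of storing signed 4-bit values two per byte and computing a dot product over them. It is an illustration only: the function names and the low-nibble-first layout are assumptions, not the PULP-NN API or its memory format.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first.

    The nibble order is an assumption made for this sketch.
    """
    if len(values) % 2 != 0:
        raise ValueError("need an even number of values")
    if any(v < -8 or v > 7 for v in values):
        raise ValueError("values must fit in a signed 4-bit range")
    packed = bytearray()
    for low, high in zip(values[0::2], values[1::2]):
        packed.append((low & 0xF) | ((high & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed):
    """Recover the signed 4-bit integers from a packed byte string."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


def dot_int4(a_packed, b_packed):
    """Multiply-accumulate over two packed INT-4 vectors."""
    return sum(x * y for x, y in zip(unpack_int4(a_packed), unpack_int4(b_packed)))


a = pack_int4([-8, 7, 3, -1])
b = pack_int4([1, 2, -3, 4])
result = dot_int4(a, b)
```

Four values occupy two bytes here instead of four, which is the storage saving that sub-byte kernels trade against the extra unpacking work.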

Journal Article

Neuromorphic Computing Using NAND Flash Memory Architecture With Pulse Width Modulation Scheme

by Lee, Jong-Ho; Lee, Sung-Tae in Algorithms, Bias, Circuits

2020

A novel operation scheme is proposed for high-density and highly robust neuromorphic computing based on NAND flash memory architecture. Analogue input is represented with time-encoded input pulse by pulse width modulation (PWM) circuit and 4-bit synaptic weight is represented with adjustable conductance of NAND cells. PWM scheme for analogue input value and proposed operation scheme are fully compatible with existing NAND flash memory architecture to implement neuromorphic system without additional change of memory architecture. Saturated current-voltage characteristic of NAND cells eliminates the effect of serial resistance of pass cells in a synaptic string and IR drop of metal wire resistance. Multiply-accumulate (MAC) operation of 4-bit weight and width-modulated input can be accomplished in a single input pulse eliminating the additional logic operation. In addition, effect of quantization training on the inference accuracy is investigated compared to post-training quantization with 4-bit weight. Finally, the low-variance conductance distribution of NAND cells obtained by read-verify-write (RVW) scheme achieves satisfying accuracy of 98.14% and 89.6% for the MNIST and CIFAR10 datasets, respectively.

Journal Article

Balanced Quantization: An Effective and Efficient Approach to Quantized Neural Networks

by Zhou, Shu-Chang; He, Qin-Yao; Wen, He in Artificial Intelligence, Bins, Complexity

2017

Quantized neural networks (QNNs), which use low bitwidth numbers for representing parameters and performing computations, have been proposed to reduce the computation complexity, storage size and memory usage. In QNNs, parameters and activations are uniformly quantized, such that the multiplications and additions can be accelerated by bitwise operations. However, distributions of parameters in neural networks are often imbalanced, such that the uniform quantization determined from extremal values may underutilize available bitwidth. In this paper, we propose a novel quantization method that can ensure the balance of distributions of quantized values. Our method first recursively partitions the parameters by percentiles into balanced bins, and then applies uniform quantization. We also introduce computationally cheaper approximations of percentiles to reduce the computation overhead introduced. Overall, our method improves the prediction accuracies of QNNs without introducing extra computation during inference, has negligible impact on training speed, and is applicable to both convolutional neural networks and recurrent neural networks. Experiments on standard datasets including ImageNet and Penn Treebank confirm the effectiveness of our method. On ImageNet, the top-5 error rate of our 4-bit quantized GoogLeNet model is 12.7%, which is superior to the state-of-the-arts of QNNs.
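The two-step idea in this abstract (equal-population bins first, uniform levels second) can be sketched in a few lines of Python. This is a simplified, rank-based stand-in for the recursive percentile partitioning, not the authors' implementation; the function name and the choice of output levels in (0, 1) are assumptions made for illustration.

```python
def balanced_quantize(weights, bits):
    """Quantize so that every level receives (about) the same number of weights.

    Step 1: rank the weights and split the ranking into 2**bits bins of equal
            population (equivalent to repeated median splits when the count
            divides evenly).
    Step 2: map bin index i to the evenly spaced level (i + 0.5) / 2**bits.
    """
    n_bins = 2 ** bits
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    levels = [0.0] * len(weights)
    for rank, idx in enumerate(order):
        bin_index = rank * n_bins // len(weights)
        levels[idx] = (bin_index + 0.5) / n_bins
    return levels


# Two outliers (5.0 and -4.0) would dominate a min/max-based uniform grid.
weights = [0.01, 0.02, 0.03, 5.0, -4.0, 0.0, 0.015, 0.025]
quantized = balanced_quantize(weights, bits=2)
```

With a uniform grid spanning the extremal values, most of these weights would collapse onto a single level; here each of the four levels is used by exactly two weights.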

Journal Article

Pattern Classification Using Quantized Neural Networks for FPGA-Based Low-Power IoT Devices

by Siddique, Abrar; Behera, Prangyadarsini; Delwar, Tahesin Samira in Accuracy, Algorithms, Artificial intelligence

2022

With the recent growth of the Internet of Things (IoT) and the demand for faster computation, quantized neural networks (QNNs) or QNN-enabled IoT can offer better performance than conventional convolutional neural networks (CNNs). With the aim of reducing memory access costs and increasing the computation efficiency, QNN-enabled devices are expected to transform numerous industrial applications with lower processing latency and power consumption. Another form of QNN is the binarized neural network (BNN), which uses two quantization levels. In this paper, CNN-, QNN-, and BNN-based pattern recognition techniques are implemented and analyzed on an FPGA. The FPGA hardware acts as an IoT device due to connectivity with the cloud, and QNN and BNN are considered to offer better performance in terms of low power and low resource use on hardware platforms. The CNN and QNN implementations are compared on their accuracy, weight bit error, ROC curve, and execution speed. The paper also discusses various approaches that can be deployed for optimizing various CNN and QNN models with additionally available tools. The work is performed on the Xilinx Zynq 7020 series Pynq Z2 board, which serves as our FPGA-based low-power IoT device. The MNIST and CIFAR-10 databases are considered for simulation and experimentation. The work shows that the accuracy is 95.5% and 79.22% for the MNIST and CIFAR-10 databases, respectively, for full precision (32-bit), and the execution time is 5.8 ms and 18 ms for the MNIST and CIFAR-10 databases, respectively, for full precision (32-bit).

Journal Article

Optimizing Data Flow in Binary Neural Networks

by Vorabbi, Lorenzo; Santi, Stefano; Maltoni, Davide in Accuracy, Approximation, Back propagation

2024

Binary neural networks (BNNs) can substantially accelerate a neural network’s inference time by substituting its costly floating-point arithmetic with bit-wise operations. Nevertheless, state-of-the-art approaches reduce the efficiency of the data flow in the BNN layers by introducing intermediate conversions from 1 to 16/32 bits. We propose a novel training scheme, denoted as BNN-Clip, that can increase the parallelism and data flow of the BNN pipeline; specifically, we introduce a clipping block that reduces the data width from 32 bits to 8. Furthermore, we decrease the internal accumulator size of a binary layer, usually kept using 32 bits to prevent data overflow, with no accuracy loss. Moreover, we propose an optimization of the batch normalization layer that reduces latency and simplifies deployment. Finally, we present an optimized implementation of the binary direct convolution for ARM NEON instruction sets. Our experiments show a consistent inference latency speed-up (up to 1.3× and 2.4× compared to two state-of-the-art BNN frameworks) while reaching an accuracy comparable with state-of-the-art approaches on datasets like CIFAR-10, SVHN, and ImageNet.
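As background for the bit-wise operations mentioned here, the sketch below shows the widely used XNOR-popcount formulation of a dot product between two vectors of +1/-1 values. It is a generic illustration in Python, not the BNN-Clip scheme or the ARM NEON implementation from the article; the helper names and the bit encoding (+1 as 1, -1 as 0) are assumptions.

```python
def pack_signs(values):
    """Encode a list of +1/-1 values as an integer bit mask (+1 -> 1, -1 -> 0)."""
    word = 0
    for position, value in enumerate(values):
        if value > 0:
            word |= 1 << position
    return word


def binary_dot(a_word, b_word, n):
    """Dot product of two packed +1/-1 vectors of length n.

    XNOR marks the positions where the signs agree; each agreement
    contributes +1 and each disagreement -1, so dot = 2 * matches - n.
    """
    mask = (1 << n) - 1
    matches = bin(~(a_word ^ b_word) & mask).count("1")
    return 2 * matches - n


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
packed_result = binary_dot(pack_signs(a), pack_signs(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
```

The packed result and the plain multiply-accumulate reference agree, while the packed version needs only one XOR, one NOT, and one population count per machine word.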

Journal Article

The Impact of 8- and 4-Bit Quantization on the Accuracy and Silicon Area Footprint of Tiny Neural Networks

by Obszarski, Paweł; Skierkowski, Marcel; Przychodny, Jakub in Accuracy, Application specific integrated circuits, Artificial intelligence

2025

In the field of embedded and edge devices, efforts have been made to make deep neural network models smaller due to the limited size of the available memory and the low computational efficiency. Typical model footprints are under 100 KB. However, for some applications, models of this size are too large. In low-voltage sensors, signals must be processed, classified or predicted with an order of magnitude smaller memory. Model downsizing can be performed by limiting the number of model parameters or quantizing their weights. These types of operations have a negative impact on the accuracy of the deep network. This study tested the effect of model downscaling techniques on accuracy. The main idea was to reduce neural network models to 3 k parameters or less. Tests were conducted on three different neural network architectures in the context of three separate research problems, modeling real tasks for small networks. The impact of the reduction in the accuracy of the network depends mainly on its initial size. For a network reduced from 40 k parameters, a decrease in accuracy of 16 percentage points was achieved, and for a network with 20 k parameters, a decrease of 8 points was achieved. To obtain the best results, knowledge distillation and quantization-aware training methods were used for training. Thanks to this, the accuracy of the 4-bit networks did not differ significantly from the 8-bit ones and their results were approximately four percentage points worse than those of the full precision networks. For the fully connected network, synthesis to ASIC (application-specific integrated circuit) was also performed to demonstrate the reduction in the silicon area occupied by the model. The 4-bit quantization limits the silicon area footprint by 90%.

Journal Article

Technological Vanguard: the outstanding performance of the LTY-CNN model for the early prediction of epileptic seizures

by Gao, Fei; Chen, Xing; Yu, Zhangjun in Accuracy, Algorithms, Analysis

2024

Background: Epilepsy is a common neurological disorder that affects approximately 60 million people worldwide. Characterized by unpredictable neural electrical activity abnormalities, it results in seizures with varying intensity levels. Electroencephalography (EEG), as a crucial technology for monitoring and predicting epileptic seizures, plays an essential role in improving the quality of life for people with epilepsy. Method: This study introduces an innovative deep learning model, a “lightweight triscale yielding convolutional neural network” (LTY-CNN), that is specifically designed for EEG signal analysis. The model integrates a parallel convolutional structure with a multihead attention mechanism to capture complex EEG signal features across multiple scales and enhance the efficiency achieved when processing time series data. The lightweight design of the LTY-CNN enables it to maintain high performance in environments with limited computational resources while preserving the interpretability and maintainability of the model. Results: In tests conducted on the SWEC-ETHZ and CHB-MIT datasets, the LTY-CNN demonstrated outstanding performance. On the SWEC-ETHZ dataset, the LTY-CNN achieved an accuracy of 99.9%, an area under the receiver operating characteristic curve (AUROC) of 0.99, a sensitivity of 99.9%, and a specificity of 98.8%. Furthermore, on the CHB-MIT dataset, it recorded an accuracy of 99%, an AUROC of 0.932, a sensitivity of 99.1%, and a specificity of 93.2%. These results signify the remarkable ability of the LTY-CNN to distinguish between epileptic seizures and nonseizure events. Compared to other existing epilepsy detection classifiers, the LTY-CNN attained higher accuracy and sensitivity. Conclusion: The high accuracy and sensitivity of the LTY-CNN model demonstrate its significant potential for epilepsy management, particularly in terms of predicting and mitigating epileptic seizures. Its value in personalized treatments and widespread clinical applications reflects the broad prospects of deep learning in the health care sector. This also highlights the crucial role of technological innovation in enhancing the quality of life experienced by patients.

Journal Article

Configurable Multi-Layer Perceptron-Based Soft Sensors on Embedded Field Programmable Gate Arrays: Targeting Diverse Deployment Goals in Fluid Flow Estimation

by Klann, Theodor Mario; Einhaus, Lukas; Qian, Chao in Analysis, Automation, Case studies

2025

This study presents a comprehensive workflow for developing and deploying Multi-Layer Perceptron (MLP)-based soft sensors on embedded FPGAs, addressing diverse deployment objectives. The proposed workflow extends our prior research by introducing greater model adaptability. It supports various configurations—spanning layer counts, neuron counts, and quantization bitwidths—to accommodate the constraints and capabilities of different FPGA platforms. The workflow incorporates a custom-developed, open-source toolchain ElasticAI.Creator that facilitates quantization-aware training, integer-only inference, automated accelerator generation using VHDL templates, and synthesis alongside performance estimation. A case study on fluid flow estimation was conducted on two FPGA platforms: the AMD Spartan-7 XC7S15 and the Lattice iCE40UP5K. For precision-focused and latency-sensitive deployments, a six-layer, 60-neuron MLP accelerator quantized to 8 bits on the XC7S15 achieved an MSE of 56.56, an MAPE of 1.61%, and an inference latency of 23.87 μs. Moreover, for low-power and energy-constrained deployments, a five-layer, 30-neuron MLP accelerator quantized to 8 bits on the iCE40UP5K achieved an inference latency of 83.37 μs, a power consumption of 2.06 mW, and an energy consumption of just 0.172 μJ per inference. These results confirm the workflow’s ability to identify optimal FPGA accelerators tailored to specific deployment requirements, achieving a balanced trade-off between precision, inference latency, and energy efficiency.
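The energy figure quoted for the iCE40UP5K deployment is consistent with the product of the reported power and latency. The short Python check below reproduces that arithmetic; it assumes the power is constant over the whole inference, which the abstract does not state explicitly, and the variable names are illustrative.

```python
# Figures reported in the abstract for the five-layer, 30-neuron MLP.
power_w = 2.06e-3       # 2.06 mW
latency_s = 83.37e-6    # 83.37 microseconds

# Energy per inference = power x time, converted from joules to microjoules.
energy_uj = power_w * latency_s * 1e6
energy_uj_rounded = round(energy_uj, 3)
```

Rounded to three decimals this gives 0.172 μJ, matching the value stated in the abstract.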

Journal Article

Compressing deep neural networks on FPGAs to binary and ternary precision with hls4ml

by Di Guglielmo, Giuseppe; Wu, Zhenbin; Summers, Sioni in Artificial neural networks, Circuit design, Consumption

2021

We present the implementation of binary and ternary neural networks in the hls4ml library, designed to automatically convert deep neural network models to digital circuits with field-programmable gate arrays (FPGA) firmware. Starting from benchmark models trained with floating point precision, we investigate different strategies to reduce the network's resource consumption by reducing the numerical precision of the network parameters to binary or ternary. We discuss the trade-off between model accuracy and resource consumption. In addition, we show how to balance between latency and accuracy by retaining full precision on a selected subset of network components. As an example, we consider two multiclass classification tasks: handwritten digit recognition with the MNIST data set and jet identification with simulated proton-proton collisions at the CERN Large Hadron Collider. The binary and ternary implementation has similar performance to the higher precision implementation while using drastically fewer FPGA resources.

Journal Article

Neuron-by-Neuron Quantization for Efficient Low-Bit QNN Training

by Limonova, Elena; Nikolaev, Dmitry; Sher, Artem in Accuracy, Datasets, Efficiency

2023

Quantized neural networks (QNNs) are widely used to achieve computationally efficient solutions to recognition problems. Overall, eight-bit QNNs have almost the same accuracy as full-precision networks, but work several times faster. However, the networks with lower quantization levels demonstrate inferior accuracy in comparison to their classical analogs. To solve this issue, a number of quantization-aware training (QAT) approaches were proposed. In this paper, we study QAT approaches for two- to eight-bit linear quantization schemes and propose a new combined QAT approach: neuron-by-neuron quantization with straight-through estimator (STE) gradient forwarding. It is suitable for quantizations with two- to eight-bit widths and eliminates significant accuracy drops during training, which results in better accuracy of the final QNN. We experimentally evaluate our approach on CIFAR-10 and ImageNet classification and show that it is comparable to other approaches for four to eight bits and outperforms some of them for two to three bits while being easier to implement. For example, the proposed approach to three-bit quantization of the CIFAR-10 dataset results in 73.2% accuracy, while baseline direct and layer-by-layer result in 71.4% and 67.2% accuracy, respectively. The results for two-bit quantization for ResNet18 on the ImageNet dataset are 63.69% for our approach and 61.55% for the direct baseline.

Journal Article
