1,699 results for "COMPUTERS / Hardware / Chips"
Limits on fundamental limits to computation
To evaluate the promise of potential computing technologies, this review examines a wide range of fundamental limits (to performance, power consumption, size and cost), from the device level to the system level.

Probing the limits to computing power: Computers have evolved at a remarkable rate, with improvements over the past fifty years roughly in line with Gordon Moore's prescient observation that the number of transistors in a dense integrated circuit would double approximately every two years. The rate of 'Moore scaling' is slowing down and other physical limits are looming, but new technologies such as carbon nanotubes, graphene and quantum computation are on the way. In this Review, Igor Markov takes a fresh look at the fundamental limits at various levels, from devices to complete systems, and compares loose and tight limits. Markov argues that the study of fundamental limits to computation can lead to new insights for emerging technologies.

An indispensable part of our personal and working lives, computing has also become essential to industries and governments. Steady improvements in computer hardware have been supported by periodic doubling of transistor densities in integrated circuits over the past fifty years. Such Moore scaling now requires ever-increasing efforts, stimulating research in alternative hardware and stirring controversy. To help evaluate emerging technologies and increase our understanding of integrated-circuit scaling, here I review fundamental limits to computation in the areas of manufacturing, energy, physical space, design and verification effort, and algorithms. To outline what is achievable in principle and in practice, I recapitulate how some limits were circumvented, and compare loose and tight limits. Engineering difficulties encountered by emerging technologies may indicate yet unknown limits.
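As a rough illustration of the Moore-scaling arithmetic the review builds on (a back-of-the-envelope sketch, not a calculation from the paper): doubling every two years compounds to enormous growth over fifty years.

    # Hypothetical back-of-the-envelope for Moore scaling: doubling the
    # transistor count every 2 years for 50 years gives 2**25 growth.
    years = 50
    doubling_period = 2                  # years per doubling (Moore's observation)
    growth = 2 ** (years / doubling_period)
    print(f"{growth:,.0f}x growth over {years} years")   # 33,554,432x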
Equivalent-accuracy accelerated neural-network training using analogue memory
Neural-network training can be slow and energy intensive, owing to the need to transfer the weight data for the network between conventional digital memory chips and processor chips. Analogue non-volatile memory can accelerate the neural-network training algorithm known as backpropagation by performing parallelized multiply–accumulate operations in the analogue domain at the location of the weight data. However, the classification accuracies of such in situ training using non-volatile-memory hardware have generally been less than those of software-based training, owing to insufficient dynamic range and excessive weight-update asymmetry. Here we demonstrate mixed hardware–software neural-network implementations that involve up to 204,900 synapses and that combine long-term storage in phase-change memory, near-linear updates of volatile capacitors and weight-data transfer with ‘polarity inversion’ to cancel out inherent device-to-device variations. We achieve generalization accuracies (on previously unseen data) equivalent to those of software-based training on various commonly used machine-learning test datasets (MNIST, MNIST-backrand, CIFAR-10 and CIFAR-100). The computational energy efficiency of 28,065 billion operations per second per watt and throughput per area of 3.6 trillion operations per second per square millimetre that we calculate for our implementation exceed those of today’s graphical processing units by two orders of magnitude. This work provides a path towards hardware accelerators that are both fast and energy efficient, particularly on fully connected neural-network layers.

Analogue-memory-based neural-network training using non-volatile-memory hardware augmented by circuit simulations achieves the same accuracy as software-based training but with much improved energy efficiency and speed.
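A loose NumPy sketch of why the 'polarity inversion' trick helps: if each analogue device contributes a fixed error to the weight it stores, programming the weight with both signs and averaging cancels that error. The device model below is invented for illustration and is far simpler than the paper's phase-change-memory-plus-capacitor cell.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2048                          # toy layer, not the paper's 204,900 synapses
    w = rng.normal(0.0, 1.0, n)       # ideal weights
    err = rng.normal(0.0, 0.05, n)    # fixed per-device error (toy assumption)
    x = rng.normal(0.0, 1.0, n)       # input activations

    ideal = x @ w
    naive = x @ (w + err)             # each analogue device adds its own error

    # Toy 'polarity inversion': program the same device pair with +w and -w,
    # then take half the difference; the fixed device error cancels.
    pos = x @ (w + err)
    neg = x @ (-w + err)
    inverted = 0.5 * (pos - neg)      # equals x @ w in this idealized model

    print(abs(naive - ideal), abs(inverted - ideal))   # second value ~ 0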
A compute-in-memory chip based on resistive random-access memory
Realizing increasingly complex artificial intelligence (AI) functionalities directly on edge devices calls for unprecedented energy efficiency of edge hardware. Compute-in-memory (CIM) based on resistive random-access memory (RRAM)[1] promises to meet such demand by storing AI model weights in dense, analogue and non-volatile RRAM devices, and by performing AI computation directly within RRAM, thus eliminating power-hungry data movement between separate compute and memory[2–5]. Although recent studies have demonstrated in-memory matrix-vector multiplication on fully integrated RRAM-CIM hardware[6–17], it remains a goal for a RRAM-CIM chip to simultaneously deliver high energy efficiency, versatility to support diverse models and software-comparable accuracy. Although efficiency, versatility and accuracy are all indispensable for broad adoption of the technology, the inter-related trade-offs among them cannot be addressed by isolated improvements on any single abstraction level of the design. Here, by co-optimizing across all hierarchies of the design from algorithms and architecture to circuits and devices, we present NeuRRAM, a RRAM-based CIM chip that simultaneously delivers versatility in reconfiguring CIM cores for diverse model architectures, energy efficiency that is two-times better than previous state-of-the-art RRAM-CIM chips across various computational bit-precisions, and inference accuracy comparable to software models quantized to four-bit weights across various AI tasks, including accuracy of 99.0 percent on MNIST[18] and 85.7 percent on CIFAR-10[19] image classification, 84.7-percent accuracy on Google speech command recognition[20], and a 70-percent reduction in image-reconstruction error on a Bayesian image-recovery task.

A compute-in-memory neural-network inference accelerator based on resistive random-access memory simultaneously improves energy efficiency, flexibility and accuracy compared with existing hardware by co-optimizing across all hierarchies of the design.
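A minimal sketch of the arithmetic pattern a CIM core accelerates: a matrix-vector multiplication with weights quantized to four bits, matching the precision of the accuracy comparison above. The symmetric quantizer below is a generic textbook choice, assumed for illustration rather than taken from the chip.

    import numpy as np

    def quantize_4bit(w):
        # Generic symmetric 4-bit quantization to integers in [-7, 7].
        scale = np.abs(w).max() / 7.0
        return np.round(w / scale).astype(np.int8), scale

    rng = np.random.default_rng(1)
    W = rng.normal(0.0, 1.0, (64, 128))   # toy weights 'stored in RRAM'
    x = rng.normal(0.0, 1.0, 128)

    Wq, scale = quantize_4bit(W)
    # On the chip this product is an analogue current summation inside the
    # RRAM array; here we only emulate the arithmetic digitally.
    y = (Wq @ x) * scale
    print(np.linalg.norm(y - W @ x) / np.linalg.norm(W @ x))  # relative error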
Experimentally validated memristive memory augmented neural network with efficient hashing and similarity search
Lifelong on-device learning is a key challenge for machine intelligence, and it requires learning from few, often single, samples. Memory-augmented neural networks have been proposed to achieve this goal, but the memory module must be stored in off-chip memory, which heavily limits their practical use. In this work, we experimentally validate that all the different structures of a memory-augmented neural network can be implemented in a fully integrated memristive crossbar platform with an accuracy that closely matches digital hardware. The demonstration is supported by new functions implemented in crossbars, including crossbar-based content-addressable memory and locality-sensitive hashing that exploits the intrinsic stochasticity of memristor devices. Simulations show that such an implementation can be efficiently scaled up for one-shot learning on more complex tasks. This successful demonstration paves the way for practical on-device lifelong learning and opens possibilities for novel attention-based algorithms that were not possible on conventional hardware.

Memory-augmented neural networks for lifelong on-device learning are bottlenecked by the limited bandwidth of conventional hardware. Here, the authors demonstrate an efficient in-memristor realization with close-to-software accuracy, supported by hashing and similarity search in crossbars.
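A compact sketch of locality-sensitive hashing followed by a content-addressable-memory-style nearest-neighbour lookup, the two functions the paper maps onto crossbars. Random-hyperplane hashing is a standard LSH family used here as an illustrative stand-in; the paper derives its randomness from memristor stochasticity instead.

    import numpy as np

    rng = np.random.default_rng(2)
    dim, n_bits, n_items = 64, 32, 1000

    planes = rng.normal(size=(n_bits, dim))    # random hyperplanes
    memory = rng.normal(size=(n_items, dim))   # stored memory entries

    def lsh(v):
        # Binary signature: which side of each hyperplane the vector falls on.
        return (planes @ v > 0).astype(np.uint8)

    codes = np.array([lsh(m) for m in memory])

    query = memory[42] + 0.1 * rng.normal(size=dim)   # noisy view of item 42
    # CAM-style search: the entry with the smallest Hamming distance wins.
    best = int(np.argmin((codes != lsh(query)).sum(axis=1)))
    print(best)                                        # expected: 42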
Embedded SOPC design with NIOS II processor and VHDL examples
The book is divided into four major parts. Part I covers HDL constructs and synthesis of basic digital circuits. Part II provides an overview of embedded software development with an emphasis on low-level I/O access and drivers. Part III demonstrates the design and development of hardware and software for several complex I/O peripherals, including a PS2 keyboard and mouse, a graphic video controller, an audio codec, and an SD (secure digital) card. Part IV provides three case studies of the integration of hardware accelerators, including a custom GCD (greatest common divisor) circuit, a Mandelbrot set fractal circuit, and an audio synthesizer based on DDFS (direct digital frequency synthesis) methodology. The book utilizes FPGA devices, the Nios II soft-core processor, and development platforms from Altera Co., one of the two main FPGA manufacturers. Altera has a generous university program that provides free software and discounted prototyping boards for educational institutions (details at www.altera.com/university). The two main educational prototyping boards are known as DE1 ($99) and DE2 ($269). All experiments can be implemented and tested with these boards. A board combined with this book becomes a "turn-key" solution for SoPC design experiments and projects. Most HDL and C code in the book is device independent and can be adapted to other prototyping boards as long as the board has a similar I/O configuration.
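For a sense of what the custom GCD accelerator in Part IV computes, here is a behavioral model of the binary GCD algorithm, which uses only shifts, compares and subtractions and therefore maps naturally onto a small hardware datapath. This Python sketch is an assumption for illustration; the book's actual circuit is described in VHDL and may differ in detail.

    def gcd_binary(a: int, b: int) -> int:
        # Binary GCD: shifts and subtractions only (hardware friendly).
        if a == 0:
            return b
        if b == 0:
            return a
        shift = 0
        while (a | b) & 1 == 0:       # factor out common powers of two
            a, b, shift = a >> 1, b >> 1, shift + 1
        while a & 1 == 0:
            a >>= 1
        while b:
            while b & 1 == 0:
                b >>= 1
            if a > b:
                a, b = b, a
            b -= a
        return a << shift

    print(gcd_binary(48, 36))   # 12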
Spike-based dynamic computing with asynchronous sensing-computing neuromorphic chip
By mimicking the neurons and synapses of the human brain and employing spiking neural networks on neuromorphic chips, neuromorphic computing offers a promising path to energy-efficient machine intelligence. How to borrow high-level dynamic mechanisms of the brain to help neuromorphic computing achieve its energy advantages is a fundamental issue. This work presents an application-oriented, algorithm-software-hardware co-designed neuromorphic system to address it. First, we design and fabricate an asynchronous chip called “Speck”, a sensing-computing neuromorphic system on chip. With a processor resting power as low as 0.42 mW, Speck satisfies the hardware requirement of dynamic computing: no input consumes no energy. Second, we uncover the “dynamic imbalance” in spiking neural networks and develop an attention-based framework that achieves the algorithmic requirement of dynamic computing: varied inputs consume energy with large variance. Together, we demonstrate a neuromorphic system with real-time power as low as 0.70 mW. This work exhibits the promising potential of neuromorphic computing, with its asynchronous, event-driven, sparse and dynamic nature.

Mimicking the high-level abstractions of the brain to achieve energy advantages is a fundamental issue in neuromorphic computing. Here, the authors fabricate an asynchronous chip and demonstrate a high-accuracy neuromorphic system with power consumption of 0.7 mW.
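A toy illustration of the event-driven principle behind "no input consumes no energy": state is updated only when a spike event arrives, so an empty input stream triggers no computation at all. The leaky integrate-and-fire model and all constants below are generic assumptions, not Speck's neuron circuit.

    import math

    def lif_event_driven(spike_times, weight=0.6, tau=20.0, threshold=1.0):
        # Leaky integrate-and-fire updated only on events; with no events,
        # the loop body never runs and no work is done.
        v, t_last, out = 0.0, 0.0, []
        for t in spike_times:
            v *= math.exp(-(t - t_last) / tau)   # decay since the last event
            v += weight
            if v >= threshold:
                out.append(t)                    # emit an output spike
                v = 0.0
            t_last = t
        return out

    print(lif_event_driven([1, 2, 3, 50, 51, 52]))   # [2, 51]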
Comprehensive analysis of energy efficiency and performance of ARM and RISC-V SoCs
Over the past few years, ARM has been the dominant player in embedded systems and System-on-Chips (SoCs). With the emergence of hardware platforms based on the RISC-V architecture, a practical comparison focusing on their energy efficiency and performance is needed. In this study, our goal is to comprehensively evaluate the energy efficiency and performance of ARM and RISC-V SoCs in three different systems. We conduct benchmark tests to measure power consumption and overall system performance. The results of our study are valuable to developers and researchers looking for the most appropriate hardware platform for energy-efficient computing applications. Our observations suggest that RISC-V Instruction Set Architecture (ISA) implementations may demonstrate lower average power consumption than ARM, but this does not automatically imply a superior performance-per-watt ratio for RISC-V. The primary focus of the study is to evaluate and compare these ISA implementations, aiming to identify potential areas for enhancing their energy efficiency. Furthermore, to ensure the practical applicability of our findings, we use the computational fluid dynamics software OpenFOAM. This step serves to validate the relevance of our results in real-world scenarios and allows us to fine-tune execution parameters based on the insights gained from our initial study. By doing so, we aim not only to provide meaningful conclusions but also to investigate the transferability of our results to practical applications. Our analysis also scrutinizes the capabilities of these SoCs when handling non-synthetic software workloads, thereby broadening the scope of our evaluation.
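The caveat that lower average power does not imply better performance per watt reduces to a simple ratio, as the hypothetical numbers below show (invented for illustration; they are not measurements from the study).

    # Hypothetical boards: the lower-power SoC can still lose on perf/watt.
    boards = {
        "SoC A": {"power_w": 3.0, "ops_per_s": 9.0e9},
        "SoC B": {"power_w": 5.0, "ops_per_s": 20.0e9},
    }
    for name, m in boards.items():
        gops_per_w = m["ops_per_s"] / m["power_w"] / 1e9
        print(name, f"{gops_per_w:.2f} GOPS/W")
    # SoC A draws less power (3 W vs 5 W) yet reaches only 3.00 GOPS/W,
    # while SoC B reaches 4.00 GOPS/W.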
A high-speed reusable quantized hardware accelerator design for CNN on constrained edge device
The convolutional neural network (CNN) is a deep learning technique used in many applications. Recent years have seen a rise in demand for real-time CNN implementations on various embedded devices with restricted resources. CNN models can be implemented on field-programmable gate arrays (FPGAs) to ensure flexible programmability and to speed up the development process. However, CNN acceleration is hampered by complex computations, limited bandwidth, and limited on-chip memory. In this paper, a reusable quantized hardware architecture is proposed to accelerate deep CNN models by addressing these issues. Twenty-five processing elements are employed to compute the convolutions in the CNN model. Pipelining, loop unrolling, and array partitioning are used to increase the speed of computation in both the convolutional and fully connected layers. The design is tested with MNIST handwritten-digit image classification on a low-cost, low-memory Xilinx PYNQ-Z2 system-on-chip edge device. The inference speed of the proposed hardware design is 92.7% higher than that of an INTEL core3 CPU, 90.7% higher than a Haswell core2 CPU, 87.7% higher than an NVIDIA Tesla K80 GPU, and 84.9% better than a conventional hardware accelerator with one processing element. The proposed quantized architecture achieves 4.4 GOP/s without compromising accuracy, twice the performance of the conventional architecture.
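A minimal sketch of the quantized, integer-only convolution arithmetic such an accelerator performs. The 8-bit symmetric scheme is a common choice assumed here, since the abstract does not state the bit width; hardware would unroll and pipeline these loops across the twenty-five processing elements.

    import numpy as np

    def quantize(x, bits=8):
        # Uniform symmetric quantization to signed integers.
        scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
        return np.round(x / scale).astype(np.int32), scale

    rng = np.random.default_rng(3)
    img = rng.normal(size=(8, 8)).astype(np.float32)
    ker = rng.normal(size=(3, 3)).astype(np.float32)
    iq, si = quantize(img)
    kq, sk = quantize(ker)

    # Integer sliding-window convolution (the part mapped to the PEs).
    out = np.zeros((6, 6), dtype=np.int64)
    for r in range(6):
        for c in range(6):
            out[r, c] = np.sum(iq[r:r + 3, c:c + 3] * kq)

    # Rescale once at the end; compare against float for one output pixel.
    print((out * si * sk)[0, 0], float(np.sum(img[0:3, 0:3] * ker)))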
Cycle-accurate multi-FPGA platform for accelerated emulation of large on-chip networks
On-chip networks (NoCs) have become a popular choice for designing large multiprocessor architectures. Software-based emulation is often used for design verification. However, if the design under consideration is sufficiently large, software-based emulation becomes impractically slow. To avoid this limitation, multi-FPGA emulation was introduced, in which multiple interconnected FPGAs collectively emulate a single circuit. The number of external FPGA pins is often insufficient to emulate large network-on-chip designs accurately, so the overall emulation frequency has to be severely limited. We propose a method for accelerating multi-FPGA emulation by reducing the amount of data the FPGAs need to transmit to each other. To achieve cycle-accurate emulation in the absence of constant transmission latency, synchronous messaging is implemented. The proposed method was tested on a functioning prototype. It is shown that our method can accelerate multi-FPGA emulation of large NoC designs by several orders of magnitude.
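A schematic model of the synchronous-messaging idea, heavily simplified: each partition publishes its boundary signals, a barrier guarantees everyone sees the same emulated cycle, and only then does every partition advance one cycle. All classes and numbers below are toy stand-ins for real FPGA partitions and links.

    class Partition:
        # Toy stand-in for one FPGA emulating a slice of the NoC.
        def __init__(self, name):
            self.name, self.state = name, 1

        def step(self, boundary):
            # Trivial 'logic': fold in the neighbours' published signals.
            self.state += sum(v for n, v in boundary.items() if n != self.name)
            return self.state

    def emulate(parts, n_cycles):
        # Lockstep: exchange boundary signals, then all step one cycle.
        # The barrier makes the result independent of physical link latency,
        # which is what keeps the emulation cycle-accurate.
        boundary = {p.name: p.state for p in parts}
        for _ in range(n_cycles):
            snapshot = dict(boundary)      # everyone sees the same cycle
            boundary = {p.name: p.step(snapshot) for p in parts}
        return boundary

    print(emulate([Partition("fpga0"), Partition("fpga1")], 3))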
Variable and Extended Precision (VRP) Accelerator Implemented in a 22 nm SoC
Linear solvers and eigensolvers are at the heart of high-performance scientific computing applications. Among them, iterative projection methods are preferred to direct algorithms for large problems because of their lower memory usage, but they are prone to roundoff errors. Using an enhanced working precision inside the linear computing kernels mitigates this issue and accelerates convergence. Today, to go beyond 80 bits of precision, the only option is to use software libraries, which are very slow. We introduce the variable and extended precision accelerator (VRP), a RISC-V accelerator implemented on a system-on-chip (SoC) in GF22FDX technology. The VRP supports floating-point computations with significand widths from 2 to 512 bits. The accelerator delivers an average 19.25× application speedup compared to the well-known MPFR software library running on a 2400 MHz Intel Xeon processor. Additionally, extended precision facilitates the convergence of linear solvers for problems that would otherwise fail to converge, and reduces energy-to-solution.

We implemented a variable and extended precision accelerator, with support for up to 512-bit precision, in a 22 nm system-on-chip. We demonstrate its ability to improve the convergence of linear solvers by running conjugate gradient methods on the chip at an 800 MHz working frequency. By measuring power consumption, we demonstrate that extended precision has the potential to improve not only time-to-solution but also energy-to-solution on difficult-to-solve scientific computing problems.
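A small taste of why extended working precision matters inside iterative kernels, using Python's decimal module as a stand-in for the slow software path (such as MPFR) that the VRP replaces with hardware. The example is a generic cancellation case, not one of the paper's benchmark problems.

    from decimal import Decimal, getcontext

    # In 64-bit floats the added 1.0 is absorbed and then lost entirely.
    print((1e20 + 1.0) - 1e20)       # 0.0

    # With extended working precision the small term survives.
    getcontext().prec = 40           # ~132 significand bits (the VRP goes to 512)
    x = (Decimal(10) ** 20 + 1) - Decimal(10) ** 20
    print(x)                         # 1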