Catalogue Search | MBRL
Search Results Heading
Explore the vast range of titles available.
MBRLSearchResults
-
DisciplineDiscipline
-
Is Peer ReviewedIs Peer Reviewed
-
Item TypeItem Type
-
SubjectSubject
-
YearFrom:-To:
-
More FiltersMore FiltersSourceLanguage
Done
Filters
Reset
7
result(s) for
"Nicopoulos, C"
Sort by:
On the Effects of Process Variation in Network-on-Chip Architectures
2010
The advent of diminutive technology feature sizes has led to escalating transistor densities. Burgeoning transistor counts are casting a dark shadow on modern chip design: global interconnect delays are dominating gate delays and affecting overall system performance. Networks-on-Chip (NoC) are viewed as a viable solution to this problem because of their scalability and optimized electrical properties. However, on-chip routers are susceptible to another artifact of deep submicron technology, Process Variation (PV). PV is a consequence of manufacturing imperfections, which may lead to degraded performance and even erroneous behavior. In this work, we present the first comprehensive evaluation of NoC susceptibility to PV effects, and we propose an array of architectural improvements in the form of a new router design-called SturdiSwitch-to increase resiliency to these effects. Through extensive reengineering of critical components, SturdiSwitch provides increased immunity to PV while improving performance and increasing area and power efficiency.
Journal Article
Low-Power Data Streaming in Systolic Arrays with Bus-Invert Coding and Zero-Value Clock Gating
2023
Systolic Array (SA) architectures are well suited for accelerating matrix multiplications through the use of a pipelined array of Processing Elements (PEs) communicating with local connections and pre-orchestrated data movements. Even though most of the dynamic power consumption in SAs is due to multiplications and additions, pipelined data movement within the SA constitutes an additional important contributor. The goal of this work is to reduce the dynamic power consumption associated with the feeding of data to the SA, by synergistically applying bus-invert coding and zero-value clock gating. By exploiting salient attributes of state-of-the-art CNNs, such as the value distribution of the weights, the proposed SA applies appropriate encoding only to the data that exhibits high switching activity. Similarly, when one of the inputs is zero, unnecessary operations are entirely skipped. This selectively targeted, application-aware encoding approach is demonstrated to reduce the dynamic power consumption of data streaming in CNN applications using Bfloat16 arithmetic by 1%-19%. This translates to an overall dynamic power reduction of 6.2%-9.4%.
Reduced-Precision Floating-Point Arithmetic in Systolic Arrays with Skewed Pipelines
by
Filippas, D
,
Peltekis, C
,
Nicopoulos, C
in
Deep learning
,
Floating point arithmetic
,
Hardware
2023
The acceleration of deep-learning kernels in hardware relies on matrix multiplications that are executed efficiently on Systolic Arrays (SA). To effectively trade off deep-learning training/inference quality with hardware cost, SA accelerators employ reduced-precision Floating-Point (FP) arithmetic. In this work, we demonstrate the need for new pipeline organizations to reduce latency and improve energy efficiency of reduced-precision FP operators for the chained multiply-add operation imposed by the structure of the SA. The proposed skewed pipeline design reorganizes the pipelined operation of the FP multiply-add units to enable new forwarding paths for the exponent logic, which allow for parallel execution of the pipeline stages of consecutive PEs. As a result, the latency of the matrix multiplication operation within the SA is significantly reduced with minimal hardware cost, thereby yielding an energy reduction of 8% and 11% for the examined state-of-the-art CNNs.
ArrayFlex: A Systolic Array Architecture with Configurable Transparent Pipelining
by
Filippas, D
,
Peltekis, C
,
Nicopoulos, C
in
Arrays
,
Artificial neural networks
,
Configurations
2023
Convolutional Neural Networks (CNNs) are the state-of-the-art solution for many deep learning applications. For maximum scalability, their computation should combine high performance and energy efficiency. In practice, the convolutions of each CNN layer are mapped to a matrix multiplication that includes all input features and kernels of each layer and is computed using a systolic array. In this work, we focus on the design of a systolic array with configurable pipeline with the goal to select an optimal pipeline configuration for each CNN layer. The proposed systolic array, called ArrayFlex, can operate in normal, or in shallow pipeline mode, thus balancing the execution time in cycles and the operating clock frequency. By selecting the appropriate pipeline configuration per CNN layer, ArrayFlex reduces the inference latency of state-of-the-art CNNs by 11%, on average, as compared to a traditional fixed-pipeline systolic array. Most importantly, this result is achieved while using 13%-23% less power, for the same applications, thus offering a combined energy-delay-product efficiency between 1.4x and 1.8x.
Central versus peripheral sympathetic inhibition on arterial blood pressure and heart rate variability in essential hypertension
OR-31
Key Words: Sympathetic Inhibition, Heart Rate Variability, Antihypertensive Treatment
Journal Article
IndexMAC: A Custom RISC-V Vector Instruction to Accelerate Structured-Sparse Matrix Multiplications
by
Peltekis, C
,
Titopoulos, V
,
Nicopoulos, C
in
Array processors
,
Artificial neural networks
,
Hardware
2023
Structured sparsity has been proposed as an efficient way to prune the complexity of modern Machine Learning (ML) applications and to simplify the handling of sparse data in hardware. The acceleration of ML models - for both training and inference - relies primarily on equivalent matrix multiplications that can be executed efficiently on vector processors or custom matrix engines. The goal of this work is to incorporate the simplicity of structured sparsity into vector execution, thereby accelerating the corresponding matrix multiplications. Toward this objective, a new vector index-multiply-accumulate instruction is proposed, which enables the implementation of lowcost indirect reads from the vector register file. This reduces unnecessary memory traffic and increases data locality. The proposed new instruction was integrated in a decoupled RISCV vector processor with negligible hardware cost. Extensive evaluation demonstrates significant speedups of 1.80x-2.14x, as compared to state-of-the-art vectorized kernels, when executing layers of varying sparsity from state-of-the-art Convolutional Neural Networks (CNNs).
The Case for Asymmetric Systolic Array Floorplanning
by
Peltekis, C
,
Nicopoulos, C
,
Dimitrakopoulos, G
in
Artificial neural networks
,
Asymmetry
,
Deep learning
2023
The widespread proliferation of deep learning applications has triggered the need to accelerate them directly in hardware. General Matrix Multiplication (GEMM) kernels are elemental deep-learning constructs and they inherently map onto Systolic Arrays (SAs). SAs are regular structures that are well-suited for accelerating matrix multiplications. Typical SAs use a pipelined array of Processing Elements (PEs), which communicate with local connections and pre-orchestrated data movements. In this work, we show that the physical layout of SAs should be asymmetric to minimize wirelength and improve energy efficiency. The floorplan of the SA adjusts better to the asymmetric widths of the horizontal and vertical data buses and their switching activity profiles. It is demonstrated that such physically asymmetric SAs reduce interconnect power by 9.1% when executing state-of-the-art Convolutional Neural Network (CNN) layers, as compared to SAs of the same size but with a square (i.e., symmetric) layout. The savings in interconnect power translate, in turn, to 2.1% overall power savings.