Catalogue Search | MBRL

DGX-A100 Face to Face DGX-2—Performance, Power and Thermal Behavior Evaluation

by Jansík, Branislav , Špeťko, Matej , Vysocký, Ondřej in Artificial intelligence , Communication , DGX-2

2021

Nvidia is a leading producer of GPUs for high-performance computing and artificial intelligence, delivering top performance and energy efficiency. We present a performance, power consumption, and thermal behavior analysis of the new Nvidia DGX-A100 server, equipped with eight A100 GPUs based on the Ampere microarchitecture. The results are compared against the previous generation of the server, the Nvidia DGX-2, based on Tesla V100 GPUs. We developed a synthetic benchmark to measure the raw performance of floating-point computing units, including Tensor Cores. Furthermore, thermal stability was investigated. In addition, a Dynamic Voltage and Frequency Scaling (DVFS) analysis was performed to determine the most energy-efficient configuration of the GPUs executing workloads of various arithmetic intensities. Under the energy-optimal configuration, the A100 GPU reaches an efficiency of 51 GFLOPS/W for double-precision workloads and 91 GFLOPS/W for Tensor Core double-precision workloads, which makes the A100 the most energy-efficient server accelerator for scientific simulations on the market.

Journal Article

A universal method for designing low-power carbon nanotube FET-based multiple-valued logic circuits

by Doostaregan, Akbar , Moaiyeri, Mohammad Hossein , Navi, Keivan in Applied sciences , binary gates , carbon nanotube field effect transistors

2013

This study presents new low-power multiple-valued logic (MVL) circuits for nanoelectronics. These carbon nanotube field-effect transistor (CNTFET)-based MVL circuits exploit the unique characteristics of the CNTFET device, such as the ability to set the desired threshold voltages by choosing appropriate nanotube diameters and the equal carrier mobility of the P- and N-type devices. These characteristics make CNTFETs very suitable for designing high-performance multiple-Vth circuits. The proposed MVL circuits are designed based on the conventional CMOS architecture and by utilising inherently binary gates. Moreover, each of the proposed CNTFET-based ternary circuits includes all the possible types of ternary logic, that is, negative, positive and standard, in one structure. The method proposed in this study is a universal technique for designing MVL circuits with any arbitrary number of logic levels, without static power dissipation. The results of the simulations, conducted using Synopsys HSPICE with 32 nm CNTFET technology, demonstrate improvements in power consumption, energy efficiency, robustness and, specifically, static power dissipation with respect to other state-of-the-art ternary and quaternary circuits.
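The three types of ternary logic named in the abstract (negative, positive and standard) are usually illustrated with the three ternary inverters. The following is a minimal behavioural sketch, assuming the common encoding of logic levels as 0, 1 and 2; the function names are illustrative and nothing here models the CNTFET circuits themselves.

```python
# Behavioural sketch of the three ternary inverter types.
# Assumption: logic levels are encoded as 0 (low), 1 (middle), 2 (high).

def sti(x: int) -> int:
    """Standard ternary inverter: 0 -> 2, 1 -> 1, 2 -> 0."""
    return 2 - x

def nti(x: int) -> int:
    """Negative ternary inverter: only a low input yields a high output."""
    return 2 if x == 0 else 0

def pti(x: int) -> int:
    """Positive ternary inverter: only a high input yields a low output."""
    return 0 if x == 2 else 2

def truth_table():
    """Rows of (input, NTI, STI, PTI) for the three logic levels."""
    return [(x, nti(x), sti(x), pti(x)) for x in (0, 1, 2)]

if __name__ == "__main__":
    for row in truth_table():
        print(row)
```

The three functions differ only in how they treat the middle level, which is why a single structure can expose all three outputs.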

Journal Article

Amdahl's law in the context of heterogeneous many-core systems – a survey

by Xia, Fei , Rafiev, Ashur , Yakovlev, Alex in Amdahl's law onwards , Communication , computation capability improvement index

2020

For over 50 years, Amdahl's Law has been the hallmark model for reasoning about performance bounds for homogeneous parallel computing resources. As heterogeneous, many-core parallel resources continue to permeate into the modern server and embedded domains, there has been growing interest in promulgating realistic extensions and assumptions in keeping with newer use cases. This study aims to provide a comprehensive review of the purviews and insights provided by the extensive body of work related to Amdahl's law to date, focusing on computation speedup. The authors show that a significant portion of these studies has looked into analysing the scalability of the model considering both workload and system heterogeneity in real-world applications. The focus has been to improve the definition and semantic power of the two key parameters in the original model: the parallel fraction (f) and the computation capability improvement index (n). More recently, researchers have shown normal-form and multi-fraction extensions that can account for wider ranges of heterogeneity, validated on many-core systems running realistic workloads. Speedup models from Amdahl's law onwards have seen a wide range of uses, such as the optimisation of system execution, and these uses are even more important with the advent of the heterogeneous many-core era.
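For reference, the original model that these extensions build on can be written in a few lines. This is a minimal sketch of the classic Amdahl speedup, using the abstract's symbols f (parallel fraction) and n (computation capability improvement index); the function name is illustrative, and none of the heterogeneous extensions surveyed in the article are modelled.

```python
def amdahl_speedup(f: float, n: float) -> float:
    """Classic Amdahl speedup: 1 / ((1 - f) + f / n).

    f: fraction of the workload that can be parallelised (0 <= f <= 1)
    n: improvement factor applied to the parallel part (n > 0)
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 / ((1.0 - f) + f / n)

if __name__ == "__main__":
    # The serial fraction caps the speedup at 1 / (1 - f), however large n gets.
    for n in (2, 8, 64, 1024):
        print(n, round(amdahl_speedup(0.9, n), 3))
```

With f = 0.9 the speedup can never exceed 10, which is the bound that the later models revisit.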

Journal Article

Adventures Beyond Amdahl’s Law: How Power-Performance Measurement and Modeling at Scale Drive Server and Supercomputer Design

by Cameron, Kirk W. in Artificial Intelligence , Computer Science , Data Structures and Information Theory

2023

Amdahl’s Law painted a bleak picture for large-scale computing. The implication was that parallelism was limited and therefore so was potential speedup. While Amdahl’s contribution was seminal and important, it drove others vested in parallel processing to define more clearly why large-scale systems are critical to our future and how they fundamentally provide opportunities for speedup beyond Amdahl’s predictions. In the early 2000s, much like Amdahl, we predicted dire consequences for large-scale systems due to power limits. While our early work was often dismissed, the implications were clear to some: power would ultimately limit performance. In this retrospective, we discuss how power-performance measurement and modeling at scale led to contributions that have driven server and supercomputer design for more than a decade. While the influence of these techniques is now indisputable, we discuss their connections, limits and additional research directions necessary to continue the performance gains our industry is accustomed to.

Journal Article

HEALERS: a heterogeneous energy-aware low-overhead real-time scheduler

by Devaraj, Rajesh , Sarkar, Arnab , Moulik, Sanjay in Algorithms , Deadlines , DVFS

2019

Devising energy-efficient scheduling strategies for real-time periodic tasks on heterogeneous platforms is a challenging and computationally demanding problem. This study proposes a low-overhead heuristic strategy called HEALERS for dynamic voltage and frequency scaling (DVFS)-cum-dynamic power management (DPM) enabled energy-aware scheduling of a set of periodic tasks executing on a heterogeneous multi-core system. The presented strategy first applies deadline-partitioning to obtain a set of distinct time-slices. At each time-slice boundary, the following three-phase operation is applied to obtain a schedule for the next time-slice. First, it computes the fragments of the execution demands of all tasks on each of the different processing cores in the platform. Next, it generates a schedule for each task on one or more processing cores such that the total execution demand of all tasks is satisfied. Finally, HEALERS applies DVFS and DPM on all processing cores so that energy consumption within the time-slice is minimised without jeopardising the execution requirements of the scheduled tasks. Experimental results show that the proposed scheme not only achieves appreciable energy savings relative to the state of the art (5–42% on average) but also enables a significant improvement in resource utilisation (as high as 58%).
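The deadline-partitioning step can be illustrated in isolation. The sketch below assumes integer periods, implicit deadlines (each deadline equals the task's period) and a synchronous release at time 0; these assumptions, and the function names, are not taken from the abstract. It covers only the computation of time-slice boundaries, not the three scheduling phases.

```python
from functools import reduce
from math import gcd

def hyperperiod(periods):
    """Least common multiple of all task periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)

def deadline_partition(periods):
    """Return consecutive time-slices (start, end) over one hyperperiod.

    Assumption: every task is released at t = 0 and its deadline equals its
    period, so slice boundaries are the multiples of each period.
    """
    hyper = hyperperiod(periods)
    boundaries = sorted({k * p for p in periods for k in range(hyper // p + 1)})
    return list(zip(boundaries, boundaries[1:]))

if __name__ == "__main__":
    # Two hypothetical tasks with periods 4 and 6.
    print(deadline_partition([4, 6]))
```

Within each slice no task deadline occurs, so a scheduler only has to guarantee each task's share of work by the slice end.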

Journal Article

P-EdgeCoolingMode: an agent-based performance aware thermal management unit for DVFS enabled heterogeneous MPSoCs

by McDonald-Maier, Klaus Dieter , Dey, Somdip , Singh, Amit Kumar in agent‐based performance aware thermal management unit , Algorithms , cooling

2019

Thermal cycling, as well as spatial and thermal gradients, affects the lifetime reliability and performance of heterogeneous Multi-Processor Systems-on-Chips (MPSoCs). Conventional temperature management techniques are not intelligent enough to account for performance, energy efficiency and the operating temperature of the system together. In this study, the authors propose a novel light-weight thermal management mechanism (P-EdgeCoolingMode) in the form of an intelligent software agent, which monitors and regulates the operating temperature of the CPU cores to improve the reliability of the system while meeting performance requirements. P-EdgeCoolingMode proactively monitors performance and takes the necessary action based on the user's demand, making the proposed methodology highly suitable for implementation on existing as well as conceptual Edge devices utilising heterogeneous MPSoCs with dynamic voltage and frequency scaling (DVFS) capabilities. The authors validated their methodology on the Odroid-XU4 MPSoC and the Huawei P20 Lite (HiSilicon Kirin 659 MPSoC). Compared with the state of the art, P-EdgeCoolingMode reduced the operating temperature while improving performance and reducing power consumption for the chosen test cases. For applications with demanding performance requirements, P-EdgeCoolingMode was found to reduce power consumption by up to 30.62% in comparison with existing state-of-the-art power management methodologies.

Journal Article

Evaluation of simulator tools and power-aware scheduling model for wireless sensor networks

by Chéour, Rym , Abid, Mohamed , Kanoun, Olfa in Access control , Accuracy , autonomous WSN

2017

The sharp increase in the performance of wireless sensor networks (WSNs) has increased their power requirements. However, with limited battery lifetime, it is increasingly difficult to deploy many more sensors with today's solutions. Therefore, autonomous WSNs that operate without human intervention or an external power supply are needed. To this end, this study proposes an effective strategy for reducing energy consumption while respecting time constraints, through a power-aware model based on dynamic voltage and frequency scaling and dynamic power management adapted to WSNs, and on a global Earliest Deadline First scheduler. To select the most suitable simulator in which to integrate and simulate the developed models, more than 25 existing WSN simulators are outlined and evaluated. On the basis of this comparative study, the authors chose the simulation tool for real-time multiprocessor scheduling (STORM) to validate their work because of its multiple advantages.
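The selection rule behind a global Earliest Deadline First scheduler is simple enough to sketch. The example below assumes m identical processors and a hypothetical `Job` record; it shows only how jobs are chosen at one scheduling point and does not model the DVFS or DPM parts of the authors' approach, nor the STORM simulator.

```python
import heapq
from typing import List, NamedTuple

class Job(NamedTuple):
    """Hypothetical job record used only for this illustration."""
    name: str
    deadline: int   # absolute deadline
    remaining: int  # remaining execution time

def global_edf_pick(ready: List[Job], m: int) -> List[Job]:
    """Choose up to m ready jobs with the earliest absolute deadlines."""
    return heapq.nsmallest(m, ready, key=lambda job: job.deadline)

if __name__ == "__main__":
    ready = [Job("a", 10, 2), Job("b", 4, 1), Job("c", 7, 3)]
    for job in global_edf_pick(ready, m=2):
        print(job.name, job.deadline)
```

Jobs that are not picked stay in the ready set and compete again at the next scheduling point, where they may preempt running jobs with later deadlines.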

Journal Article

Simplifying and implementing service level objectives for stream parallelism

by Danelutto, Marco , Fernandes, Luiz G , Vogel, Adriano in Adaptive systems , Algorithms , Design

2020

Increasing attention has been given to providing service level objectives (SLOs) in stream processing applications, owing to performance and energy requirements and to the need to limit resource usage while improving system utilization. Since current and next-generation computing systems intrinsically offer parallel architectures, software has to exploit that parallelism. Implementing and meeting SLOs in existing applications is not a trivial task for application programmers, since the software development process requires, besides exploiting parallelism, the implementation of autonomic algorithms or strategies. This is a system-oriented programming approach that requires managing multiple knobs and sensors (e.g., the number of threads to use or the clock frequency of the cores) so that the system can self-adapt at runtime. In this work, we introduce a new and simpler way to define SLOs in the application’s source code, abstracting away from the programmer all the details of the self-adaptive system implementation. The application programmer specifies which parts of the code to parallelize and the related SLOs that should be enforced. To reach this goal, source-to-source code transformation rules are implemented in our compiler, which automatically generates self-adaptive strategies to enforce, at runtime, the user-expressed objectives. The experiments showed promising results, with simpler, effective, and efficient SLO implementations for real-world applications.

Journal Article

Exploiting memory allocations in clusterised many-core architectures

by Reis, Ricardo , Gamatié, Abdoulaye , Ost, Luciano in Architecture , centralised shared memory solution , Data exchange

2019

Power efficiency has become the most important feature required of future embedded systems. Modern designs, like those released on mobile devices, reveal that clusterisation is the way to improve energy efficiency. However, such architectures are still limited by the memory subsystem (i.e. memory latency problems). This work investigates an alternative approach that exploits on-chip data locality to a large extent, through distributed shared memory systems that permit efficient reuse of on-chip mapped data in clusterised many-core architectures. First, this work reviews the current literature on memory allocations and explores the limitations of cluster-based many-core architectures. Then, several memory allocations are introduced and benchmarked in terms of scalability, performance and energy against the conventional centralised shared memory solution, in order to reveal which memory allocation is the most appropriate for future mobile architectures. The results show that distributed shared memory allocations bring performance gains and opportunities to reduce energy consumption.

Journal Article

Joint optimisation for time consumption and energy consumption of multi-application and load balancing of cloudlets in mobile edge computing

by Pan, Wenjie , Wang, Jiabin , Huang, Hualong in Augmented reality , Cloud computing , Computation offloading

2020

Mobile edge computing (MEC) is an effective supporting technology that can overcome some shortcomings of cloud computing. To alleviate the conflict between the limited capacity of cloudlets and the need of mobile devices (MDs) to reduce execution latency and power consumption, a user-oriented MEC use case named computation offloading is considered. Computation offloading allows the MEC to adapt effectively to the resources of cloudlets and MDs in different environments, and it is very beneficial to the development of the internet of things. Because the computation capabilities of the MDs are finite and the resources of cloudlets are heterogeneous and limited, a three-objective model is established to jointly optimise the time consumption and energy consumption of MDs and the load balancing of cloudlets. Technically, the authors propose an effective multi-user, multi-application computation offloading method for the multi-cloudlet environment, based on an improved non-dominated sorting genetic algorithm III. Finally, comprehensive experiments and analysis were conducted to validate the effectiveness and efficiency of the proposed method.
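The core idea of non-dominated sorting can be shown with a short Pareto-dominance check over three minimised objectives. This is a minimal sketch, assuming each candidate offloading plan is summarised as a hypothetical tuple (time, energy, load imbalance); it extracts only the first non-dominated front and omits the reference-point selection and the authors' improvements to NSGA-III.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, float, float]  # (time, energy, load imbalance), all minimised

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def first_front(points: List[Objectives]) -> List[Objectives]:
    """Return the points that no other point dominates (the Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

if __name__ == "__main__":
    # Hypothetical objective values for four candidate offloading plans.
    candidates = [(1, 5, 3), (2, 2, 2), (3, 6, 4), (2, 2, 3)]
    print(first_front(candidates))
```

A genetic algorithm of this family keeps the non-dominated candidates preferentially when building the next generation, which is how the three objectives are optimised jointly rather than collapsed into one score.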

Journal Article
