Catalogue Search | MBRL

The Sunway TaihuLight supercomputer: system and applications

by Haohuan FU, Junfeng LIAO, Jinzhe YANG, Lanning WANG, Zhenya SONG, Xiaomeng HUANG, Chao YANG, Wei XUE, Fangfang LIU, Fangli QIAO, Wei ZHAO, Xunqiang YIN, Chaofeng HOU, Chenglong ZHANG, Wei GE, Jian ZHANG, Yangang WANG, Chunbo ZHOU, Guangwen YANG in Central processing units, Computation, Computer memory

2016

The Sunway TaihuLight supercomputer is the world's first system with a peak performance greater than 100 PFlops. In this paper, we provide a detailed introduction to the TaihuLight system. In contrast with other existing heterogeneous supercomputers, which include both CPUs and PCIe-connected many-core accelerators (NVIDIA GPU or Intel Xeon Phi), the computing power of TaihuLight is provided by a homegrown many-core SW26010 CPU that includes both the management processing elements (MPEs) and computing processing elements (CPEs) in one chip. With 260 processing elements in one CPU, a single SW26010 provides a peak performance of over three TFlops. To alleviate the memory bandwidth bottleneck in most applications, each CPE comes with a scratch pad memory, which serves as a user-controlled cache. To support the parallelization of programs on the new many-core architecture, in addition to the basic C/C++ and Fortran compilers, the system provides a customized Sunway OpenACC tool that supports the OpenACC 2.0 syntax. This paper also reports our preliminary efforts on developing and optimizing applications on the TaihuLight system, focusing on key application domains, such as earth system modeling, ocean surface wave modeling, atomistic simulation, and phase-field simulation.

Journal Article

Experimental realization of single-shot nonadiabatic holonomic gates in nuclear spins

by Hang Li, Yang Liu, GuiLu Long in Astronomy, Circuit design, Classical and Continuum Physics

2017

Nonadiabatic holonomic quantum computation has received increasing attention due to its robustness against control errors. However, all previous schemes have had to use at least two sequentially implemented gates to realize a general one-qubit gate. Based on two recent reports, we construct two Hamiltonians and experimentally realize nonadiabatic holonomic gates by a single-shot implementation in a two-qubit nuclear magnetic resonance (NMR) system. Two noncommuting one-qubit holonomic gates, rotating about two different axes, are implemented by evolving a work qubit and an ancillary qubit nonadiabatically, following a specially designed quantum circuit. Using a sequence compiler developed for NMR quantum information processors, we optimize the whole pulse sequence, minimizing the total error of the implementation. Finally, all the nonadiabatic holonomic gates reach high unattenuated experimental fidelities of over 98%.

Journal Article

Research on functional verification method processor model built by Chisel

by CHEN, Fu , WANG, Miao , WU, Lening in ARM architecture , Chisel , processor model verification

2023

With the increasing complexity of hardware design, verification has become a major difficulty in chip design. To shorten the overall design time, a method is needed to quickly find design errors during verification, which takes up a large share of the design process. The design under test is ARMChisel, a processor model compatible with the ARM V4 instruction set architecture (ISA). The processor model is built with Chisel, a new hardware construction language, and is a highly complex hardware design. Based on this embedded processor model: ① a random instruction generator supporting all instructions of the ARM V4 ISA is designed to speed up the generation of test stimuli; ② based on the characteristics of Chisel, four verification stages are designed for the processor model under test (primary verification at the Chisel level, rapid coverage verification, directed test verification, and complex application verification) to ensure that the expected coverage is achieved; ③ a test platform based on the embedded processor model is built in both the Chisel environment and the Verilog environment. The test platform can quickly and accurately find and locate errors while collecting coverage, which improves verification speed. Finally, FPGA (field programmable gate array) acceleration is used to speed up the verification of large-scale application programs and shorten the verification cycle.
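The idea of a random instruction generator can be illustrated with a minimal Python sketch. This is not the authors' generator: the mnemonic subset, the register range, and the names `MNEMONICS`, `REGISTERS`, `random_instruction`, and `generate_program` are all assumptions made for illustration, and the sketch emits only three-register data-processing instructions as assembly text rather than covering the full ARM V4 ISA.

```python
import random

# Hypothetical subset of ARM data-processing mnemonics (assumption for this
# sketch; the generator described in the abstract covers the whole ARM V4 ISA).
MNEMONICS = ["ADD", "SUB", "AND", "ORR", "EOR"]

# r0-r12 only, so the sketch never writes to sp, lr, or pc (assumption).
REGISTERS = [f"r{i}" for i in range(13)]


def random_instruction(rng):
    """Return one random three-register instruction as assembly text."""
    op = rng.choice(MNEMONICS)
    rd, rn, rm = (rng.choice(REGISTERS) for _ in range(3))
    return f"{op} {rd}, {rn}, {rm}"


def generate_program(count, seed=0):
    """Generate `count` random instructions; a fixed seed makes runs repeatable."""
    rng = random.Random(seed)
    return [random_instruction(rng) for _ in range(count)]


if __name__ == "__main__":
    for line in generate_program(5):
        print(line)
```

Seeding the generator is a deliberate choice here: a failing random test can then be replayed exactly, which matters when the goal is to locate an error rather than only detect it.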

Journal Article

Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture

by 郑方, 李宏亮, 吕晖, 过锋, 许晓红, 谢向辉 in Architecture (computers), Artificial Intelligence, Bandwidths

2015

Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and shared memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, an efficient register-level communication mechanism, and a fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.

Journal Article

Drosha and Dicer: Slicers cut from the same cloth

by Sisi Li, Dinshaw J Patel in 631/337/384/331, 631/45/535, 631/45/612/1242

2016

DROSHA and its partner DGCR8 form a heterotrimeric complex named Microprocessor, which is essential for microRNA biogenesis. A recent study by Kwon et al. in Cell reveals the structure of a DROSHA construct in complex with the C-terminal region of DGCR8, thereby unveiling the topology of and interactions between components of the Microprocessor and providing insights into its 'ruler'-based cleavage activity and function.

Journal Article

Darwin: a neuromorphic hardware co-processor based on Spiking Neural Networks

by Juncheng SHEN, De MA, Zonghua GU, Ming ZHANG, Xiaolei ZHU, Xiaoqiang XU, Qi XU, Yangjing SHEN, Gang PAN in Artificial neural networks, CMOS technology, Computer Science

2016

Broadly speaking, the goal of neuromorphic engineering is to build computer systems that mimic the brain. A Spiking Neural Network (SNN) is a type of biologically inspired neural network that performs information processing based on discrete-time spikes, unlike a traditional Artificial Neural Network (ANN). Hardware implementation of SNNs is necessary for achieving high performance and low power. We present the Darwin Neural Processing Unit (NPU), a neuromorphic hardware co-processor based on SNN implemented with digital logic, supporting a maximum of 2048 neurons, 2048² = 4194304 synapses, and 15 possible synaptic delays. The Darwin NPU was fabricated in standard 180 nm CMOS technology with an area of 5 × 5 mm² and a 70 MHz clock frequency in the worst case. It consumes 0.84 mW/MHz with a 1.8 V power supply for typical applications. Two prototype applications are used to demonstrate the performance and efficiency of the hardware implementation.
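The figures quoted in this abstract can be checked with a few lines of Python. The sketch below rests on two assumptions that the abstract does not state: that the synapse count is the neuron count squared because every neuron may connect to every neuron, and that the per-MHz power figure (given for typical applications) can simply be multiplied by the 70 MHz clock (given as a worst case) to get a rough power estimate. The function names are hypothetical.

```python
def synapse_count(neurons):
    """Synapses under the assumed all-to-all connectivity: neurons squared."""
    return neurons * neurons


def estimated_power_mw(mw_per_mhz, clock_mhz):
    """Rough power estimate in mW, assuming power scales linearly with clock."""
    return mw_per_mhz * clock_mhz


if __name__ == "__main__":
    # Values taken from the abstract: 2048 neurons, 0.84 mW/MHz, 70 MHz.
    print(synapse_count(2048))
    print(round(estimated_power_mw(0.84, 70), 1))
```

Under these assumptions the chip would draw on the order of tens of milliwatts, which is only an indicative figure and not a measurement reported in the abstract.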

Journal Article

g-good-neighbor conditional diagnosability of star graph networks under PMC model and MM* model

by Wang, Shiying , Wang, Zhenhua , Wang, Mujiangshan in Fault diagnosis , Graph theory , Multiprocessing

2017

Diagnosability of a multiprocessor system is an important topic of study. S. L. Peng, C. K. Lin, J. J. M. Tan, and L. H. Hsu [Appl. Math. Comput., 2012, 218(21): 10406-10412] proposed a new measure for fault diagnosis of the system, called the g-good-neighbor conditional diagnosability, which requires every fault-free node to have at least g fault-free neighbors. As a famous topological structure of interconnection networks, the n-dimensional star graph S_n has many good properties. In this paper, we establish the g-good-neighbor conditional diagnosability of S_n under the PMC model and MM* model.
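For readers unfamiliar with the objects involved, the following Python sketch builds the star graph and checks the g-good-neighbor condition on a given fault set. It assumes the usual definition of the n-dimensional star graph, which the abstract does not spell out: vertices are the permutations of 1..n, and two vertices are adjacent when one is obtained from the other by swapping the first symbol with the symbol in some other position. The function names `star_graph` and `is_g_good_neighbor` are hypothetical, and the sketch does not compute diagnosability itself.

```python
from itertools import permutations


def star_graph(n):
    """Adjacency dict of the star graph on permutations of 1..n.

    Each vertex is a tuple; its neighbors are obtained by swapping the
    first symbol with the symbol at each of the other n - 1 positions.
    """
    graph = {}
    for p in permutations(range(1, n + 1)):
        neighbors = []
        for i in range(1, n):
            q = list(p)
            q[0], q[i] = q[i], q[0]
            neighbors.append(tuple(q))
        graph[p] = neighbors
    return graph


def is_g_good_neighbor(graph, faulty, g):
    """True if every fault-free vertex has at least g fault-free neighbors."""
    faulty = set(faulty)
    return all(
        sum(nb not in faulty for nb in nbrs) >= g
        for v, nbrs in graph.items()
        if v not in faulty
    )


if __name__ == "__main__":
    s4 = star_graph(4)
    print(len(s4))  # number of vertices, n! for this definition
    print(is_g_good_neighbor(s4, [(1, 2, 3, 4)], 2))
```

With this definition the graph has n! vertices and each vertex has n - 1 neighbors, so a single faulty vertex leaves each of its neighbors with n - 2 fault-free neighbors.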

Journal Article

An Intra-Server Interconnect Fabric for Heterogeneous Computing

by 曹政, 刘小丽, 李强, 刘小兵, 王展, 安学军 in Artificial Intelligence, Computation, Computer Science

2014

With the increasing diversity of application needs and computing units, servers with heterogeneous processors are becoming more and more widespread. However, the conventional SMP/ccNUMA server architecture introduces a communication bottleneck between heterogeneous processors and only uses heterogeneous processors as coprocessors, which limits the efficiency and flexibility of using heterogeneous processors. To solve this problem, this paper proposes an intra-server interconnect fabric that supports both intra-server peer-to-peer interconnection and I/O resource sharing among heterogeneous processors. By connecting processors and I/O devices with the proposed fabric, heterogeneous processors can communicate directly with each other and run in stand-alone mode with shared intra-server resources. We design the proposed fabric by extending the de facto system I/O bus protocol PCIe (Peripheral Component Interconnect Express) and implement it with a single chip, cZodiac. By making full use of PCIe's original advantages, the interconnection and the I/O sharing mechanism are lightweight and efficient. Evaluations carried out on both the FPGA (Field Programmable Gate Array) prototype and the cycle-accurate simulator demonstrate that our design is feasible and scalable. In addition, our design is suitable not only for the heterogeneous server but also for the high-density server.

Journal Article

Preface

by Wen-Guang Chen in Digital signal processing, Network routing, Computer architecture

2017

Dataflow architecture is a kind of computer architecture that contrasts with the traditional von Neumann architecture, or control flow architecture. Although it has not yet been commercially successful in the general-purpose processor market, the concepts of dataflow have been used in many areas such as digital signal processing, network routing and scientific computing, as well as in parallel computing frameworks.

Journal Article

High-speed visual target tracking with mixed rotation invariant description and skipping searching

by Yongxing YANG, Jie YANG, Zhongxing ZHANG, Liyuan LIU, Nanjian WU in Actuators, Algorithms, Boundary conditions

2017

This paper proposes a novel high-speed visual target tracking system based on mixed rotation invariant description (MRID) and a skipping searching method. MRID is a novel rotation invariant description of texture and edge information using annular histograms and the dominant direction. It overcomes the rotation variance and heavy computation of conventional LBP-HOG feature description. The skipping searching method used in tracking can remarkably decrease the computation time by avoiding repeated searching operations. The proposed tracking system contains an image sensor, a hierarchical vision processor and an actuator with 2 degrees of freedom (DOF). The vision processor integrates processors with pixel- and row-level parallelism to speed up the tracking algorithm. Experimental results show that the proposed system can run the tracking algorithm at over 1000 fps on images with 750 × 480 resolution.

Journal Article
