Catalogue Search | MBRL

The Sunway TaihuLight supercomputer: system and applications

by Haohuan FU, Junfeng LIAO, Jinzhe YANG, Lanning WANG, Zhenya SONG, Xiaomeng HUANG, Chao YANG, Wei XUE, Fangfang LIU, Fangli QIAO, Wei ZHAO, Xunqiang YIN, Chaofeng HOU, Chenglong ZHANG, Wei GE, Jian ZHANG, Yangang WANG, Chunbo ZHOU, Guangwen YANG in Central processing units, Computation, Computer memory

2016

The Sunway TaihuLight supercomputer is the world's first system with a peak performance greater than 100 PFlops. In this paper, we provide a detailed introduction to the TaihuLight system. In contrast with other existing heterogeneous supercomputers, which include both CPU processors and PCIe-connected many-core accelerators (NVIDIA GPU or Intel Xeon Phi), the computing power of TaihuLight is provided by a homegrown many-core SW26010 CPU that includes both the management processing elements (MPEs) and computing processing elements (CPEs) in one chip. With 260 processing elements in one CPU, a single SW26010 provides a peak performance of over three TFlops. To alleviate the memory bandwidth bottleneck in most applications, each CPE comes with a scratch pad memory, which serves as a user-controlled cache. To support the parallelization of programs on the new many-core architecture, in addition to the basic C/C++ and Fortran compilers, the system provides a customized Sunway OpenACC tool that supports the OpenACC 2.0 syntax. This paper also reports our preliminary efforts on developing and optimizing applications on the TaihuLight system, focusing on key application domains, such as earth system modeling, ocean surface wave modeling, atomistic simulation, and phase-field simulation.

Journal Article

Towards a verified compiler prototype for the synchronous language SIGNAL

by Zhibin YANG, Jean-Paul BODEVEIX, Mamoun FILALI, Kai HU, Yongwang ZHAO, Dianfu MA in architecture analysis and design language (AADL), Avionics, Compilers

2016

SIGNAL belongs to the family of synchronous languages, which are widely used in the design of safety-critical real-time systems such as avionics, space systems, and nuclear power plants. This paper reports a compiler prototype for SIGNAL. Compared with the existing SIGNAL compiler, we propose a new intermediate representation (named S-CGA, a variant of clocked guarded actions), to integrate more synchronous programs into our compiler prototype in the future. The front-end of the compiler, i.e., the translation from SIGNAL to S-CGA, is presented. In addition, the proof of semantics preservation is mechanized in the theorem prover Coq. Moreover, we present the back-end of the compiler, including sequential code generation and multi-threaded code generation with time-predictable properties. With the rising importance of multi-core processors in safety-critical embedded systems or cyber-physical systems (CPS), there is a growing need for model-driven generation of multi-threaded code and for mapping it onto multi-core platforms. We propose a time-predictable multi-core architecture model in architecture analysis and design language (AADL), and map the multi-threaded code to this model.

Journal Article

Paradoxes in the Textual Development of the Laozi: A Closer Examination of Chapters Eight and Twenty-Four

by CUI Xiaojiao in Ambiguity, Beida Laozi, Laozi (philosopher)

2017

In light of the recently published Western Han period bamboo-slip Laozi, now in the collection of Peking University, this paper explores several paradoxes in the textual development of the Laozi. Specifically, it presents two examples suggesting that, because the wording in the Laozi was originally intended to be ambiguous and paradoxical, compilers or commentators modified some of the paradoxes during the transmission of the text to make better sense of them. Eventually those modifications came to replace the original text. The first part of this article examines certain contrasting differences in Chapter Eight among the Beida Laozi, the Mawangdui Laozi, and the received Laozi. The second part examines certain other contrasting differences in Chapter Twenty-Four among these same versions. This paper argues that these differences among the various versions are not the product of transcription error; rather, they are the result of compilers or commentators who revised these passages against their earliest versions in order to make the meaning clearer and more explicit.

Journal Article

MPtostream: an OpenMP compiler for CPU-GPU heterogeneous parallel systems

by YANG XueJun, TANG Tao, WANG GuiBin, JIA Jia, XU XinHai in Central processing units, China, Compilers

2012

In light of GPUs' powerful floating-point operation capacity, heterogeneous parallel systems incorporating general purpose CPUs and GPUs have become a focus of research in high performance computing (HPC). However, due to the complexity of programming on GPUs, porting a large number of existing scientific computing applications to heterogeneous parallel systems remains a big challenge. The OpenMP programming interface is widely adopted on multi-core CPUs in the field of scientific computing. To effectively inherit existing OpenMP applications and reduce the porting cost, we extend OpenMP with a group of compiler directives, which explicitly divide tasks among the CPU and the GPU, and map time-consuming computing fragments to run on the GPU, thus dramatically simplifying the porting. We have designed and implemented MPtoStream, a compiler of the extended OpenMP for AMD's stream processing GPUs. Our experimental results show that programming with the extended directives requires less than 11% modification relative to programming with OpenMP and achieves significant speedups ranging from 3.1 to 17.3 on a heterogeneous system, incorporating an Intel Xeon E5405 CPU and an AMD FireStream 9250 GPU, over the execution on the Xeon CPU alone.

Journal Article

Formal verification of synchronous data-flow program transformations toward certified compilers

by Van Chan NGO, Jean-Pierre TALPIN, Thierry GAUTIER, Paul Le GUERNIC, Loic BESNARD in certified compiler, Compilers, Computer Science

2013

Translation validation was invented in the 1990s by Pnueli et al. as a technique to formally verify the correctness of code generators. Rather than certifying the code generator or exhaustively qualifying it, translation validators attempt to verify that program transformations preserve semantics. In this work, we adopt this approach to formally verify that the clock semantics and data dependence are preserved during compilation by the Signal compiler. Translation validation is implemented for every compilation phase, from the initial phase to the final phase where the executable code is generated, by proving that the transformation in each phase of the compiler preserves the semantics. We represent the clock semantics and the data dependence of a program and its transformed counterpart as first-order formulas, which are called clock models and synchronous dependence graphs (SDGs), respectively. We then introduce clock refinement and dependence refinement relations, which express the preservation of clock semantics and dependence as relations on clock models and SDGs, respectively. Our validator does not require any instrumentation or modification of the compiler, nor any rewriting of the source program.

Journal Article

SWIP Prediction: Complexity-Effective Indirect-Branch Prediction Using Pointers

by 谢子超, 佟冬, 黄明凯, 史秦青, 程旭 in Accuracy, Artificial Intelligence, Buffers

2012

Predicting indirect-branch targets has become a performance bottleneck for many applications. Previous high-performance indirect-branch predictors usually require significant hardware storage or additional compiler support, which increases the complexity of the processor front-end or the compilers. This paper proposes a complexity-effective indirect-branch prediction mechanism, called the Set-Way Index Pointing (SWIP) prediction. It stores multiple indirect-branch targets in different branch target buffer (BTB) entries, whose set indices and way locations are treated as set-way index pointers. These pointers are stored in the existing branch-direction predictor. SWIP prediction reuses the branch direction predictor to provide such pointers, and then accesses the pointed BTB entries for the predicted indirect-branch target. Our evaluation shows that SWIP prediction could achieve attractive performance improvement without requiring large dedicated storage or additional compiler support. It improves the indirect-branch prediction accuracy by 36.5% compared to that of a commonly-used BTB, resulting in average performance improvement of 18.56%. Its energy consumption is also reduced by 14.34% over that of the baseline.

Journal Article

OpenMP compiler for distributed memory architectures

by WANG Jue, HU ChangJun, ZHANG JiLin, LI JianJiang in Algorithms, Arrays, Computational chemistry

2010

OpenMP is an emerging industry standard for shared memory architectures. While OpenMP has advantages in ease of use and incremental programming, message passing is still the most widely used programming model for distributed memory architectures. How to effectively extend OpenMP to distributed memory architectures has been an active research topic. This paper proposes an OpenMP system, called KLCoMP, for distributed memory architectures. Based on the "partially replicating shared arrays" memory model, we propose an algorithm for shared array recognition based on inter-procedural analysis, an optimization technique based on the producer/consumer relationship, and a communication generation technique for nonlinear references. We evaluate the performance on nine benchmarks which cover computational fluid dynamics, integer sorting, molecular dynamics, earthquake simulation, and computational chemistry. The average scalability achieved by the KLCoMP version is close to that achieved by the MPI version. We compare the performance of our translated programs with that of versions generated for Omni+SCASH, LLCoMP, and OpenMP (Purdue), and find that parallel applications (especially irregular applications) translated by KLCoMP can achieve better performance than the other versions.

Journal Article
