Catalogue Search | MBRL

Parallel deblocking filter for HEVC on many-core processor

by Yan, Chenggang , Li, Liang , Dai, Qionghai in Applied sciences , Artificial intelligence , Coding

2014

High-efficiency video coding (HEVC) is the next-generation video coding standard. The deblocking filter (DF) accounts for a significant part of the HEVC decoder's complexity. A three-step parallel framework (TPF) proposed for the H.264/AVC DF is also suitable for HEVC, except for the third step. The third step of the TPF is replaced with a directed acyclic graph-based order. Experiments show that the proposed method achieves a substantially greater speedup than the state-of-the-art parallel method.
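A directed acyclic graph-based order can be illustrated with a generic level scheduler: tasks whose predecessors have all finished form one batch that may run in parallel. The sketch below is a minimal Python illustration using Kahn's algorithm. The function name `dag_levels`, the task identifiers, and the dependency edges are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def dag_levels(tasks, deps):
    """Group tasks into batches; a task only depends on tasks in earlier batches.

    tasks: iterable of hashable task ids (hypothetical, e.g. block coordinates)
    deps:  iterable of (before, after) pairs: `after` must wait for `before`
    """
    tasks = list(tasks)
    successors = defaultdict(list)
    indegree = {t: 0 for t in tasks}
    for before, after in deps:
        successors[before].append(after)
        indegree[after] += 1

    ready = [t for t in tasks if indegree[t] == 0]
    levels = []
    scheduled = 0
    while ready:
        levels.append(ready)          # everything in `ready` can run concurrently
        scheduled += len(ready)
        next_ready = []
        for t in ready:
            for s in successors[t]:
                indegree[s] -= 1
                if indegree[s] == 0:  # all predecessors of s are done
                    next_ready.append(s)
        ready = next_ready
    if scheduled != len(tasks):
        raise ValueError("dependency graph contains a cycle")
    return levels


# Hypothetical 2x2 block grid: each block waits for its upper and left neighbour.
blocks = [(0, 0), (0, 1), (1, 0), (1, 1)]
edges = [((0, 0), (0, 1)), ((0, 0), (1, 0)), ((0, 1), (1, 1)), ((1, 0), (1, 1))]
print(dag_levels(blocks, edges))
```

With these assumed dependencies, the two off-diagonal blocks land in the same batch, which is the kind of parallelism a DAG-based order exposes.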

Journal Article

Toward efficient structured-grid triangular solver on sunway many-core processors

by Liang, Jiabi , Hu, Zhengding , Shi, Jinliang in Algorithms , Compilers , Computer Science

2024

The sparse triangular solver (SpTRSV) is mostly used in scientific and engineering applications. The structured-grid triangular solver with regular dependencies (STRSV) is a special kind of SpTRSV. Some general SpTRSVs that disregard the regularity of the matrix are unsuitable for this problem. This paper proposes swStructTRSV, an efficient parallel algorithm for STRSV on the SW26010, a many-core processor independently designed in China. The algorithm makes full use of the fine-grained, low-latency communication of the SW26010 to reduce the waiting time for synchronization, maximizes the regularity of memory accesses to improve memory bandwidth, and overlaps memory access with computation. Moreover, the idea of the algorithm can be extended to incomplete LU (ILU) factorization because of its consistent dependencies. Experimental results on one core group of the SW26010 (64 cores arranged in an 8 × 8 network) show that swStructTRSV achieves an average speedup of over 30 relative to the sequential version. swStructTRSV on the SW26010 achieves speedups of 2.2 and 6.3 over the fast STRSV (fSpTRSV) previously implemented on the SW26010 and over MKL on an Intel Xeon Gold 6132, respectively. swStructTRSV also significantly outperforms cuSparse on an NVIDIA TITAN RTX in terms of overall execution time.
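The regular dependencies of a structured-grid triangular solve can be illustrated with a wavefront ordering. The sketch below is not swStructTRSV; it assumes a 2D grid where each unknown depends only on its west neighbour (i-1, j) and south neighbour (i, j-1), as in the lower-triangular part of a five-point stencil. The coefficient names `diag`, `west`, and `south` are hypothetical.

```python
def _solve_cell(x, i, j, diag, west, south, b):
    """Solve one row of L x = b once the west and south neighbours are known."""
    acc = b[i][j]
    if i > 0:
        acc -= west[i][j] * x[i - 1][j]
    if j > 0:
        acc -= south[i][j] * x[i][j - 1]
    x[i][j] = acc / diag[i][j]


def solve_sequential(diag, west, south, b):
    """Forward substitution in lexicographic order on an nx-by-ny grid."""
    nx, ny = len(b), len(b[0])
    x = [[0.0] * ny for _ in range(nx)]
    for i in range(nx):
        for j in range(ny):
            _solve_cell(x, i, j, diag, west, south, b)
    return x


def solve_wavefront(diag, west, south, b):
    """Same solve, ordered by anti-diagonals k = i + j.

    Cells on one anti-diagonal do not depend on each other, so the inner loop
    is the part a parallel implementation could distribute across cores.
    """
    nx, ny = len(b), len(b[0])
    x = [[0.0] * ny for _ in range(nx)]
    for k in range(nx + ny - 1):
        for i in range(max(0, k - ny + 1), min(nx, k + 1)):
            _solve_cell(x, i, k - i, diag, west, south, b)
    return x
```

Both orderings perform the same arithmetic per cell, so they produce identical results; only the wavefront version makes the available parallelism explicit.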

Journal Article

HDSAP: heterogeneity-aware dynamic scheduling algorithm to improve performance of nanoscale many-core processors for unknown workloads

by Kia, Keihaneh , Rajabzadeh, Amir in Algorithms , Communication , Compilers

2023

Processor performance growth has continued through increasing the number of processing cores on the chip and scaling down the feature size of transistors. However, in the nano era, side effects of scaling, such as induced heterogeneities in the performance, power, and soft error rate of identically designed cores, prevent the potential performance from being fully utilized. In this paper, we harness these side effects in shared-memory multicore processors with unknown workloads through a dynamic heuristic scheduling algorithm called HDSAP. HDSAP aims to maximize performance, measured by the average response time, under power and reliability constraints in the presence of induced heterogeneities. To this end, we use a mathematical model to quantify task-to-core assignments based on performance variation. We also consider the variation in power to change the selected cores when the power constraint is not met. To meet the reliability constraint, we use N-modular redundancy while accounting for the variation in the soft error rate of cores to prevent under- or overestimation of reliability. To evaluate HDSAP, we run the SPLASH benchmark suite on the Sniper and MACPat simulators. HDSAP reduces the response time by 6%, 8%, and 25% in comparison with similar algorithms under the same power and reliability constraints.

Journal Article

Parallel optimization of Monte Carlo neutron transport method based on Sunway Bluelight II supercomputer

by Wang, Chengzhi , Guo, Ying , Pan, Jingshan in Efficiency , Floating point arithmetic , Monte Carlo simulation

2025

High-performance computing is crucial for complex nuclear energy simulations, and the Monte Carlo method is one of the most precise methods for such simulations. Based on the Sunway Bluelight II supercomputer, a general heterogeneous two-level parallel optimization method is proposed for the open-source Monte Carlo neutron transport code OpenMC. Thread-level parallel optimization includes direct parallel optimization, computational data optimization and load balancing optimization. In process-level optimization, a communication optimization method suited to the Sunway chip hardware architecture is proposed. Comprehensive tests are then conducted on two test models, B&W 1484 Core 1 and BEAVRS, using different data scales. Results demonstrate significant performance improvements: the optimized code achieves sustained floating-point performance of up to 5.34 TFLOPS. Within a single core group, neutron transport simulations of the B&W 1484 Core 1 model and the BEAVRS model achieve speedups of 25.12 and 20.29 times, respectively. In particular, when the two-level parallel optimization program is scaled to 2048 processes (2048 MPE + 131,072 CPE), the strong scalability of the B&W 1484 Core 1 model reaches 82.68%. When executing the BEAVRS benchmark problem, the weak scalability is nearly linear. Moreover, when using 2048 processes to execute computational tasks of the same scale, the two-level parallel optimization program on the Sunway Bluelight II supercomputer takes a similar amount of time to the standard MPI + OpenMP parallel program on the Shanhe supercomputer. Our work not only improves the efficiency of neutron transport simulations, but also provides a reference for other parallel optimization research on the Sunway supercomputer.

Journal Article

Parallel optimization of method of characteristics based on Sunway Bluelight II supercomputer

by Tian, Min , Chen, Renjiang , Liu, Zhaoyuan in Compilers , Computer Science , Coordinate transformations

2023

With the development of nuclear energy technology, reactor physics calculations place higher demands on accuracy and speed, and using high-performance computers for reactor simulation has become an inevitable trend. The method of characteristics (MOC) is currently recognized as the preferred method for simulating neutron transport in the nuclear reactor core. Based on the architecture of the Sunway many-core processor and the Sunway Bluelight II supercomputer, this paper proposes a fine-grained and universal two-level parallelization, comprising thread-level parallelization and process-level parallelization. In the thread-level parallelization, methods such as job pipeline optimization, load balancing across CPEs, and I/O optimization are proposed for acceleration. In the process-level parallelization, a mapping method from software to hardware is proposed. This method makes full use of the hardware of Sunway supercomputers and improves computing efficiency and data transmission efficiency. For the first time, the OpenMOC program is ported to and optimized in parallel on Sunway supercomputers, which enriches their application ecosystem. Compared with the original program, the two-level parallelization achieves up to an 18.6x speedup. Moreover, our parallelization can run on more than 3750 processes of the Sunway Bluelight II supercomputer with good strong and weak scalability.

Journal Article

AMT: asynchronous in-place matrix transpose mechanism for sunway many-core processor

by Chen, Zhengbo , Zheng, Fang , Chen, Zuoning in Algorithms , Efficiency , Logic design

2022

Matrix multiplication is widely used in a variety of application domains. When the input matrices and the product differ in memory format, a matrix transpose is required. The efficiency of the matrix transpose has a non-negligible impact on performance. However, the state-of-the-art software solution and its optimizations suffer from low efficiency because they frequently interfere with the main pipeline and cannot perform matrix transpose and multiplication in parallel. To address this issue, we propose AMT, an asynchronous and in-place matrix transpose mechanism based on the C2R algorithm, to perform matrix transpose efficiently. AMT performs matrix transpose in an asynchronous processing module and uses two customized asynchronous matrix transpose instructions to facilitate processing. We implement the logic design of AMT in RTL and verify its correctness. Simulation results show that AMT achieves an average speedup of 1.27x (up to 1.48x) over a state-of-the-art software baseline, and is within 95.4% of an ideal method. Overhead analysis shows that AMT incurs only small area overhead and power consumption.
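To make the idea of an in-place transpose concrete, the sketch below uses the generic cycle-following technique on a row-major matrix stored in a flat list. This is not the C2R algorithm and does not model the AMT hardware; the function name `transpose_in_place` is hypothetical, and the `visited` bitmap is auxiliary storage kept only for simplicity.

```python
def transpose_in_place(a, rows, cols):
    """Transpose a rows-by-cols row-major matrix stored in the flat list `a`.

    After the call, `a` holds the cols-by-rows transpose in row-major order.
    The element at flat index k moves to index (k * rows) mod (rows*cols - 1);
    the first and last elements never move.
    """
    size = rows * cols
    if len(a) != size:
        raise ValueError("list length does not match rows * cols")
    last = size - 1
    visited = [False] * size          # auxiliary bitmap, one flag per element
    for start in range(1, last):
        if visited[start]:
            continue
        k = start
        value = a[start]              # element currently being carried
        while True:
            dest = (k * rows) % last  # where the carried element belongs
            a[dest], value = value, a[dest]
            visited[dest] = True
            k = dest
            if k == start:            # permutation cycle closed
                break


m = [1, 2, 3,
     4, 5, 6]                         # 2 x 3
transpose_in_place(m, 2, 3)
print(m)                              # [1, 4, 2, 5, 3, 6], i.e. 3 x 2
```

Each element is moved exactly once along its permutation cycle, which is why no second full-size buffer is needed.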

Journal Article

Design of Parallel Algorithm for Kalman Filter on SW26010 Processors

by Xu, Dandan , Yang, Aiqiang in Algorithms , Buffers , Concurrency

2021

The Kalman filter algorithm, an effective data processing algorithm, has been widely used in space monitoring, wireless communications, tracking systems, the financial industry, and so on. On the Sunway TaihuLight platform, we present an improved parallel Kalman filter algorithm tailored to the new architecture of the SW26010 many-core processor (260 cores) and its new programming mode (master and slave heterogeneous collaboration mode). Furthermore, we propose a pipelined parallel mode for the KF algorithm based on the seven-level pipeline of the SW26010 processor. A vector optimization strategy and double buffering mechanisms are provided to improve the parallel efficiency of the parallel Kalman filter algorithm on SW26010 processors. The vector optimization strategy improves data concurrency in parallel computing. In addition, the communication time can be hidden by the double buffering mechanisms of SW26010 processors. The experimental results show that the performance and scalability of the parallel Kalman filter algorithm on the SW26010 are greatly improved compared with the CPU algorithm for five different data sets, and are also improved compared with the GPU algorithm.
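For readers unfamiliar with the underlying computation, the sketch below shows one predict/update step of a textbook scalar Kalman filter. It is sequential and does not reflect the pipelining, vectorization, or double buffering described in the abstract; the function and parameter names are hypothetical.

```python
def kalman_step(x, p, z, a=1.0, h=1.0, q=0.0, r=1.0):
    """One predict/update step of a scalar Kalman filter.

    x, p: previous state estimate and its variance
    z:    new measurement
    a:    state transition coefficient, h: measurement coefficient
    q:    process noise variance,       r: measurement noise variance
    Returns the updated (x, p).
    """
    # Predict
    x_pred = a * x
    p_pred = a * p * a + q
    # Update
    gain = p_pred * h / (h * p_pred * h + r)
    x_new = x_pred + gain * (z - h * x_pred)
    p_new = (1.0 - gain * h) * p_pred
    return x_new, p_new


# With equal prior and measurement variance the estimate moves halfway to z.
print(kalman_step(x=0.0, p=1.0, z=1.0))  # (0.5, 0.5)
```

In the matrix form of the filter, the same steps become matrix products and an inverse, which is where vectorization and overlapping communication with computation can pay off.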

Journal Article

Evaluation by Neutron Radiation of the NMR-MPar Fault-Tolerance Approach Applied to Applications Running on a 28-nm Many-Core Processor

by Velazco, Raoul , Vargas, Vanessa , Ramos, Pablo in Avionics , CMOS , Error correction

2018

Currently, there is special interest in validating the use of Commercial-Off-The-Shelf (COTS) multi/many-core processors for critical applications, thanks to their high performance, low power consumption and affordability. However, the continuous shrinking of transistor geometry and the increasing complexity of these devices dramatically affect their sensitivity to natural radiation and thus diminish their reliability. One of the most common effects produced by natural radiation is the Single Event Upset, a bit-flip in memory content that produces unexpected results at the application level. For this reason, manufacturers and users implement hardware and software error-mitigation techniques on multi/many-core processors. In this context, the present work evaluates a new fault-tolerance approach based on N-Modular Redundancy (NMR) and partitioning, called NMR-MPar, by means of 14 MeV neutron radiation ground testing, which emulates the effects of high-energy neutrons present at avionics altitudes. For evaluation purposes, a case study is implemented on the 28 nm CMOS KALRAY MPPA-256 many-core processor running two complementary benchmark applications: a distributed Matrix Multiplication and the Traveling Salesman Problem. Radiation experiments were conducted at the GENEPI2 particle accelerator. The correctness of the application's results when an error is detected confirms the approach’s effectiveness and supports its use in avionics applications.
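The voting idea behind N-modular redundancy can be sketched in a few lines: run the same computation N times and accept a result only if a strict majority of replicas agree. The sketch below is a generic voter, not the NMR-MPar approach, and it does not model partitioning; the function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(results):
    """Return (value, True) if a strict majority of replicas agree, else (None, False).

    results: non-empty list of hashable outputs, one per redundant replica.
    """
    if not results:
        raise ValueError("at least one replica result is required")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 > len(results):
        return value, True
    return None, False


# Triple redundancy: one replica suffered a (simulated) bit-flip.
print(majority_vote([42, 42, 7]))   # (42, True)
# No majority: the error is detected but cannot be masked.
print(majority_vote([1, 2, 3]))     # (None, False)
```

With N = 3 a single faulty replica is masked; when no majority exists, the voter can still report that an error was detected.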

Journal Article

Performance modeling and optimization of parallel LU-SGS on many-core processors for 3D high-order CFD simulations

by Xiong, Min , Deng, Xiaogang , Cheng, Bin in Algorithms , Decomposition , Domain decomposition methods

2017

As a typical Gauss–Seidel method, lower-upper symmetric Gauss–Seidel (LU-SGS) has an inherently strong data dependency that poses tough challenges for shared-memory parallelization. On early multi-core processors, the pipelined parallel LU-SGS approach achieves promising scalability. However, on emerging many-core processors such as Xeon Phi, experience from our in-house high-order CFD program shows that the parallel efficiency drops dramatically to less than 25%. In this paper, we model and analyze the performance of the pipelined parallel LU-SGS algorithm and present a two-level pipeline (TL-Pipeline) approach using nested OpenMP to further exploit fine-grained parallelism and mitigate the parallel performance bottlenecks. Our TL-Pipeline approach achieves 20% performance gains for a regular problem (256×256×256) on Xeon Phi. We also discuss some practical problems, including domain decomposition and algorithm parameter tuning for realistic CFD simulations. Generally, our work is applicable to the shared-memory parallelization of all Gauss–Seidel-like methods with intrinsic strong data dependency.

Journal Article

Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture

by 郑方 , 李宏亮 , 吕晖 , 过锋 , 许晓红 , 谢向辉 in Architecture (computers) , Artificial Intelligence , Bandwidths

2015

Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and shared memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, an efficient register-level communication mechanism, and a fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS, and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.

Journal Article
