Catalogue Search | MBRL
96 result(s) for "many-core processor"
Parallel deblocking filter for HEVC on many-core processor
by Yan, Chenggang; Li, Liang; Dai, Qionghai
in Applied sciences, Artificial intelligence, Coding
2014
High-efficiency video coding (HEVC) is the next-generation video coding standard. The deblocking filter (DF) constitutes a significant part of the HEVC decoder complexity. A three-step parallel framework (TPF) is proposed for the H.264/AVC DF, which is also suitable for HEVC except for the third step. The third step of the TPF is replaced with a directed acyclic graph-based order. Experiments show that the proposed method achieves a substantially greater speedup than the state-of-the-art parallel method.
Journal Article
Toward efficient structured-grid triangular solver on sunway many-core processors
2024
The sparse triangular solver (SpTRSV) is widely used in scientific and engineering applications. The structured-grid triangular solver of regular dependencies (STRSV) is a special kind of SpTRSV. General SpTRSVs that disregard the regularity of the matrix are unsuitable for solving this problem. This paper proposes an efficient parallel algorithm for STRSV on the SW26010 (a many-core processor independently designed in China), namely swStructTRSV. The algorithm makes full use of the fine-grained, low-latency communication characteristics of the SW26010 to reduce the waiting time for synchronization, maximizes the regularity of access to improve memory access bandwidth, and overlaps memory access with computation. Moreover, the idea of the algorithm can be extended to incomplete LU factorization (ILU factorization) because of consistent dependencies. The experimental results on a core group (an 8 × 8 network composed of 64 cores) of SW26010 show that swStructTRSV can achieve an average speedup of more than 30 over the sequential version. swStructTRSV on SW26010 achieves speedups of 2.2 and 6.3 over the fast STRSV (fSpTRSV) previously implemented on SW26010 and MKL on Intel Xeon Gold 6132, respectively. swStructTRSV significantly outperforms cuSparse on NVIDIA TITAN RTX in terms of overall execution time.
Journal Article
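The "regular dependencies" mentioned in the abstract above can be illustrated with a small sketch. It is not swStructTRSV and makes no use of SW26010 features. It assumes a lower-triangular system taken from a 2D five-point stencil with constant coefficients, where each unknown depends only on its west and south neighbours. The function names and the coefficients `diag`, `west`, and `south` are hypothetical. Cells on the same anti-diagonal do not depend on each other, which is the property a parallel solver could exploit.

```python
def solve_sequential(b, diag, west, south):
    """Forward substitution in row-major order on an n x m grid."""
    n, m = len(b), len(b[0])
    x = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = b[i][j]
            if i > 0:
                acc -= south * x[i - 1][j]   # dependency on the south neighbour
            if j > 0:
                acc -= west * x[i][j - 1]    # dependency on the west neighbour
            x[i][j] = acc / diag
    return x


def solve_wavefront(b, diag, west, south):
    """Same solve, visiting cells one anti-diagonal (i + j == k) at a time.

    All cells within one anti-diagonal are independent, so the inner loop
    could be distributed across cores; here it simply runs serially.
    """
    n, m = len(b), len(b[0])
    x = [[0.0] * m for _ in range(n)]
    for k in range(n + m - 1):
        for i in range(max(0, k - m + 1), min(n, k + 1)):
            j = k - i
            acc = b[i][j]
            if i > 0:
                acc -= south * x[i - 1][j]
            if j > 0:
                acc -= west * x[i][j - 1]
            x[i][j] = acc / diag
    return x


if __name__ == "__main__":
    rhs = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    print(solve_wavefront(rhs, diag=2.0, west=1.0, south=1.0))
```

Both orderings perform the same arithmetic per cell, so they produce identical results; only the traversal order differs.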
HDSAP: heterogeneity-aware dynamic scheduling algorithm to improve performance of nanoscale many-core processors for unknown workloads
2023
The performance growth in processors has been continuing toward increasing the number of processing cores on the chip and scaling the feature size of transistors. However, in the nanoera, side effects of the scaling, such as induced heterogeneities in the performance, power, and soft error rate of identically designed cores, prevent the potential performance from being fully utilized. In this paper, we harness the mentioned side effects in shared-memory multicore processors with unknown workloads by a dynamic heuristic scheduling algorithm called HDSAP. HDSAP aims to maximize performance, measured by the average response time, under power and reliability constraints in the presence of induced heterogeneities. In this regard, we use a mathematical model to quantify task-to-core assignments based on performance variation. We also consider the variation in power to change selected cores when the power constraint is missed. To meet the reliability constraint, we use N-modular redundancy while being aware of the variation in the soft error rate of cores to prevent under/over reliability estimation. To evaluate HDSAP, we run the SPLASH benchmark suite on Sniper and MACPat simulators. As a result, HDSAP reduces the response time by 6%, 8%, and 25% in comparison with similar algorithms under the same power and reliability constraints.
Journal Article
Parallel optimization of Monte Carlo neutron transport method based on Sunway Bluelight II supercomputer
by Wang, Chengzhi; Guo, Ying; Pan, Jingshan
in Efficiency, Floating point arithmetic, Monte Carlo simulation
2025
High-performance computing is crucial for complex nuclear energy simulations, and the Monte Carlo method is one of the most precise methods among them. Based on the Sunway Bluelight II supercomputer, the general heterogeneous two-level parallel optimization method is proposed for the open-source Monte Carlo neutron transport code (OpenMC). Thread-level parallel optimization includes direct parallel optimization, computational data optimization and load balancing optimization. In process-level optimization, a communication optimization method suitable for Sunway chip hardware architecture is proposed. Subsequently, comprehensive tests are conducted on two different test models, B&W 1484 Core 1 and BEAVRS, using different data scales. Results demonstrate significant performance improvements: the optimized code achieves sustained floating-point performance up to 5.34 TFLOPS. Within a single-core group, neutron transport simulations of the B&W 1484 Core 1 model and the BEAVRS model achieve speedups of 25.12 and 20.29 times, respectively. Particularly, when the two-level parallel optimization program is expanded to 2048 processes (2048 MPE + 131,072 CPE), the strong scalability of the B&W 1484 Core 1 model reaches 82.68%. When executing the BEAVRS benchmark problem, the weak scalability is nearly linear. Moreover, when using 2048 processes to execute computational tasks of the same scale, the two-level parallel optimization program of the Sunway Bluelight II supercomputer takes a similar amount of time as the standard MPI + OpenMP parallel program of the Shanhe supercomputer. Our work not only improves the efficiency of neutron transport simulations, but also provides reference value for other parallel optimization research on the Sunway supercomputer.
Journal Article
Parallel optimization of method of characteristics based on Sunway Bluelight II supercomputer
by Tian, Min; Chen, Renjiang; Liu, Zhaoyuan
in Compilers, Computer Science, Coordinate transformations
2023
With the development of nuclear energy technology, reactor physical calculations have higher requirements for calculation accuracy and speed, and it has become an inevitable trend to use high-performance computers for reactor simulation calculations. The method of characteristics (MOC) is currently recognized as the preferred method for simulating neutron transport in the nuclear reactor core. Based on the architecture of the Sunway many-core processor and the Sunway Bluelight II supercomputer, this paper proposes a fine-grained and universal two-level parallelization, including thread-level parallelization and process-level parallelization. In the thread-level parallelization, methods such as job pipeline optimization, load balancing across CPEs, and I/O optimization are proposed for acceleration. In the process-level parallelization, a mapping method from software to hardware is proposed. This method can make full use of the hardware of Sunway supercomputers and improve the computing efficiency and data transmission efficiency. For the first time, the OpenMOC program is ported to and optimized for parallel execution on the Sunway supercomputers, enriching the application ecosystem of Sunway supercomputers. Compared with the original program, the two-level parallelization can achieve up to 18.6x speedup. Moreover, our parallelization is capable of running on more than 3750 processes of the Sunway Bluelight II supercomputer with good strong and weak scalability.
Journal Article
AMT: asynchronous in-place matrix transpose mechanism for sunway many-core processor
2022
Matrix multiplication is widely used in a variety of application domains. When the input matrices and the product differ in the memory format, matrix transpose is required. The efficiency of matrix transpose has a non-negligible impact on performance. However, the state-of-the-art software solution and its optimizations suffer from low efficiency due to frequent interference with the main pipeline and their inability to perform matrix transpose and multiplication in parallel. To address this issue, we propose AMT, an asynchronous and in-place matrix transpose mechanism based on the C2R algorithm, to efficiently perform matrix transpose. AMT performs matrix transpose in an asynchronous processing module and uses two customized asynchronous matrix transpose instructions to facilitate processing. We implement the logic design of AMT using RTL and verify its correctness. Simulation results show that AMT achieves an average of 1.27x (up to 1.48x) speedup over a state-of-the-art software baseline, and is within 95.4% of an ideal method. Overhead analysis shows that AMT only incurs small area overhead and power consumption.
Journal Article
Design of Parallel Algorithm for Kalman Filter on SW26010 Processors
2021
The Kalman filter algorithm, an effective data processing algorithm, has been widely used in space monitoring, wireless communications, tracking systems, the financial industry, and so on. On the Sunway TaihuLight platform, we present an improved Kalman filter parallel algorithm tailored to the new architecture of the SW26010 many-core processors (260 cores) and the new programming mode (master and slave heterogeneous collaboration mode). Furthermore, we propose a pipelined parallel mode for the KF algorithm based on a seven-level pipeline of the SW26010 processor. The vector optimization strategy and double buffering mechanisms are provided to improve the parallel efficiency of the Kalman filter parallel algorithm on SW26010 processors. The vector optimization strategy can improve data concurrency in parallel computing. In addition, the communication time can be hidden by double buffering mechanisms of SW26010 processors. The experimental results show that the performance and scalability of the parallel Kalman filter algorithm based on SW26010 are greatly improved compared with the CPU algorithm for five different data sets, and are also improved compared with the algorithm on GPU.
Journal Article
Evaluation by Neutron Radiation of the NMR-MPar Fault-Tolerance Approach Applied to Applications Running on a 28-nm Many-Core Processor
2018
Currently, there is a special interest in validating the use of Commercial-Off-The-Shelf (COTS) multi/many-core processors for critical applications thanks to their high performance, low power consumption and affordability. However, the continuous shrinking of transistor geometry and the increasing complexity of these devices dramatically affect their sensitivity to natural radiation, and thus diminish their reliability. One of the most common effects produced by natural radiation is the Single Event Upset, a bit-flip in memory content that produces unexpected results at application level. For this reason, manufacturers and users implement hardware and software error-mitigation techniques on multi/many-core processors. In this context, the present work aims at evaluating a new fault-tolerance approach based on N-Modular redundancy (NMR) and partitioning called NMR-MPar by means of 14 MeV neutron radiation ground testing in order to emulate the effects of high-energy neutrons present at avionics altitudes. For evaluation purposes, a case study is implemented on the 28 nm CMOS KALRAY MPPA-256 many-core processor running two complementary benchmark applications: a distributed Matrix Multiplication and the Traveling Salesman Problem. Radiation experiments were conducted in the GENEPI2 particle accelerator. The correctness of the application results when an error is detected confirms the approach’s effectiveness and supports its use in avionics applications.
Journal Article
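The N-modular redundancy idea named in the abstract above can be sketched as a simple majority vote over redundant runs. This is an illustrative assumption, not the NMR-MPar implementation: partitioning and hardware isolation are not modeled, and the names `nmr_vote` and `run_with_nmr` are hypothetical.

```python
from collections import Counter


def nmr_vote(results):
    """Return the value produced by a strict majority of the replicas.

    Raises ValueError when no value has more than half of the votes,
    which signals an error that voting alone cannot mask.
    """
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 > len(results):
        return value
    raise ValueError("no majority among replica results")


def run_with_nmr(task, args, n=3):
    """Run `task` n times (serially here) and vote on the outputs.

    Results must be hashable so they can be counted.
    """
    return nmr_vote([task(*args) for _ in range(n)])


if __name__ == "__main__":
    # Two replicas agree and one is corrupted, so the vote masks the fault.
    print(nmr_vote([42, 42, 7]))
```

With three replicas, a single corrupted result is outvoted; if all three disagree, the sketch reports the failure rather than guessing.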
Performance modeling and optimization of parallel LU-SGS on many-core processors for 3D high-order CFD simulations
by Xiong, Min; Deng, Xiaogang; Cheng, Bin
in Algorithms, Decomposition, Domain decomposition methods
2017
As a typical Gauss–Seidel method, the inherent strong data dependency of lower-upper symmetric Gauss–Seidel (LU-SGS) poses tough challenges for shared-memory parallelization. On early multi-core processors, the pipelined parallel LU-SGS approach achieves promising scalability. However, on emerging many-core processors such as Xeon Phi, experience from our in-house high-order CFD program shows that the parallel efficiency drops dramatically to less than 25%. In this paper, we model and analyze the performance of the pipelined parallel LU-SGS algorithm, and present a two-level pipeline (TL-Pipeline) approach using nested OpenMP to further exploit fine-grained parallelism and mitigate the parallel performance bottlenecks. Our TL-Pipeline approach achieves 20% performance gains for a regular problem (256×256×256) on Xeon Phi. We also discuss some practical problems including domain decomposition and algorithm parameter tuning for realistic CFD simulations. Generally, our work is applicable to the shared-memory parallelization of all Gauss–Seidel-like methods with intrinsic strong data dependency.
Journal Article
Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture
Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and shared memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, efficient register-level communication mechanism, and fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.
Journal Article