Catalogue Search | MBRL
Explore the vast range of titles available.
168 result(s) for "multicore architecture"
A shared libraries aware and bank partitioning-based mechanism for multicore architecture
by Liu, Gang; Chen, Yucong; Zhou, Rui
in Artificial Intelligence; Computational Intelligence; Control
2023
Dynamic random-access memory (DRAM) consists of several banks, which are resources shared among cores. Memory interference arises when cores contend for the same banks, reducing overall system performance. Shared libraries, which are ubiquitous in modern operating systems, exacerbate the problem: the physical memory they occupy is often distributed across all DRAM banks, and shared library code runs frequently, causing a large number of row-buffer conflicts and degrading system performance. This paper proposes a new shared-library-aware and bank-partitioning-based mechanism (SBM) that accounts for inter-thread interference caused by shared libraries and assigns DRAM banks to specific cores rather than to processes, thereby exploiting bank-level parallelism (BLP) and improving performance isolation. We conducted several experiments to assess the degree of performance isolation achieved by SBM; the findings indicate that SBM significantly enhances it.
Journal Article
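Bank partitioning of the kind this abstract describes is typically realized in the OS allocator with page coloring; below is a minimal, purely illustrative sketch (the bank-address bits, page size, and per-core bank sets are all assumptions, not taken from the paper):

```python
# Hypothetical sketch of bank partitioning via page coloring. The DRAM
# bank index is assumed to come from physical-address bits 13-15 (real
# mappings are memory-controller-specific), and each core is restricted
# to allocating pages that fall in its assigned banks.

PAGE_SIZE = 4096
BANK_BITS = (13, 14, 15)  # assumed bank-index bits

def bank_of(phys_addr):
    """Extract the bank index from a physical address."""
    return sum(((phys_addr >> b) & 1) << i for i, b in enumerate(BANK_BITS))

def pages_for_core(free_pages, core_banks):
    """Filter a free-page list down to pages in the core's banks."""
    return [p for p in free_pages if bank_of(p) in core_banks]

free_pages = [n * PAGE_SIZE for n in range(32)]
core0 = pages_for_core(free_pages, core_banks={0, 1})
core1 = pages_for_core(free_pages, core_banks={2, 3})
print(len(core0), len(core1))  # disjoint page pools per core
```

Because the two cores draw from disjoint bank sets, their row-buffer accesses cannot conflict with each other, which is the isolation effect the abstract describes.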
Time–Energy Correlation for Multithreaded Matrix Factorizations
2023
The relationship between time and energy is an important aspect of energy savings in modern multicore architectures. In this paper, we investigated and analyzed the correlation between time and energy. We compared the execution time and energy consumption of LU factorization algorithms (with and without pivoting) and Cholesky factorization, using the Math Kernel Library (MKL) on a multicore machine. To reduce the energy of these multithreaded factorizations, the Dynamic Voltage and Frequency Scaling (DVFS) technique was used; it allows the clock frequency to be scaled without changing the implementation. In particular, we studied the correlations between time and energy using two metrics: Energy Delay Product (EDP) and Greenup, Powerup, and Speedup (GPS-UP). An experimental evaluation was performed on an Intel Xeon Gold multicore machine as a function of the number of threads and the clock speed. Our test results showed that scalability in terms of execution time, expressed by the Speedup metric, grew close to linearly as the number of threads increased. In contrast, scalability in terms of energy consumption, expressed by the Greenup metric, grew close to logarithmically. The EDP and GPS-UP metrics allowed us to evaluate the impact of the optimized code (DVFS and an increased number of threads) on time and energy consumption and to determine a greener category representing energy savings without loss of performance.
Journal Article
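The metrics named in this abstract are simple ratios between a baseline run and an optimized run; a small sketch with made-up timing and energy figures:

```python
# Illustrative computation of the time/energy metrics named in the
# abstract: Speedup, Greenup, Powerup (the GPS-UP triple) and the
# Energy Delay Product (EDP). All numbers below are hypothetical.

def gps_up(t_base, e_base, t_opt, e_opt):
    """Return (speedup, greenup, powerup) of an optimized run
    relative to a baseline run."""
    speedup = t_base / t_opt      # >1 means faster
    greenup = e_base / e_opt      # >1 means less energy
    powerup = speedup / greenup   # >1 means higher average power
    return speedup, greenup, powerup

def edp(time_s, energy_j):
    """Energy Delay Product: lower is better."""
    return energy_j * time_s

# Baseline: 1 thread; optimized: 8 threads with DVFS (made-up figures).
s, g, p = gps_up(t_base=100.0, e_base=2000.0, t_opt=15.0, e_opt=900.0)
print(f"Speedup={s:.2f} Greenup={g:.2f} Powerup={p:.2f}")
print(f"EDP baseline={edp(100.0, 2000.0):.0f} optimized={edp(15.0, 900.0):.0f}")
```

A run with Speedup above 1 and Greenup above 1 but Powerup above 1 saves time and energy while drawing more average power, which is the kind of trade-off the GPS-UP categories classify.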
Dataflow-based automatic parallelization of MATLAB/Simulink models for fitting modern multicore architectures
2024
In many fields, including aerospace, automotive, and telecommunications, MathWorks' MATLAB/Simulink is the contemporary standard for model-based design. The strengths of Simulink are rapid design and algorithm exploration. Models created with Simulink are purely functional, however, so designers cannot easily reason about a Simulink model's architecture. As current architectures are optimized to run on multicore processors, software running on these processors must be parallelized to benefit from their full performance. Designers therefore need to understand how a Simulink model can be parallelized and how an adequate multicore architecture can be selected. This paper focuses on the dataflow-based parallelization of Simulink models and proposes a dataflow-based method to measure the performance of parallelized Simulink models running on multicore architectures. During parallelization, the model is converted into a Hierarchical Synchronous DataFlow Graph (HSDFG) that keeps its original semantics, and each composite node in the graph is flattened. The graph is then mapped and scheduled onto a multicore architecture with the objective of minimizing end-to-end latency. In an experiment applying the proposed approach to a real Simulink model, the latency of the parallelized model was successfully reduced on various multicore architectures.
Journal Article
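The mapping-and-scheduling step described in this abstract can be illustrated with a generic list-scheduling heuristic over a small task graph; this is a hedged sketch of the general technique, not the paper's algorithm (node costs and the graph are invented):

```python
# Greedy list scheduling of a small dataflow graph onto a fixed number
# of cores: each ready node is placed on the core where it can start
# earliest, aiming to reduce end-to-end latency (makespan).

def list_schedule(costs, edges, n_cores):
    """costs: {node: time}; edges: set of (u, v) meaning u precedes v.
    Returns (makespan, {node: (core, start_time)})."""
    preds = {n: {u for (u, v) in edges if v == n} for n in costs}
    finish = {}                      # node -> finish time
    core_free = [0.0] * n_cores      # per-core availability
    placement = {}
    remaining = set(costs)
    while remaining:
        # nodes whose predecessors have all been scheduled
        ready = [n for n in remaining if preds[n] <= finish.keys()]
        # pick the costliest ready node first (a common heuristic)
        node = max(ready, key=lambda n: costs[n])
        earliest = max((finish[p] for p in preds[node]), default=0.0)
        core = min(range(n_cores), key=lambda c: max(core_free[c], earliest))
        start = max(core_free[core], earliest)
        finish[node] = start + costs[node]
        core_free[core] = finish[node]
        placement[node] = (core, start)
        remaining.remove(node)
    return max(finish.values()), placement

# Diamond-shaped graph: A feeds B and C, which both feed D.
costs = {"A": 2, "B": 3, "C": 2, "D": 1}
edges = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
makespan, plan = list_schedule(costs, edges, n_cores=2)
print(makespan)  # prints 6.0: B and C overlap on the two cores
```

On one core the same graph would take 8 time units; the overlap of B and C on two cores is exactly the latency reduction the dataflow mapping is after.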
Accelerating prediction of RNA secondary structure using parallelization on multicore architecture
by Raut, Roshani; Borkar, Pradnya; Raghuwanshi, Mukesh
in Algorithms; Bioinformatics; Computer architecture
2023
Due to the COVID pandemic, the investigation of bioinformatics, and more specifically of RNA, has become a major research focus. mRNA-based vaccines have been developed by a significant number of academics and scientists. When a virus infects a human body, it first disrupts the host's RNA structure and then transforms the host's RNA into its own genetic structure. As a result, it is vital to research the process of predicting the secondary structures of RNA. Predicting the secondary structure of long RNA sequences takes a significant amount of time. This study contributes an implementation of an algorithm that finds the predicted secondary structure. The methodology for RNA secondary structure prediction employed in this study is based on a dynamic programming model that uses a shared-memory multicore architecture. The Nearest Neighbor Thermodynamic Model (NNTM) is used as the foundation for determining the minimum free energy. The Gutell database has been used for the RNA sequences of a number of different bacterial species, and the time required to identify the secondary structure of RNA was compared. The length of the RNA sequence limits the performance of many existing methods, but the proposed GAfold technique handles sequences of any length. Once the secondary structure has been identified, it helps to detect the virus. The GAfold approach is found to speed up RNA secondary structure prediction, offering a speedup factor of 2.5 on a four-core architecture, 3.09 on an eight-core design, and 5.5 on a twelve-core architecture. It has been shown that predicting suboptimal structures enhances the accuracy of the free-energy-minimization algorithm, which in turn improves the accuracy of the full RNA prediction.
Journal Article
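The paper's prediction rests on the thermodynamic NNTM model, which is too involved to reproduce here; as an illustrative stand-in, the classic Nussinov dynamic program below maximizes base pairs over the same O(n³) table whose anti-diagonals are what a shared-memory parallelization can compute concurrently:

```python
# Nussinov base-pair maximization: a simpler illustrative stand-in for
# the thermodynamic (NNTM) dynamic program the paper parallelizes.
# dp[i][j] = max base pairs in seq[i..j]; cells on the same diagonal
# (same j - i) are independent and could be filled in parallel.

def nussinov(seq, min_loop=3):
    pairs = {("A","U"), ("U","A"), ("G","C"), ("C","G"), ("G","U"), ("U","G")}
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):        # walk the diagonals outward
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]                # case 1: base i unpaired
            for k in range(i + min_loop + 1, j + 1):
                if (seq[i], seq[k]) in pairs:  # case 2: i pairs with k
                    left = dp[i + 1][k - 1]
                    right = dp[k + 1][j] if k + 1 <= j else 0
                    best = max(best, 1 + left + right)
            dp[i][j] = best
    return dp[0][n - 1]

print(nussinov("GGGAAAUCC"))  # prints 3: a small hairpin with 3 stacked pairs
```

The `min_loop` parameter enforces the usual minimum hairpin-loop length of three unpaired bases; energy-based models like NNTM replace the `+1` pair reward with stacking and loop free energies but keep the same table shape.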
Efficient Hybrid Parallel Scheme for Caputo Time-Fractional PDEs on Multicore Architectures
by Shams, Mudassir; Carpentieri, Bruno
in biomedical fractional models; Boundary conditions; Caputo time-fractional PDEs
2025
We present a hybrid parallel scheme for efficiently solving Caputo time-fractional partial differential equations (CTFPDEs) with integer-order spatial derivatives on multicore CPU and GPU platforms. The approach combines a second-order spatial discretization with the L1 time-stepping scheme and employs MATLAB parfor parallelization to achieve significant reductions in runtime and memory usage. A theoretical third-order convergence rate is established under smooth-solution assumptions, and the analysis also accounts for the loss of accuracy near the initial time t=t0 caused by weak singularities inherent in time-fractional models. Unlike many existing approaches that rely on locally convergent strategies, the proposed method ensures global convergence even for distant or randomly chosen initial guesses. Benchmark problems from fractional biological models—including glucose–insulin regulation, tumor growth under chemotherapy, and drug diffusion in tissue—are used to validate the robustness and reliability of the scheme. Numerical experiments confirm near-linear speedup on up to four CPU cores and show that the method outperforms conventional techniques in terms of convergence rate, residual error, iteration count, and efficiency. These results demonstrate the method’s suitability for large-scale CTFPDE simulations in scientific and engineering applications.
Journal Article
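The L1 time-stepping scheme mentioned in this abstract has a standard textbook form; a small sketch of its weights, checked against the exact Caputo derivative of a linear function (the grid parameters here are arbitrary):

```python
import math

# Sketch of the L1 discretization of the Caputo derivative of order
# 0 < alpha < 1, the time-stepping scheme named in the abstract.
# Weights: b_k = (k+1)^(1-alpha) - k^(1-alpha).

def caputo_l1(u, tau, alpha):
    """Approximate D^alpha u at each grid point t_n = n*tau."""
    n = len(u) - 1
    c = tau ** (-alpha) / math.gamma(2 - alpha)
    out = [0.0]
    for m in range(1, n + 1):
        s = sum(((k + 1) ** (1 - alpha) - k ** (1 - alpha))
                * (u[m - k] - u[m - k - 1])
                for k in range(m))
        out.append(c * s)
    return out

# Sanity check on u(t) = t, whose exact Caputo derivative is
# t^(1-alpha) / Gamma(2-alpha); the L1 scheme is exact for linear u.
alpha, tau, N = 0.5, 0.01, 100
u = [n * tau for n in range(N + 1)]
approx = caputo_l1(u, tau, alpha)
exact = 1.0 ** (1 - alpha) / math.gamma(2 - alpha)  # value at t = 1
print(abs(approx[-1] - exact))  # ~0 up to rounding
```

The history sum over all earlier steps is what makes time-fractional solvers memory- and compute-heavy, and it is the natural target for the `parfor`-style parallelism the paper employs.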
Task-Based FMM for Multicore Architectures
by Darve, Eric; Bramas, Bérenger; Coulaud, Olivier
in Algorithms; Applied mathematics; Approximation
2014
Fast multipole methods (FMM) are a fundamental operation for the simulation of many physical problems. The high-performance design of such methods usually requires carefully tuning the algorithm for both the targeted physics and the hardware. In this paper, we propose a new approach that achieves high performance across architectures. Our method consists of expressing the FMM algorithm as a task flow and employing a state-of-the-art runtime system, StarPU, to process the tasks on the different computing units. We carefully design the task flow, the mathematical operators, their implementations, and the scheduling schemes. Potentials and forces on 200 million particles are computed in 42.3 seconds on a homogeneous 160-core SGI Altix UV 100, and good scalability is shown.
Journal Article
Multicore Photonic Complex-Valued Neural Network with Transformation Layer
2022
Photonic neural network chips have been widely studied because of their low power consumption, high speed, and large bandwidth. By encoding with amplitude and phase, photonic chips can accelerate complex-valued neural network computations. In this article, a photonic complex-valued neural network (PCNN) chip is designed. The scale of the single-core PCNN chip is limited by optical losses, so a multicore architecture is used to improve computing capability. Further, to improve the performance of the PCNN, we propose the transformation layer, which can be implemented by the designed photonic chip to transform real-valued encoding into complex-valued encoding, which carries richer information. Compared with real-valued input, the transformation layer can effectively improve the classification accuracy from 93.14% to 97.51% for a 64-dimensional input on the MNIST test set. Finally, we analyze the multicore computation of the PCNN. Compared with the single-core architecture, the multicore architecture can improve classification accuracy by implementing larger neural networks and has better phase-noise robustness. The proposed architecture and algorithms are beneficial for accelerated computing on photonic chips for complex-valued neural networks and are promising for many applications, such as image recognition and signal processing.
Journal Article
CAL: Core-Aware Lock for the big.LITTLE Multicore Architecture
2024
The notion that "all cores are created equal" has been popular for several decades due to its simplicity and effectiveness in CPU (Central Processing Unit) design: the more cores a CPU has, the higher the host's performance, and the higher its power consumption. However, power saving is also a key goal for servers in data centers and for embedded devices (e.g., mobile phones). The big.LITTLE multicore architecture, which contains high-performance cores (big cores) and power-saving cores (little cores), has been developed by ARM (Advanced RISC Machine) and Intel to trade off performance against power efficiency. On this new heterogeneous computing architecture, traditional lock algorithms, which are designed for homogeneous architectures, cannot work optimally and run into performance issues caused by the difference between big and little cores. In our preliminary experiments, we observed that on the big.LITTLE multicore architecture all these lock algorithms exhibit sub-optimal performance. FIFO-based (First In First Out) locks experience throughput degradation, while competition-based locks fall into two categories: big-core-friendly locks, whose tail latency increases significantly, and little-core-friendly locks, whose tail latency increases and whose throughput also degrades. Motivated by this observation, we propose a Core-Aware Lock for the big.LITTLE multicore architecture, named CAL, which gives each core an equal opportunity to access the critical section. The core idea of CAL is to take the slowdown ratio as the metric for reordering the lock requests of the big and little cores. Evaluations on benchmarks and on a real-world application, LevelDB, confirm that CAL achieves its fairness goals on heterogeneous computing architecture without sacrificing the performance of the big cores. Compared to several traditional lock algorithms, CAL's fairness increases by up to 67%, and its throughput is 26% higher than FIFO-based locks and 53% higher than competition-based locks. In addition, the tail latency of CAL is always kept at a low level.
Journal Article
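The abstract's core idea, reordering lock requests by slowdown ratio, can be sketched with a priority queue; this is one possible reading of the policy, with made-up slowdown ratios, and not the paper's actual implementation:

```python
import heapq

# Hypothetical sketch of a core-aware grant queue: each waiter enqueues
# a lock request tagged with its core's slowdown ratio (how much slower
# a little core runs the critical section than a big core), and requests
# are granted so that slower cores are compensated rather than starved.

class CoreAwareQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker keeps FIFO order among equal ratios

    def enqueue(self, core_id, slowdown_ratio):
        # Higher slowdown ratio -> smaller key -> served earlier.
        heapq.heappush(self._heap, (-slowdown_ratio, self._seq, core_id))
        self._seq += 1

    def grant_next(self):
        return heapq.heappop(self._heap)[2]

q = CoreAwareQueue()
q.enqueue("big-0", slowdown_ratio=1.0)
q.enqueue("little-0", slowdown_ratio=2.5)
q.enqueue("big-1", slowdown_ratio=1.0)
order = [q.grant_next() for _ in range(3)]
print(order)  # the little core is served first under this made-up policy
```

A plain FIFO lock would grant in arrival order regardless of core type; the reordering above is the ingredient that lets a heterogeneity-aware lock rebalance tail latency between big and little cores.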
AI based realtime task schedulers for multicore processor based low power biomedical devices for health care application
by Ponnan, Suresh; Prabhaker, M. Lordwin Cecil
in 1214: Multimedia Medical Data-driven Decision Making; Artificial intelligence; Bioinformatics
2022
Bioinformatics data processing plays a vital role in low-power biomedical devices. The functional domain of processing biological data spans collection, execution, conversion, storage, and distribution, so effective multi-objective real-time task scheduling techniques are required to provide better solutions in this domain. This paper describes novel AI-based multi-objective evolutionary algorithmic techniques, namely the multi-objective genetic algorithm (MOGA), the non-dominated sorting genetic algorithm (NSGA), and the multi-objective messy genetic algorithm (MOMGA), for scheduling real-time tasks on a multicore-processor-based low-power biomedical device used for health care applications. These techniques improve performance over earlier reported systems by considering multiple objectives: low power consumption (P), maximized core utilization (U), and minimized deadline miss-rate (δ). The novelty of this work is achieving schedulability of real-time tasks by computing the converging value of a series of task parameters such as execution time, release time, workload, and arrival time. Finally, we investigated the performance parameters power consumption (P), deadline miss-rate (δ), and core utilization for the given architecture. The evaluation results show that power consumption is reduced by about 5–8%, core utilization is increased by about 10% to 40%, and the deadline miss-rate is minimized compared with conventional real-time scheduling approaches.
Journal Article
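Of the three objectives, core utilization has a simple closed form; below is a minimal sketch (the task parameters are invented) using the classic EDF utilization test rather than the paper's genetic algorithms:

```python
# Illustrative computation of the utilization objective U from the
# abstract. For periodic tasks on one core, earliest-deadline-first
# (EDF) scheduling misses no deadlines iff total utilization <= 1,
# so U doubles as a schedulability test. Power is hardware-specific
# and omitted here.

def core_utilization(tasks):
    """tasks: list of (execution_time, period) pairs."""
    return sum(c / t for c, t in tasks)

tasks = [(1, 4), (2, 8), (1, 5)]  # hypothetical (execution time, period)
u = core_utilization(tasks)
print(f"U = {u:.2f}, schedulable under EDF: {u <= 1.0}")
```

Multi-objective schedulers like those in the paper search for task-to-core assignments that push each core's U as close to 1 as possible without crossing it, since exceeding the bound is what produces deadline misses.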
A hybrid crossbar-ring on chip network topology for performance improvement of multicore architectures
2023
Multicore architectures have become popular for delivering improved performance across different application domains. The performance of a system depends on various factors such as the instruction set architecture, the compiler, the types of cores, memory technologies, and the on-chip interconnect topology. The role of the computer architect is to propose an application-specific design that improves the performance of the architecture through different techniques. Several parameters affect the performance of multicore architectures; the most commonly valued are execution time and energy consumption, and all classes of multicore architectures still strive for more energy-aware solutions. This work proposes a hybrid crossbar-ring on-chip network topology that uses a crossbar switch or router together with a ring and divides the network into a number of segments. Application environments with more intrasegment communication use this design efficiently, since fewer nodes are involved in each segment. The focus of the work is to improve performance in terms of average hop count, packet latency, execution time, and energy. Packet latency is improved while keeping the average hop count constant. Execution time decreases by 6.26% on average, with a maximum decrease of 8.84%, and total energy consumption is reduced by 5.93% on average. The proposed hybrid crossbar-ring on-chip network topology outperforms the segmented bus topology while keeping the average hop count within a fixed range.
Journal Article
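A hop-count model for such a hybrid topology can be sketched as follows; the gateway placement, the crossbar cost of two hops, and the segment sizes are all assumptions for illustration, not the paper's parameters:

```python
# Hypothetical hop-count model of a hybrid crossbar-ring topology:
# nodes are grouped into segments, each segment is a small ring, and
# segments are joined through a central crossbar. We assume one hop per
# ring link, a gateway at ring position 0, and two hops to cross the
# crossbar (gateway in, gateway out).

def ring_hops(a, b, size):
    """Shortest distance between two positions on a ring."""
    d = abs(a - b)
    return min(d, size - d)

def hops(src, dst, seg_size):
    s_seg, s_pos = divmod(src, seg_size)
    d_seg, d_pos = divmod(dst, seg_size)
    if s_seg == d_seg:                 # intrasegment: stay on the ring
        return ring_hops(s_pos, d_pos, seg_size)
    # intersegment: ring to the gateway, crossbar, ring to the target
    return ring_hops(s_pos, 0, seg_size) + 2 + ring_hops(0, d_pos, seg_size)

def average_hops(n_segments, seg_size):
    n = n_segments * seg_size
    total = sum(hops(i, j, seg_size)
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

print(average_hops(n_segments=4, seg_size=4))
```

Under this toy model, intrasegment traffic never touches the crossbar, which is why workloads with mostly local communication benefit most from the hybrid design.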