Catalogue Search | MBRL

Parallel programming for modern high performance computing systems

by Czarnul, Pawel, author in Parallel programs (Computer programs) , Parallel algorithms.

Book

A parallelizable method for two-dimensional wave propagation using subdomains in time with Multigrid and Waveform Relaxation

by Malacarne, Maicon Felipe , Franco, Sebastião Romero , Pinto, Márcio Augusto Villela in Algorithms , Decomposition , Finite difference method

2025

In this paper we compare the implicit schemes for the solution of the two-dimensional wave equation using Singlegrid and Multigrid methods. The discretization is performed using the Finite Difference Method, weighted in time by an established parameter. The parallelization of the algorithms is ensured by employing the Waveform Relaxation method, where numerical stability is achieved by applying the method of subdomains in time. The primary innovation of this work lies in the development of a high-order method that harnesses the parallelizability and robustness of the Multigrid method, enabling efficient solutions to the 2D wave equation. These methods also effectively mitigate oscillations that would otherwise significantly increase the maximum residual, a concern arising from the application of the standard Waveform Relaxation method.

Journal Article

Parallel Optimization of Program Instructions Using Genetic Algorithms

by Anghelescu, Petre in Crossovers , Evolutionary algorithms , Genetic algorithms

2021

This paper describes an efficient solution to parallelize software program instructions, regardless of the programming language in which they are written. We solve the problem of the optimal distribution of a set of instructions on available processors. We propose a genetic algorithm to parallelize computations, using evolution to search the solution space. The stages of our proposed genetic algorithm are: the choice of the initial population and its representation in chromosomes, the crossover, and the mutation operations customized to the problem being dealt with. In this paper, genetic algorithms are applied to the entire search space of the parallelization of the program instructions problem. This problem is NP-complete, so no polynomial-time algorithm is known that can scan the solution space and solve the problem exactly. The genetic algorithm-based method is general, and it is simple and efficient to implement because it can be scaled to a larger or smaller number of instructions that must be parallelized. The parallelization technique proposed in this paper was developed in the C# programming language, and our results confirm the effectiveness of our parallelization method. Experimental results obtained and presented for different working scenarios confirm the theoretical results, and they provide insight on how to improve the exploration of a search space that is too large to be searched exhaustively.
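The stages named in the abstract above (initial population, chromosome representation, crossover, mutation) can be illustrated with a minimal sketch. This is not the paper's C# implementation. It assumes independent instructions with known durations, a chromosome that stores one processor index per instruction, and a fitness equal to the finish time of the busiest processor. The names `makespan`, `evolve`, and all parameter values are hypothetical.

```python
import random

def makespan(assignment, durations, n_procs):
    """Finish time of the busiest processor for a given assignment."""
    loads = [0.0] * n_procs
    for task, proc in enumerate(assignment):
        loads[proc] += durations[task]
    return max(loads)

def evolve(durations, n_procs, pop_size=40, generations=200,
           mutation_rate=0.05, seed=0):
    """Toy genetic algorithm (assumed setup, not the paper's method)."""
    rng = random.Random(seed)
    n = len(durations)
    # Chromosome: one gene per instruction, holding its processor index.
    population = [[rng.randrange(n_procs) for _ in range(n)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda c: makespan(c, durations, n_procs))
        survivors = population[:pop_size // 2]   # keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)            # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n):                   # per-gene mutation
                if rng.random() < mutation_rate:
                    child[i] = rng.randrange(n_procs)
            children.append(child)
        population = survivors + children
    return min(population, key=lambda c: makespan(c, durations, n_procs))

# Hypothetical instruction durations, distributed over 3 processors.
durations = [4, 7, 2, 9, 3, 6, 5, 8, 1, 5]
best = evolve(durations, n_procs=3)
print(best, makespan(best, durations, 3))
```

Because the better half of each generation is carried over unchanged, the best makespan never gets worse from one generation to the next. Dependencies between instructions, which a real scheduler would have to respect, are deliberately left out of this sketch.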

Journal Article

Efficient simulation of neural development using shared memory parallelization

by De Schutter, Erik in growth , Linux , Memory

2023

The Neural Development Simulator, NeuroDevSim, is a Python module that simulates the most important aspects of brain development: morphological growth, migration, and pruning. It uses an agent-based modeling approach inherited from the NeuroMaC software. Each cycle has agents called fronts execute model-specific code. In the case of a growing dendritic or axonal front, this will be a choice between extension, branching, or growth termination. Somatic fronts can migrate to new positions and any front can be retracted to prune parts of neurons. Collision detection prevents new or migrating fronts from overlapping with existing ones. NeuroDevSim is a multi-core program that uses an innovative shared memory approach to achieve parallel processing without messaging. We demonstrate linear strong parallel scaling up to 96 cores for large models and have run these successfully on 128 cores. Most of the shared memory parallelism is achieved without memory locking. Instead, cores have only write privileges to private sections of arrays, while being able to read the entire shared array. Memory conflicts are avoided by a coding rule that allows only active fronts to use methods that need writing access. The exception is collision detection, which is needed to avoid the growth of physically overlapping structures. For collision detection, a memory-locking mechanism was necessary to control access to grid points that register the location of nearby fronts. A custom approach using a serialized lock broker was able to manage both read and write locking. NeuroDevSim allows easy modeling of most aspects of neural development for models simulating a few complex or thousands of simple neurons or a mixture of both.

Journal Article

Performance evaluation of GPU-based parallel sorting algorithms

by Baubek, Baizhan , Algarni, Abdulmohsen , Tolendi, Nurdaulet in Algorithms , Analysis , Communication

2026

Sorting can be approached in two main ways: sequentially and in parallel. In sequential sorting, data is processed in a single-threaded manner, which can be slow for large datasets. However, parallel sorting divides the task across multiple processing units, enabling faster results by processing data simultaneously. Furthermore, Compute Unified Device Architecture (CUDA) technology enables developers to leverage GPU power for general-purpose parallel computing, significantly accelerating tasks like sorting. This paper investigates the GPU-based parallelization of merge sort (MS), quick sort (QS), bubble sort (BS), radix top-k selection sort (RS), and slow sort (SS), presenting optimized algorithms designed for efficient sorting of large datasets using modern GPUs. The primary objective is to evaluate the performance of these algorithms on GPUs utilizing CUDA, with a focus on analyzing both parallel time complexity and space complexity across various data types. Experiments are conducted on four dataset scenarios: randomly generated data, reverse-sorted data, already-sorted data, and nearly-sorted data. Also, the performance of GPU-accelerated implementations is compared with their sequential counterparts to assess improvements in computational efficiency and scalability. Earlier generations of GPU-based implementations of this type typically achieved acceleration rates between 2× and 9× over scalar CPU code. With newer GPU enhancements, including parallel-aware primitives and radix- or merge-optimized operations, acceleration rates have improved significantly. Our experiments indicate that GPU-based Radix Sort achieves a speedup of approximately 50× (sequential: 240.8 ms, parallel: 4.83 ms) on 10 million random elements. Quick Sort and Merge Sort have 97× and 103× speedups, respectively (Quick: 1461.97 ms vs. 15.1 ms; Merge: 2212.33 ms vs. 21.4 ms). Bubble Sort, while significantly improving in parallel (123,321.9 ms to 7377.8 ms for an ≈17× improvement), is considerably worse overall. Slow Sort demonstrates a moderate but consistent acceleration, reducing execution time from 74.07 ms in the sequential version to 3.99 ms on the GPU, yielding an ≈18.6× speedup. These experimental findings confirm that the new single-GPU implementations achieve speedups ranging from 17× to over 100×, surpassing the typical gains reported for previous generations and matching or exceeding the acceleration rates reported for cutting-edge parallel sorting algorithms in recent studies.
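The speedup figures quoted in the abstract above follow directly from the reported timings, since speedup is sequential time divided by parallel time. The short sketch below recomputes them; the dictionary name `timings_ms` is an illustrative choice, and the numbers are copied from the abstract.

```python
# (sequential ms, GPU ms) as reported in the abstract
timings_ms = {
    "Radix Sort": (240.8, 4.83),
    "Quick Sort": (1461.97, 15.1),
    "Merge Sort": (2212.33, 21.4),
    "Bubble Sort": (123321.9, 7377.8),
    "Slow Sort": (74.07, 3.99),
}

# Speedup = sequential time / parallel time
speedups = {name: seq / par for name, (seq, par) in timings_ms.items()}

for name, s in speedups.items():
    print(f"{name}: {s:.1f}x")
```

Rounded, the ratios agree with the abstract's ≈50×, 97×, 103×, ≈17× and ≈18.6×.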

Journal Article

GPU parallel acceleration of transient simulations of open channel and pipe combined flows

by Meng, W W , Wu, J Y , Cheng, Y G in Central processing units , Computational efficiency , Computer applications

2019

Simulating the transient processes in complex water transmission systems is time-consuming, and there is demand for improving computational efficiency by means of parallelization on CPU clusters or on even faster GPU platforms. This paper proposes an approach to accelerate the transient simulations of open channel and pipe combined flows on a single GPU chip. The Saint-Venant equations for open channel flows are solved by using the method of characteristics (MOC), whose inherent parallelism can be well exploited by GPU implementations in the thread-level parallelism structure of Compute Unified Device Architecture (CUDA). The sub-processes, including open channel computation, pipe flow computation and connecting boundary treatment, are implemented by different kernels. The procedures are first verified by analyzing the parallel computation efficiency of hydraulic transient processes in an open channel. Then the transient processes of a practical engineering project, which involves both open channel flow and pressurized pipe flow, are simulated. The GPU kernels are found to be memory-bandwidth bound, and the proposed single-chip GPU parallelization can achieve speedup ratios of up to several hundred compared to the sequential counterpart on a single CPU chip.

Journal Article

Graph partitioning and graph clustering : 10th DIMACS Implementation Challenge Workshop, February 13-14, 2012, Georgia Institute of Technology, Atlanta, GA

by Bader, David A. , DIMACS Implementation Challenge Workshop in Combinatorics -- Graph theory -- Graph algorithms. msc , Combinatorics -- Graph theory -- Graphs and linear algebra (matrices, eigenvalues, etc.). msc , Combinatorics -- Graph theory -- Hypergraphs. msc

2013

Graph partitioning and graph clustering are ubiquitous subtasks in many applications where graphs play an important role. Generally speaking, both techniques aim at the identification of vertex subsets with many internal and few external edges. To name only a few, problems addressed by graph partitioning and graph clustering algorithms are: What are the communities within an (online) social network? How do I speed up a numerical simulation by mapping it efficiently onto a parallel computer? How must components be organised on a computer chip such that they can communicate efficiently with each other? What are the segments of a digital image? Which functions are certain genes (most likely) responsible for? The 10th DIMACS Implementation Challenge Workshop was devoted to determining realistic performance of algorithms where worst case analysis is overly pessimistic and probabilistic models are too unrealistic. Articles in the volume describe and analyse various experimental data with the goal of getting insight into realistic algorithm performance in situations where analysis fails. This book is published in cooperation with the Center for Discrete Mathematics and Theoretical Computer Science.

eBook

Parallel scientific computing

by Magoules, F , Roux, Francois-Xavier , Houzeaux, G in COMPUTERS , Industrial applications , Industrial engineering

2016,2015

Scientific computing has become an indispensable tool in numerous fields, such as physics, mechanics, biology, finance and industry. For example, it enables us, thanks to efficient algorithms adapted to current computers, to simulate, without the help of models or experimentation, the deflection of beams in bending, the sound level in a theater room or a fluid flowing around an aircraft wing. This book presents the scientific computing techniques applied to parallel computing for the numerical simulation of large-scale problems; these problems result from systems modeled by partial differential equations. Computing concepts will be tackled via examples. Implementation and programming techniques resulting from the finite element method will be presented for direct solvers, iterative solvers and domain decomposition methods, along with an introduction to MPI and OpenMP.

eBook

Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions

by Halko, N. , Martinsson, P. G. , Tropp, J. A. in Algorithmics. Computability. Computer arithmetics , Algorithms , Applied sciences

2011

Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed—either explicitly or implicitly—to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, robustness, and/or speed. These claims are supported by extensive numerical experiments and a detailed error analysis. The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition of an m × n matrix. (i) For a dense input matrix, randomized algorithms require O(mn log(k)) floating-point operations (flops) in contrast to O(mnk) for classical algorithms. (ii) For a sparse input matrix, the flop count matches classical Krylov subspace methods, but the randomized approach is more robust and can easily be reorganized to exploit multi-processor architectures. (iii) For a matrix that is too large to fit in fast memory, the randomized techniques require only a constant number of passes over the data, as opposed to O(k) passes for classical algorithms. In fact, it is sometimes possible to perform matrix approximation with a single pass over the data.

Journal Article

Implementation of Parallel Algorithm Technology for Time Series Data Mining

by Du, Yao , Wang, Mingye , Hu, Xiaohui in Artificial Intelligence , Data Mining , Parallel Algorithms

2021

With the rapid development of computer technology, Internet technology and artificial intelligence technology, the amount of global data has exploded. However, the single-machine serial mode of traditional data mining cannot be directly transplanted to the cloud platform. Only by parallelizing and improving many classic data mining algorithms can the cloud computing platform and data mining be effectively combined. Research on and implementation of parallel algorithm technology for time series data mining is therefore of great significance, and it is the subject of this paper. The paper uses literature review, mathematical statistics, logical analysis and other research methods, mainly to make useful explorations of time series data mining and visualization technology. It embodies the design ideas of big data analysis tools, and finally reflects the power and market value of data analysis tools through the display of the platform. The research shows that, on the same data set and in the same experimental environment, the improved parallel collaborative filtering algorithm ACF proposed in this paper runs faster than the parallel algorithm MCF based on the co-occurrence matrix, and that the time difference becomes more obvious as the data set grows.

Journal Article
