Catalogue Search | MBRL
Search Results
Explore the vast range of titles available.
10,501 result(s) for "parallel computing"
Automated prioritizing heuristics for parallel task graph scheduling in heterogeneous computing
2022
High-performance computing (HPC) relies increasingly on heterogeneous hardware, and especially on the combination of central and graphics processing units. The task-based method has demonstrated promising potential for parallelizing applications on such computing nodes. With this approach, the scheduling strategy becomes a critical layer that decides where and when ready tasks should be executed among the processing units. In this study, we describe a heuristic-based approach that assigns priorities to each task type. We rely on a fitness score for each task/worker combination to generate priorities, and use these to configure the Heteroprio scheduler automatically within the StarPU runtime system. We evaluate our method's theoretical performance on emulated executions and its real-case performance on multiple different HPC applications. We show that our approach is usually equivalent to or faster than expert-defined priorities.
Journal Article
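The fitness-score idea in the abstract above can be sketched as follows. This is a hypothetical minimal example: the function names, the toy timing data, and the scoring formula (speed relative to the average worker) are illustrative assumptions, not the authors' Heteroprio/StarPU implementation.

```python
# Hypothetical sketch of fitness-score-based priority generation
# (names and the scoring formula are illustrative assumptions,
# not the Heteroprio/StarPU implementation).
def assign_priorities(task_types, worker_types, exec_time):
    """exec_time[(t, w)]: measured runtime of task type t on worker type w."""
    priorities = {}
    for w in worker_types:
        def fitness(t):
            # How much faster this worker type is than the average worker
            # for task type t; higher fitness -> higher priority (rank 0).
            avg = sum(exec_time[(t, v)] for v in worker_types) / len(worker_types)
            return avg / exec_time[(t, w)]
        ranked = sorted(task_types, key=fitness, reverse=True)
        priorities[w] = {t: rank for rank, t in enumerate(ranked)}
    return priorities

times = {("gemm", "gpu"): 1.0, ("gemm", "cpu"): 8.0,
         ("io", "gpu"): 5.0, ("io", "cpu"): 2.0}
prio = assign_priorities(["gemm", "io"], ["gpu", "cpu"], times)
# GPUs rank gemm first (rank 0); CPUs rank io first.
```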
Deploying and Optimizing Embodied Simulations of Large-Scale Spiking Neural Networks on HPC Infrastructure
by Yamaura, Hiroshi; Albanese, Ugo; Retamino, Eloy
in Brain architecture; Brain research; Cognitive ability
2022
Simulating the brain-body-environment trinity in closed loop is an attractive proposal to investigate how perception, motor activity and interactions with the environment shape brain activity, and vice versa. The relevance of this embodied approach, however, hinges entirely on the modeled complexity of the various simulated phenomena. In this article, we introduce a software framework that is capable of simulating large-scale, biologically realistic networks of spiking neurons embodied in a biomechanically accurate musculoskeletal system that interacts with a physically realistic virtual environment. We deploy this framework on the high-performance computing resources of the EBRAINS research infrastructure and investigate its scaling performance by distributing computation across an increasing number of interconnected compute nodes. Our architecture is based on requested compute nodes as well as persistent virtual machines; this provides a high-performance simulation environment that is accessible to multi-domain users without expert knowledge, with a view to enabling users to instantiate and control simulations at custom scale via a web-based graphical user interface. Our simulation environment, entirely open source, is based on the Neurorobotics Platform developed in the context of the Human Brain Project, and on the NEST simulator. We characterize the capabilities of our parallelized architecture for large-scale embodied brain simulations through two benchmark experiments, investigating the effects of scaling compute resources on performance defined in terms of experiment runtime, brain instantiation and simulation time. The first benchmark is based on a large-scale balanced network, while the second is a multi-region embodied brain simulation consisting of more than a million neurons and a billion synapses. Both benchmarks clearly show that scaling compute resources improves the aforementioned performance metrics in a near-linear fashion. The second benchmark in particular is indicative of both the potential and the limitations of a highly distributed simulation in terms of a trade-off between computation speed and resource cost. Our simulation architecture is being prepared to be made accessible to everyone as an EBRAINS service, thereby offering a community-wide tool with a unique workflow that should provide momentum to the investigation of closed-loop embodiment within the computational neuroscience community.
Journal Article
Segmented Merge: A New Primitive for Parallel Sparse Matrix Computations
by Ji, Haonan; Liu, Weifeng; Hou, Kaixi
in Algorithms; Massively parallel processors; Multiplication
2021
Segmented operations, such as segmented sum, segmented scan and segmented sort, are important building blocks for parallel irregular algorithms. In this work, we propose a new parallel primitive called segmented merge, which merges, in parallel, q sorted sub-segments into p segments, both of possibly nonuniform lengths that easily cause load-balancing and vectorization problems on massively parallel processors such as GPUs. Our algorithm resolves these problems by first recording the boundaries of segments and sub-segments, then assigning roughly the same number of elements to each GPU thread, and finally iteratively merging the sub-segments within each segment in a binary-tree fashion until only one sub-segment remains per segment. We implement the segmented merge primitive on GPUs and demonstrate its efficiency on parallel sparse matrix transposition (SpTRANS) and sparse matrix–matrix multiplication (SpGEMM) operations. We conduct a comparative experiment with the NVIDIA vendor library on two GPUs. The experimental results show that our algorithm achieves on average 3.94× (up to 13.09×) and 2.89× (up to 109.15×) speedups on SpTRANS and SpGEMM, respectively.
Journal Article
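The segmented merge primitive described in the abstract above can be sketched sequentially. This is an illustrative Python sketch, not the paper's GPU implementation: the GPU version additionally balances elements across threads and merges sub-segments pairwise in a binary-tree pattern rather than in one k-way pass.

```python
# Sequential sketch of the segmented merge primitive (illustrative only;
# the paper's GPU version balances work across threads and merges
# sub-segments pairwise in a binary-tree pattern).
import heapq

def segmented_merge(sub_segments, seg_of):
    """Merge q sorted sub-segments into p sorted segments.
    seg_of[i] gives the target segment index of sub-segment i."""
    p = max(seg_of) + 1
    segments = [[] for _ in range(p)]
    for i, sub in enumerate(sub_segments):
        segments[seg_of[i]].append(sub)
    # heapq.merge performs a k-way merge of already-sorted inputs.
    return [list(heapq.merge(*subs)) for subs in segments]

out = segmented_merge([[1, 4], [2, 3], [5, 9], [6]], [0, 0, 1, 1])
# → [[1, 2, 3, 4], [5, 6, 9]]
```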
Performance evaluation of the inverse real-valued fast Fourier transform on field programmable gate array platforms using open computing language
by Yang, Sida; Qian, Zhuo; Liu, Li
in Algorithms; Central processing units; Comparative analysis
2025
The real-valued fast Fourier transform (RFFT) is well-suited for high-speed, low-power FFT processors, as it requires approximately half the arithmetic operations of the traditional complex-valued FFT (CFFT). While the RFFT can be computed using CFFT hardware, a dedicated RFFT implementation offers advantages such as lower hardware complexity, reduced power consumption, and higher throughput. However, unlike the CFFT, the irregular signal flow graph of the RFFT presents challenges in designing efficient pipelined architectures. In our previous work, we proposed a high-level programming approach using Open Computing Language (OpenCL) to implement forward RFFT architectures on Field-Programmable Gate Arrays (FPGAs). In this article, we propose a high-level programming approach to implement inverse RFFT architectures on FPGAs. By identifying regular computational patterns in the inverse RFFT flow graph, our method efficiently expresses the algorithm using a for loop, which is later fully unrolled by high-level synthesis tools to automatically generate a pipelined architecture. Experiments show that for a 4,096-point inverse RFFT, the proposed method achieves a 2.36x speedup and 2.92x better energy efficiency over CUDA FFT (CUFFT) on Graphics Processing Units (GPUs), and a 24.91x speedup and 18.98x better energy efficiency over the Fastest Fourier Transform in the West (FFTW) on Central Processing Units (CPUs). Compared to Intel's CFFT design on the same FPGA, the proposed design uses 9% fewer logic resources while achieving a 1.39x speedup. These results highlight the effectiveness of our approach in optimizing RFFT performance on FPGA platforms.
Journal Article
Predictive Simulation for Surface Fault Occurrence Using High-Performance Computing
by Sawada, Masataka; Haba, Kazumoto; Hori, Muneo
in Boundary value problems; Earthquakes; fault displacement
2022
Numerical simulations based on continuum mechanics are promising methods for the estimation of surface fault displacements. We developed a parallel finite element method program to perform such simulations and applied the program to reproduce the 2016 Kumamoto earthquake, where surface rupture was observed. We constructed an analysis model of the 5 × 5 × 1 km domain, including primary and secondary faults, and inputted the slip distribution of the primary fault, which was obtained through inversion analysis and the elastic theory of dislocation. The simulated slips on the surface were in good agreement with the observations. We then conducted a predictive simulation by inputting the slip distributions of the primary fault, which were determined using a strong ground motion prediction method for an earthquake with a specified source fault. In this simulation, no surface slip was induced in the sub-faults. A large surface slip area must be established near a sub-fault to induce the occurrence of a slip on the surface.
Journal Article
Setting Up and Implementation of the Parallel Computing Cluster in Higher Education
by Serik, Meruert; Karelkhan, Nursaule; Kultan, Jaroslav
in Comparative Analysis; Education; Information Sources
2019
In this article, we describe in detail the setup and implementation of a parallel computing cluster for education in the Matlab environment, and how we solved the problems that arose along the way. We also present a comparative analysis of the cluster using the example of multiplying a large-dimension matrix by a vector. Calculations were first performed on a single computer, and then on the parallel computing cluster. In the experiment, we demonstrated the effectiveness of parallel computing and the need to set up such a cluster. We hope that the creation of a parallel computing cluster for education will help in teaching parallel computing at higher education institutions that do not have sufficient hardware resources. This paper presents a unique setup and implementation of a parallel computing cluster for teaching and learning a parallel computing course, along with a wide variety of information sources from which instructors can choose.
Journal Article
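The matrix-vector experiment described above can be roughly illustrated with a row-parallel product. This sketch uses Python's multiprocessing as an assumption for illustration; the authors ran their experiments on a Matlab parallel computing cluster.

```python
# Illustrative row-parallel matrix-vector product using Python's
# multiprocessing (an assumption for illustration; the authors used
# a Matlab parallel computing cluster).
from multiprocessing import Pool

def row_dot(args):
    row, vec = args
    return sum(a * b for a, b in zip(row, vec))

def matvec_parallel(matrix, vec, workers=2):
    # Each row's dot product with the vector is an independent task,
    # so rows can be distributed over a pool of worker processes.
    with Pool(workers) as pool:
        return pool.map(row_dot, [(row, vec) for row in matrix])

if __name__ == "__main__":
    y = matvec_parallel([[1, 2], [3, 4]], [1, 1])
    # → [3, 7]
```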
Enhanced Serpent algorithm using Lorenz 96 Chaos-based block key generation and parallel computing for RGB image encryption
by Elshoush, Huwaida T.; Al-Tayeb, Banan M.; Obeid, Khalil T.
in Algorithms; Algorithms and Analysis of Algorithms; Analysis
2021
This paper presents a new approach to enhance the security and performance of the Serpent algorithm. The main concept of this approach is to generate a subkey for each block using Lorenz 96 chaos and then to run encryption and decryption in parallel ECB mode. The proposed method was implemented in Java (OpenJDK version 11.0.11), and Python 3.6 was used for the analysis of the tested RGB images. Comprehensive experiments on widely used metrics demonstrate the effectiveness of the proposed method against differential, brute-force and statistical attacks, while achieving superb results compared to related schemes. The encryption quality, Shannon entropy, correlation coefficients, histogram analysis and differential analysis all yielded affirmative results, and the reduction in encryption/decryption time was over 61%. Furthermore, the cipher was tested with the Statistical Test Suite (STS) recommended by NIST and passed all of its tests, confirming the randomness of the cipher output. Thus, the approach demonstrates the potential of the improved Serpent-ECB algorithm with Lorenz 96 chaos-based block key generation (BKG), giving favorable results compared to existing encryption schemes.
Journal Article
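The per-block key-generation idea above can be sketched as follows. The Lorenz 96 equations are standard, but the integration parameters, seeding, and byte extraction here are hypothetical choices for illustration; the paper's actual BKG scheme and the Serpent key schedule are not reproduced.

```python
# Hypothetical sketch of Lorenz 96 chaos-driven per-block key generation
# (parameters and byte extraction are illustrative assumptions, not the
# paper's exact BKG scheme; no Serpent key schedule is reproduced).
def lorenz96_step(x, F=8.0, dt=0.01):
    # One forward-Euler step of the Lorenz 96 system:
    # dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F
    n = len(x)
    dx = [(x[(i + 1) % n] - x[(i - 2) % n]) * x[(i - 1) % n] - x[i] + F
          for i in range(n)]
    return [xi + dt * dxi for xi, dxi in zip(x, dx)]

def block_keys(seed, n_blocks, key_bytes=32):
    # Start near the fixed point x_i = F with a seed-dependent perturbation;
    # the chaotic dynamics amplify it into distinct per-block states.
    x = [8.0 + 0.01 * (seed + i) for i in range(key_bytes)]
    keys = []
    for _ in range(n_blocks):
        for _ in range(50):          # iterate to decorrelate successive keys
            x = lorenz96_step(x)
        keys.append(bytes(int(abs(v) * 1e6) % 256 for v in x))
    return keys

ks = block_keys(seed=1, n_blocks=4)   # four distinct 32-byte block keys
```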
Locality-aware task scheduling for homogeneous parallel computing systems
by Amin, Sarah; Popov, Konstantin; Bhatti, Muhammad Khurram
in Algorithms; Energy consumption; Graphs
2018
In systems with a complex many-core cache hierarchy, exploiting data locality can significantly reduce the execution time and energy consumption of parallel applications. Locality can be exploited at various hardware and software layers. For instance, by implementing private and shared caches in a multi-level fashion, recent hardware designs are already optimised for locality. However, this is of little use if software scheduling does not cast the execution in a manner that exploits the locality available in the programs themselves. Since programs for parallel systems consist of tasks executed simultaneously, task scheduling becomes crucial for performance in multi-level cache architectures. This paper presents a heuristic algorithm for homogeneous multi-core systems called locality-aware task scheduling (LeTS). The LeTS heuristic is a work-conserving algorithm that takes into account both locality and load balancing in order to reduce the execution time of target applications. The working principle of LeTS is based on two distinct phases, namely the working task group formation phase (WTG-FP) and the working task group ordering phase (WTG-OP). The WTG-FP forms groups of tasks in order to capture data reuse across tasks, while the WTG-OP determines an optimal order of execution for task groups that minimizes the reuse distance of shared data between tasks. We have performed experiments using randomly generated task graphs by varying three major performance parameters, namely: (1) the communication-to-computation ratio (CCR), between 0.1 and 1.0; (2) the application size, i.e., task graphs comprising 50, 100, and 300 tasks per graph; and (3) the number of cores, with 2-, 4-, 8-, and 16-core execution scenarios. We have also performed experiments using selected real-world applications. The LeTS heuristic reduces the overall execution time of applications by exploiting inter-task data locality. Results show that LeTS outperforms state-of-the-art algorithms in amortizing inter-task communication cost.
Journal Article
Randomized Progressive Hedging methods for multi-stage stochastic programming
by Malick, Jérôme; Iutzeler, Franck; Bareilles, Gilles
in Algorithms; Fixed points (mathematics); Iterative methods
2020
Progressive Hedging is a popular decomposition algorithm for solving multi-stage stochastic optimization problems. A computational bottleneck of this algorithm is that all scenario subproblems have to be solved at each iteration. In this paper, we introduce randomized versions of the Progressive Hedging algorithm that are able to produce new iterates as soon as a single scenario subproblem is solved. Building on the relation between Progressive Hedging and monotone operators, we leverage recent results on randomized fixed-point methods to derive and analyze the proposed methods. Finally, we release the corresponding code as an easy-to-use Julia toolbox and report computational experiments showing the practical interest of randomized algorithms, notably in a parallel context. Throughout the paper, we pay special attention to presentation, stressing the main ideas and avoiding extra technicalities, in order to make the randomized methods accessible to a broad audience in the Operations Research community.
Journal Article
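The Progressive Hedging iteration mentioned above can be sketched on a toy problem. This is a minimal classical (synchronous) sketch with an assumed quadratic scenario cost f_s(x) = (x - a_s)^2, whose consensus solution is mean(a); the paper's contribution, not reproduced here, is randomized variants that solve a single scenario subproblem per iteration.

```python
# Minimal sketch of classical Progressive Hedging on a toy problem with
# scenario costs f_s(x) = (x - a_s)^2; the consensus solution is mean(a).
# (Illustrative assumption, not the authors' randomized Julia toolbox,
# whose variants update one scenario subproblem per iteration.)
def progressive_hedging(a, r=1.0, iters=300):
    n = len(a)
    w = [0.0] * n            # nonanticipativity multipliers (sum stays 0)
    xbar = 0.0               # arbitrary initial consensus guess
    for _ in range(iters):
        # Scenario subproblems have a closed form here:
        # argmin_x (x - a_s)^2 + w_s*x + (r/2)*(x - xbar)^2
        x = [(2 * a_s - w_s + r * xbar) / (2 + r) for a_s, w_s in zip(a, w)]
        xbar = sum(x) / n                                   # consensus step
        w = [w_s + r * (x_s - xbar) for w_s, x_s in zip(w, x)]  # dual update
    return xbar

x_star = progressive_hedging([1.0, 2.0, 6.0])
# converges to mean(a) = 3.0
```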
A reformed task scheduling algorithm for heterogeneous distributed systems with energy consumption constraints
by Hu, Yikun; Li, Jinghong; He, Ligang
in Advances in Parallel and Distributed Computing for Neural Computing; Algorithms; Artificial Intelligence
2020
As their scale increases and performance improves, the energy consumption of high-performance computer systems is rapidly rising, and energy-aware task scheduling for such systems has become a hot topic for major supercomputing centers and data centers. In this paper, we study the task scheduling problem of minimizing the schedule length of parallel applications while satisfying energy constraints in heterogeneous distributed systems. Existing approaches mainly allocate unassigned tasks with minimal energy consumption, which in most cases cannot achieve an optimal schedule length. To address this, we propose a reformed scheduling algorithm with an energy consumption constraint, which uses an energy consumption level to pre-allocate energy consumption for unassigned tasks. The experimental results show that, compared with existing algorithms, our new algorithm achieves a better schedule length under the energy consumption constraints.
Journal Article