25 results for "parallel execution model"
The Design and Implementation of TIDeFlow: A Dataflow-Inspired Execution Model for Parallel Loops and Task Pipelining
This paper provides an extended description of the design and implementation of the Time Iterated Dependency Flow (TIDeFlow) execution model. TIDeFlow is a dataflow-inspired model that simplifies the scheduling of shared resources on many-core processors. To accomplish this, programs are specified as directed graphs and the dataflow model is extended through the introduction of intrinsic constructs for parallel loops and the arbitrary pipelining of operations. The main contributions of this paper are: (1) a formal description of the TIDeFlow execution model and its programming model, (2) a description of the TIDeFlow implementation and its strengths over previous execution models, such as the ability to natively express parallel loops and task pipelining, (3) an analysis of experimental results showing the advantages of TIDeFlow with respect to expressing parallel programs on many-core architectures and (4) a presentation of the implementation of a low overhead runtime system for TIDeFlow.
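The combination of a directed task graph with intrinsic parallel loops can be illustrated with a small sketch. This is a toy, illustrative model only, not the TIDeFlow runtime; the graph shape, actor names, and fixed iteration count of 4 are assumptions for the example:

```python
from concurrent.futures import ThreadPoolExecutor

def run_graph(graph, bodies, workers=4):
    """Fire each actor once all of its predecessors have completed.

    graph  : dict actor -> list of predecessor actors
    bodies : dict actor -> callable taking an iteration index
    Each actor body runs as a parallel loop over 4 iterations,
    mimicking an intrinsic parallel-loop construct.
    """
    done, order = set(), []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(done) < len(graph):
            # actors whose dependencies are all satisfied
            ready = [a for a in graph
                     if a not in done and all(p in done for p in graph[a])]
            for actor in ready:
                # parallel loop: all iterations of one actor run concurrently
                list(pool.map(bodies[actor], range(4)))
                done.add(actor)
                order.append(actor)
    return order

results = []
graph = {"load": [], "compute": ["load"], "store": ["compute"]}
bodies = {a: (lambda i, a=a: results.append(a)) for a in graph}
order = run_graph(graph, bodies)
```

A real dataflow runtime would fire ready actors concurrently and track per-actor iteration counts; the sequential firing loop here only demonstrates the dependency-driven ordering.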
Language Support for Multi-Paradigm and Multi-Grain Parallelism on Smp-Cluster
Large-scale parallel applications are, in essence, multi-paradigm and multi-grain parallel. The key to improving the performance of a parallel application is choosing parallel paradigms and grains suited to the nature of the practical problem, so it is necessary to provide a multi-paradigm, multi-grain parallel programming interface for developing large-scale parallel application systems. This paper proposes a multi-paradigm and multi-grain parallel execution model that integrates coarse-grain parallelism (across macro tasks), mid-grain parallelism (across basic program blocks), and fine-grain parallelism (within repetition blocks). The model also supports task-parallel, data-parallel, and sequential execution. We discuss the programming mechanism of this model through an extended OpenMP specification; the extensions include computing-resource partitioning, definition of task groups of different grains, mapping of task groups to their respective processor groups, out-of-core computing, asynchronous parallel I/O, and definition of sequential relationships among tasks. We compare the performance of different implementations of a benchmark that use the same numerical algorithm but different programming approaches, including MPI, MPI+OpenMP, and our extended OpenMP, and we discuss a case study based on an SMP-Cluster and network storage architecture.
Amdahl's law in the context of heterogeneous many‐core systems – a survey
For over 50 years, Amdahl's Law has been the hallmark model for reasoning about performance bounds for homogeneous parallel computing resources. As heterogeneous, many‐core parallel resources continue to permeate into the modern server and embedded domains, there has been growing interest in promulgating realistic extensions and assumptions in keeping with newer use cases. This study aims to provide a comprehensive review of the purviews and insights provided by the extensive body of work related to Amdahl's law to date, focusing on computation speedup. The authors show that a significant portion of these studies has looked into analysing the scalability of the model considering both workload and system heterogeneity in real‐world applications. The focus has been to improve the definition and semantic power of the two key parameters in the original model: the parallel fraction (f) and the computation capability improvement index (n). More recently, researchers have shown normal‐form and multi‐fraction extensions that can account for wider ranges of heterogeneity, validated on many‐core systems running realistic workloads. Speedup models from Amdahl's law onwards have seen a wide range of uses, such as the optimisation of system execution, and these uses are even more important with the advent of the heterogeneous many‐core era.
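The two key parameters the survey discusses, the parallel fraction f and the capability improvement n, appear directly in the classic formula, and the heterogeneous extensions modify how the two terms are weighted. Below, `amdahl_speedup` is the standard law; `hetero_speedup` is only a sketch of a Hill-and-Marty-style asymmetric variant (one big core plus many small cores), with parameter names of my own choosing:

```python
def amdahl_speedup(f, n):
    """Classic Amdahl's law: f is the parallel fraction of the
    workload, n the computation capability improvement (e.g. core count)."""
    return 1.0 / ((1.0 - f) + f / n)

def hetero_speedup(f, perf_big, perf_small, n_small):
    """Sketch of an asymmetric (heterogeneous) variant: the serial
    fraction runs on one big core of performance perf_big, the parallel
    fraction on the big core plus n_small small cores combined."""
    serial = (1.0 - f) / perf_big
    parallel = f / (perf_big + n_small * perf_small)
    return 1.0 / (serial + parallel)

# Even with effectively unlimited cores, a 90%-parallel workload
# saturates at 10x, since the 10% serial part dominates:
assert round(amdahl_speedup(0.9, 10_000_000), 2) == 10.0
```

The saturation behaviour in the last line is exactly why the surveyed extensions focus on refining f and n rather than the shape of the formula.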
On Generalizing Divide and Conquer Parallel Programming Pattern
(1) Background: Structuring is important in parallel programming in order to master its complexity, and this structuring can be achieved through programming patterns and skeletons. Divide-and-conquer computation is essentially defined by a recurrence relation that links the solution of a problem to the solutions of subproblems of the same type but smaller sizes. The pattern admits the specification of different types of computations, so it is important to provide a general specification that covers all of its cases. We intend to prove that the divide-and-conquer pattern can be generalized to subsume many other parallel programming patterns, and to this end we provide a general formulation of it. (2) Methods: Starting from the proposed generalized specification of the divide-and-conquer pattern, the computation of the pattern is analyzed in terms of its stages: decomposition, base case, and composition. Examples are provided, and different execution models are analyzed. (3) Results: A general functional specification is provided for the divide-and-conquer pattern, and based on it we prove that this general formulation can be specialized, by instantiating its parameters, into other classical parallel programming patterns. Based on the specific stages of divide-and-conquer, three classes of computations are identified, and in this context an equivalent, efficient bottom-up computation is formally proved. Associated execution models are presented and analyzed for the three classes of divide-and-conquer computations. (4) Conclusion: A more general definition of the divide-and-conquer pattern is provided, which includes an arity list for different decomposition degrees, a level of recursion, and an alternative solution for cases that are not trivial but admit other approaches (sequential or parallel) that could lead to better performance. Together with the associated analysis of pattern equivalence and optimized execution models, this provides a general formulation that is useful at both the semantic and the implementation level.
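The three stages named in the abstract (decomposition, base case, composition) map naturally onto a higher-order skeleton whose parameters can be instantiated into concrete patterns. This is a simplified reading of the idea, not the paper's formal specification; all names are illustrative:

```python
def divide_and_conquer(problem, is_base, base_solve, split, combine):
    """Generic divide-and-conquer skeleton: the pattern is parameterized
    by a base-case test, a base solver, a decomposition, and a composition."""
    if is_base(problem):
        return base_solve(problem)
    subs = split(problem)                       # decomposition stage
    sols = [divide_and_conquer(s, is_base, base_solve, split, combine)
            for s in subs]                      # subproblems could run in parallel
    return combine(sols)                        # composition stage

def merge(parts):
    """Composition for merge sort: merge two sorted sublists."""
    left, right = parts
    out = []
    while left and right:
        out.append(left.pop(0) if left[0] <= right[0] else right.pop(0))
    return out + left + right

# Instantiating the skeleton's parameters yields merge sort:
sorted_xs = divide_and_conquer(
    [5, 2, 9, 1, 7],
    is_base=lambda p: len(p) <= 1,
    base_solve=lambda p: list(p),
    split=lambda p: [p[:len(p) // 2], p[len(p) // 2:]],
    combine=merge,
)
```

Different instantiations of `split` and `combine` give other classical patterns (e.g. map by splitting into singletons, reduce by making `combine` the whole computation), which is the spirit of the generalization claimed above.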
Locality‐protected cache allocation scheme with low overhead on GPUs
Graphics processing units (GPUs) are playing an increasingly important role in parallel computing. With their multi-threaded execution model, GPUs can accelerate many parallel programmes and save energy. In contrast to their strong computing power, GPUs have limited on-chip memory, which is often inadequate. The throughput-oriented execution model of a GPU introduces thousands of hardware threads, which may access the small cache simultaneously, causing cache thrashing and contention that limit GPU performance. Motivated by these issues, the authors put forward a locality-protected method based on the instruction programme counter (LPC) to exploit data locality in the L1 data cache with very low hardware overhead. First, they use a simple program counter (PC)-based locality detector to collect reuse information for each cache line. Then, a hardware-efficient prioritised cache allocation unit is proposed that combines data-reuse information with time-stamp information to predict the reuse possibility of each cache line and to evict the line with the least reuse possibility. Their experiments on a simulator show that LPC provides up to a 17.8% speedup and an average 5.0% improvement over the baseline method with very low overhead.
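The core idea, using per-PC reuse history plus time stamps to pick an eviction victim, can be modelled in a few lines. This is only a software toy of the eviction policy's logic; the paper's LPC unit is a hardware mechanism, and the class and field names here are assumptions:

```python
class ReuseAwareCache:
    """Toy model of reuse-guided eviction: each line remembers the PC
    that loaded it, hits credit that PC with observed reuse, and the
    victim is the line whose loading PC has the least observed reuse
    (ties broken by oldest access time)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = {}      # addr -> last-access time stamp
        self.reuse = {}      # pc   -> observed reuse count
        self.line_pc = {}    # addr -> pc that loaded the line
        self.clock = 0

    def access(self, pc, addr):
        self.clock += 1
        if addr in self.lines:                         # hit
            loader = self.line_pc[addr]
            self.reuse[loader] = self.reuse.get(loader, 0) + 1
            self.lines[addr] = self.clock
            return True
        if len(self.lines) >= self.capacity:           # miss: need a victim
            victim = min(self.lines,
                         key=lambda a: (self.reuse.get(self.line_pc[a], 0),
                                        self.lines[a]))
            del self.lines[victim]
            del self.line_pc[victim]
        self.lines[addr] = self.clock
        self.line_pc[addr] = pc
        return False
```

In this model a line with demonstrated reuse survives streaming accesses from PCs with no reuse history, which is the thrashing-protection effect the abstract describes.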
Verifying Parallel Code After Refactoring Using Equivalence Checking
To take advantage of multi-core systems, programmers are replacing sequential software with parallel software. Software engineers often avoid writing their parallel software from scratch and prefer refactoring their legacy application, either manually or with the help of a refactoring tool. In either case, it is extremely challenging to produce correct parallel code, taking into account all synchronization issues. Furthermore, the complexity of parallel code makes its verification extremely difficult. We introduce a method for the verification of parallel code after refactoring. Our method, which is based on symbolic interpretation, leverages the original sequential code that in most cases was already tested and/or verified, and checks whether it is equivalent to the code after refactoring. The advantage of this method is that it can generically find any problem in the parallel code that does not exist in the original sequential code. As a result, it can help create higher quality and safer parallel code.
Principles to Support Modular Software Construction
The construction of large software systems is always achieved through assembly of independently written components -- program modules. For these software components to work together, they must share a common set of data types and principles for representing structured data such as arrays of values and files. This common set of tools for creating and operating on data objects is provided by the infrastructure of the computer system: the hardware, operating system and runtime code. Because the nature and properties of these tools are crucial for correct operation of software components and their inter-operation, it is essential to have a precise specification that may be used for verifying correctness of application software on one hand, and to verify correctness of system behavior on the other. We call such a specification a program execution model (PXM). It is evident that the properties of the PXM implemented by a computer system can have serious impact on the ability of application programmers to practice modular software construction. This paper discusses the concept of program execution models and presents a set of principles that a PXM must satisfy to provide a sound basis for modular software construction. Because parallel program execution on computer systems with many processing units is an essential part of contemporary computing environments, the expression of parallelism and modular software construction using components involving parallel operations is included in this treatment. The conclusion is that it is possible to build computer systems that implement a PXM within which any parallel program may be used, unmodified, as a component for building more substantial parallel programs.
Context-Aware Prediction Model for Offloading Mobile Application Tasks to Mobile Cloud Environments
Offloading computation-intensive parts of mobile application code to the cloud is a promising way to enhance mobile device performance and reduce battery consumption. Recent work on mobile cloud computing mainly focuses on deciding which parts of an application may be executed remotely, assuming that the mobile and server processors carry no other load, that the mobile battery is fully charged, and that network bandwidth is static. However, the parameters of the mobile cloud environment change continuously. In this paper, the authors propose a new offloading approach that uses cost models to decide at runtime whether to offload execution of the code to the remote cloud. The approach accounts for dynamic changes in the mobile cloud environment in its cost models, and further improves the offloading process by executing independent application tasks in parallel in the cloud. The evaluation results show that the approach reduces execution time and battery consumption by 75% and 55%, respectively, compared with existing offloading approaches.
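A runtime offloading decision of this kind boils down to comparing local execution time against remote execution time plus transfer cost, with the inputs measured at runtime rather than assumed constant. The function below is an illustrative cost model of my own, not the paper's formulation; parameter names and the default RTT are assumptions:

```python
def should_offload(local_time_s, data_bytes, bandwidth_bps,
                   remote_speedup, rtt_s=0.05):
    """Decide at runtime whether to offload a task: offload only when
    remote execution plus data transfer beats local execution.

    local_time_s   : measured local execution time (seconds)
    data_bytes     : state/input to ship to the cloud
    bandwidth_bps  : current measured network bandwidth (bits/s)
    remote_speedup : how much faster the cloud runs this task
    """
    transfer_s = data_bytes * 8 / bandwidth_bps + rtt_s
    remote_s = local_time_s / remote_speedup + transfer_s
    return remote_s < local_time_s
```

Because `bandwidth_bps` and `local_time_s` are sampled each time, the decision adapts to the continuously changing environment parameters the abstract emphasizes: a long task over a fast link offloads, while a short task behind a slow link stays local.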
Optimized OpenCL™ kernels for frequency domain image high-boost filters using image vectorization technique
Image high-boost filtering uses high-boost filters to enhance image quality; it is used in remote sensing, satellite broadcasting, classroom monitoring, and many other real-time video processing applications, and therefore requires fast implementations. OpenCL is a widely adopted parallel programming framework that provides core-level data parallelism and targets heterogeneous parallel devices ranging from low-cost DSPs to high-end CPUs, GPUs, and FPGAs. In this article, we consider the commonly used Ideal, Gaussian, Butterworth, and Laplacian-of-Gaussian frequency-domain high-boost filters and implement channelized OpenCL kernels for their rapid execution. In addition, these kernels are modified using an image vectorization technique that halves their execution time. Finally, a performance analysis of the two kernel implementations is carried out to determine their effectiveness with regard to time consumption and accuracy. Different image performance evaluation metrics, including entropy, standard deviation, mean absolute error, percentage fit error, SSIM, correlation, and peak signal-to-noise ratio, are applied to measure the correctness of the above high-boost filters. From the results, we conclude that the vectorized Butterworth high-boost filter kernel provides the best results among these filters and may be highly suitable for time-bound real-time applications on various embedded devices.
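A frequency-domain high-boost filter scales each frequency component by H(u) = (A - 1) + H_hp(u), where H_hp is a high-pass transfer function. The sketch below shows the Gaussian variant in 1-D pure Python so it stays self-contained; the paper's kernels operate on 2-D images in OpenCL, and the function and parameter names here are illustrative:

```python
import cmath

def dft(x):
    """Naive O(N^2) discrete Fourier transform (fine for a tiny demo)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def idft(X):
    """Inverse DFT, returning real parts of the reconstructed samples."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N)
                for k in range(N)).real / N for n in range(N)]

def high_boost_1d(signal, cutoff_d0, amplify_a):
    """Gaussian high-boost filtering in the frequency domain:
    H(u) = (A - 1) + (1 - exp(-D(u)^2 / (2 * D0^2)))."""
    N = len(signal)
    X = dft(signal)
    out = []
    for u in range(N):
        d = min(u, N - u)            # distance from the zero frequency
        h_hp = 1.0 - cmath.exp(-(d * d) / (2.0 * cutoff_d0 ** 2)).real
        out.append(X[u] * ((amplify_a - 1.0) + h_hp))
    return idft(out)
```

With A = 2 and a very large cutoff D0 the high-pass term vanishes and H(u) is about 1, so the signal passes through unchanged; smaller D0 values sharpen by amplifying high frequencies. A vectorized OpenCL kernel applies the same per-frequency multiply across pixel vectors instead of scalars.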
Quality risk prediction at a non-sampling station machine in a multi-product, multi-stage, parallel processing manufacturing system subjected to sequence disorder and multiple stream effects
Quality risks determined by inspection economies represent a variable that is difficult to control in complex manufacturing environments. Planning a quality strategy without being able to predict its effectiveness at every station of a system may eventually lead to a loss of time, money, and resources. Using one station to regularly select the samples for a production segment introduces significant complexity into the analysis of the available quality measurements when they are referred to the other stations in that segment. The multiple streams of product through the parallel machines of the stations, together with cycle-time randomness, which varies the item sequence order at each production step, nullify the regularity of the sampling patterns at the machines of the non-sampling stations. This work develops a fundamental model that supports the prediction of the 'quality risk', at a given machine in the non-sampling stations, associated with a particular sampling policy for a multi-product, multi-stage, parallel-processing manufacturing system subject to sequence disorder and multiple-stream effects. The rationale on which the model is based, and its successful application to scenarios structurally different from those used for its development, give confidence in the general validity of the model proposed here for quality-risk prediction at non-sampling station machines.