Catalogue Search | MBRL
Explore the vast range of titles available.
676 result(s) for "Supercomputers Programming."
High Performance Computing
2010
Covering all three levels of parallelism, this book presents techniques that address performance issues in the programming of HPC applications. Drawing on their experience with AMD chips and with Cray Inc. systems, interconnects, and software, the authors explore the problems that create bottlenecks in attaining good performance. After discussing architectural and software challenges, they outline a strategy for porting and optimizing an existing application to a large MPP system. They also introduce the use of GPGPUs for carrying out HPC computations.
Development of Parallel Methods for a 1024-Processor Hypercube
by Benner, Robert E.; Montry, Gary R.; Gustafson, John L.
in Algorithms, Applied sciences, Approximation
1988
We have developed highly efficient parallel solutions for three practical, full-scale scientific problems: wave mechanics, fluid dynamics, and structural analysis. Several algorithmic techniques are used to keep communication and serial overhead small as both problem size and number of processors are varied. A new parameter, operation efficiency, is introduced that quantifies the tradeoff between communication and redundant computation. A 1024-processor MIMD ensemble is measured to be 502 to 637 times as fast as a single processor when problem size for the ensemble is fixed, and 1009 to 1020 times as fast as a single processor when problem size per processor is fixed. The latter measure, denoted scaled speedup, is developed and contrasted with the traditional measure of parallel speedup. The scaled-problem paradigm better reveals the capabilities of large ensembles, and permits detection of subtle hardware-induced load imbalances (such as error correction and data-dependent MFLOPS rates) that may become increasingly important as parallel processors increase in node count. Sustained performance for the applications is 70 to 130 MFLOPS, validating the massively parallel ensemble approach as a practical alternative to more conventional processing methods. The techniques presented appear extensible to even higher levels of parallelism than the 1024-processor level explored here.
Journal Article
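The scaled-speedup measure contrasted with traditional (fixed-size) speedup in the abstract above can be sketched as two simple models. This is an illustrative sketch only, assuming a serial fraction `s` of the work; the numbers are not the paper's measured data.

```python
# Fixed-size (Amdahl-style) vs scaled (Gustafson-style) speedup models,
# as contrasted in the abstract. s = serial fraction, p = processor count.

def fixed_size_speedup(s, p):
    """Problem size held fixed as processors are added."""
    return 1.0 / (s + (1.0 - s) / p)

def scaled_speedup(s, p):
    """Problem size per processor held fixed (scaled speedup)."""
    return s + (1.0 - s) * p

p = 1024
print(f"fixed-size speedup (s=1%): {fixed_size_speedup(0.01, p):.1f}")
print(f"scaled speedup     (s=1%): {scaled_speedup(0.01, p):.1f}")
```

Even a 1% serial fraction caps fixed-size speedup far below 1024, while the scaled measure stays near the full processor count, which is why the scaled-problem paradigm better reveals the capabilities of large ensembles.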
A Note on Downdating the Cholesky Factorization
by van Dooren, P.; Brent, R. P.; de Hoog, F. R.
in Algorithms, Error analysis, Exact sciences and technology
1987
We analyse and compare three algorithms for "downdating" the Cholesky factorization of a positive definite matrix. Although the algorithms are closely related, their numerical properties differ. Two algorithms are stable in a certain "mixed" sense while the other is unstable. In addition to comparing the numerical properties of the algorithms, we compare their computational complexity and their suitability for implementation on parallel or vector computers.
Journal Article
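To make the downdating problem concrete: given an upper-triangular factor $R$ with $A = R^T R$, one seeks $\tilde R$ with $\tilde R^T \tilde R = A - xx^T$. The sketch below uses one common approach (LINPACK-style hyperbolic rotations); the paper compares three related algorithms, and this is not necessarily any one of them.

```python
import numpy as np

def cholesky_downdate(R, x):
    """Given upper-triangular R with A = R.T @ R, return upper-triangular
    R1 with R1.T @ R1 = A - outer(x, x), via hyperbolic rotations.
    Requires A - x x^T to remain positive definite."""
    R = R.astype(float).copy()
    x = x.astype(float).copy()
    n = R.shape[0]
    for k in range(n):
        r = np.sqrt(R[k, k]**2 - x[k]**2)     # NaN here => not downdatable
        c, s = r / R[k, k], x[k] / R[k, k]
        R[k, k] = r
        R[k, k+1:] = (R[k, k+1:] - s * x[k+1:]) / c
        x[k+1:] = c * x[k+1:] - s * R[k, k+1:]
    return R

# Usage: downdate, then check against the desired identity.
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 4))
A = B.T @ B + np.eye(4)              # positive definite
R = np.linalg.cholesky(A).T          # upper-triangular factor
x = 0.1 * rng.standard_normal(4)     # small, so A - x x^T stays PD
R1 = cholesky_downdate(R, x)
assert np.allclose(R1.T @ R1, A - np.outer(x, x))
```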
A Parallel and Vector Variant of the Cyclic Reduction Algorithm
1988
The Buneman variant of the block cyclic reduction algorithm begins as a highly parallel algorithm, but collapses with each reduction to a very serial one. Using partial fraction expansions of rational matrix functions, it is shown how to regain the parallelism. The resulting algorithm using $n^2$ processors runs in $O(\log^2 n)$ time.
Journal Article
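The key trick in the abstract, rewriting a product of shifted inverses as a sum of independent shifted solves via a scalar partial-fraction expansion, can be verified on a toy example. The matrix and shifts below are illustrative assumptions; the only requirement is that the shifts be distinct and away from the spectrum.

```python
import numpy as np

# Product of shifted inverses, evaluated serially vs. as a partial-fraction
# sum of independent solves (each solve could run in parallel):
#   prod_i (A - t_i I)^{-1} v  ==  sum_i alpha_i (A - t_i I)^{-1} v,
#   alpha_i = 1 / prod_{j != i} (t_i - t_j),  for distinct shifts t_i.

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
v = rng.standard_normal(n)
taus = np.array([5.0, 8.0, 12.0])    # distinct shifts, outside the spectrum

# Serial evaluation: one solve feeds the next.
y_serial = v.copy()
for t in taus:
    y_serial = np.linalg.solve(A - t * np.eye(n), y_serial)

# Partial fractions: every solve uses the same right-hand side v.
y_parallel = np.zeros(n)
for i, ti in enumerate(taus):
    alpha = 1.0 / np.prod([ti - tj for j, tj in enumerate(taus) if j != i])
    y_parallel += alpha * np.linalg.solve(A - ti * np.eye(n), v)

assert np.allclose(y_serial, y_parallel)
```

Because all the factors are rational functions of the same matrix $A$, they commute, so the scalar identity carries over; this is what lets the collapsed serial tail of cyclic reduction be re-parallelized.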
A Parallel Triangular Solver for a Distributed-Memory Multiprocessor
by Li, Guangye; Coleman, Thomas F.
in Algorithms, Applied mathematics, Exact sciences and technology
1988
We consider solving triangular systems of linear equations on a distributed-memory multiprocessor which allows for a ring embedding. Specifically, we propose a parallel algorithm, applicable when the triangular matrix is distributed by column in a wrap fashion. Numerical experiments indicate that the new algorithm is very efficient in some circumstances (in particular, when the size of the problem is sufficiently large relative to the number of processors). A theoretical analysis confirms that the total running time varies linearly, with respect to the matrix order, up to a threshold value of the matrix order, after which the dependence is quadratic. Moreover, we show that total message traffic is essentially the minimum possible. Finally, we describe an analogous row-oriented algorithm.
Journal Article
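The column-wrap distribution in the abstract can be sketched with a column-oriented substitution in which column $j$ is owned by processor $j \bmod p$. This is a minimal single-process simulation of the data distribution only; the ring communication the paper analyzes is not modeled, and the processor count is an assumption.

```python
import numpy as np

def wrapped_column_solve(L, b, p=4):
    """Solve L y = b (L lower triangular) column by column, tracking which
    'processor' owns each column under wrap mapping: owner(j) = j % p."""
    n = L.shape[0]
    y = b.astype(float).copy()
    columns_done = [0] * p               # per-processor column count
    for j in range(n):
        y[j] /= L[j, j]                  # owner of column j finishes y[j]
        columns_done[j % p] += 1
        y[j+1:] -= L[j+1:, j] * y[j]     # column update by the owner of j
    return y, columns_done

rng = np.random.default_rng(2)
n = 8
L = np.tril(rng.standard_normal((n, n))) + 4 * np.eye(n)
b = rng.standard_normal(n)
y, columns_done = wrapped_column_solve(L, b, p=4)
assert np.allclose(L @ y, b)
assert columns_done == [2, 2, 2, 2]      # wrap mapping balances the columns
```

The wrap mapping keeps work balanced even though later columns of a triangular matrix carry less work, which is one reason the column-wrap layout suits this algorithm.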
A Nearly Optimal Parallel Algorithm for Constructing Depth First Spanning Trees in Planar Graphs
by He, Xin; Yesha, Yaacov
in Algorithmics. Computability. Computer arithmetics, Algorithms, Applied sciences
1988
This paper presents a parallel algorithm for constructing depth first spanning trees in planar graphs. The algorithm takes $O(\log^2 n)$ time with $O(n)$ processors on a concurrent read concurrent write parallel random access machine (PRAM). The best previously known algorithm for the problem takes $O(\log^3 n)$ time with $O(n^4)$ processors on a PRAM. Our algorithm is within an $O(\log^2 n)$ factor of optimality.
Journal Article
On Maintaining Dynamic Information in a Concurrent Environment
This paper considers the amount of cooperation required for independent asynchronous processes to share a simple dynamic data structure. We present a scheme for designing efficient concurrent algorithms to add and remove elements from a shared pool of elements. The efficiency is measured mainly by the number of non-local operations that a process may have to make. Non-local operations may involve writing into a shared variable, locking, or sending a message, hence they introduce interference (or require cooperation). We derive upper and lower bounds on the interference in the worst case. Applications to distributed computation are also discussed.
Journal Article
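The interference measure in the abstract, the number of non-local operations a process must make, can be made concrete with a deliberately naive baseline: a shared pool guarded by a single lock, where every add or remove costs one non-local operation (a lock acquisition). The class and counter below are illustrative assumptions, not the paper's scheme, which is designed to do better than this baseline.

```python
import threading

class LockedPool:
    """Baseline shared pool: every operation acquires the shared lock,
    i.e. performs exactly one non-local operation."""
    def __init__(self):
        self._items = []
        self._lock = threading.Lock()
        self.nonlocal_ops = 0            # lock acquisitions = interference

    def add(self, x):
        with self._lock:
            self.nonlocal_ops += 1
            self._items.append(x)

    def remove(self):
        with self._lock:
            self.nonlocal_ops += 1
            return self._items.pop() if self._items else None

pool = LockedPool()
threads = [threading.Thread(target=pool.add, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert pool.nonlocal_ops == 8            # one non-local op per add
assert sorted(pool._items) == list(range(8))
```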
Necessary and Sufficient Conditions for the Existence of Local Matrix Decompositions
1988
Let $D = (V, E)$ be a directed graph with $n$ vertices. We define the notion of a local matrix with respect to $D$ and we show that every $n \times n$ matrix, over the real or complex numbers, can be factored into a product of local matrices with respect to $D$ if and only if $D$ is strongly connected and contains all loops. We discuss the significance of this result with respect to parallel computation of linear transforms on SIMD processor arrays. We observe that the result can be used to associate with certain irreducible $n \times n$ matrices a generating set of the semigroup of all $n \times n$ matrices under matrix multiplication.
Journal Article
A Fully Parallel Algorithm for the Symmetric Eigenvalue Problem
1987
In this paper we present a parallel algorithm for the symmetric algebraic eigenvalue problem. The algorithm is based upon a divide and conquer scheme suggested by Cuppen for computing the eigensystem of a symmetric tridiagonal matrix. We extend this idea to obtain a parallel algorithm that retains a number of active parallel processes that is greater than or equal to the initial number throughout the course of the computation. We give a new deflation technique which together with a robust root finding technique will assure computation of an eigensystem to full accuracy in the residuals and in the orthogonality of eigenvectors. A brief analysis of the numerical properties and sensitivity to round off error is presented to indicate where numerical difficulties may occur. The algorithm is able to exploit parallelism at all levels of the computation and is well suited to a variety of architectures. Computational results are presented for several machines. These results are very encouraging with respect to both accuracy and speedup. A surprising result is that the parallel algorithm, even when run in serial mode, can be significantly faster than the previously best sequential algorithm on large problems, and is effective on moderate size problems when run in serial mode.
Journal Article
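The divide step of the divide-and-conquer scheme described above can be sketched: a symmetric tridiagonal $T$ is torn at position $k$ into two independent tridiagonal subproblems plus a rank-one correction $\beta\, vv^T$ with $v = e_k + e_{k+1}$. The toy matrix below is an assumption; the conquer step (the secular-equation root finding and deflation the abstract describes) is not shown.

```python
import numpy as np

def tear(T, k):
    """Return (T1, T2, beta, v) with T = blockdiag(T1, T2) + beta * v v^T,
    where beta is the off-diagonal entry at the tear point and
    v = e_k + e_{k+1}. T1 and T2 can then be solved independently."""
    n = T.shape[0]
    beta = T[k, k + 1]
    T1 = T[:k + 1, :k + 1].copy()
    T2 = T[k + 1:, k + 1:].copy()
    T1[k, k] -= beta                 # compensate diagonals for the
    T2[0, 0] -= beta                 # rank-one correction
    v = np.zeros(n)
    v[k] = v[k + 1] = 1.0
    return T1, T2, beta, v

n = 6
d = np.arange(1.0, n + 1)            # toy diagonal
e = np.full(n - 1, 0.5)              # toy off-diagonal
T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
T1, T2, beta, v = tear(T, 2)
block = np.block([[T1, np.zeros((3, 3))], [np.zeros((3, 3)), T2]])
assert np.allclose(block + beta * np.outer(v, v), T)
# Eigensystems of T1 and T2 then seed the secular-equation solve for T.
```

Because the two subproblems are independent, they can be solved on separate processors and the tearing applied recursively, which is how the scheme keeps the number of active parallel processes from shrinking as the computation proceeds.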