Catalogue Search | MBRL

Efficient digital implementation of a multi-precision square-root algorithm

by Beasley, Alexander E. , Clarke, Christopher T. , Watson, Robert J. in Accuracy , Algorithms , Approximation

2019

In high performance computing systems and signal processing, there is a basic set of mathematical functions that are essential. While addition, subtraction and multiplication are well understood, there is less literature on square-rooting, which is a particularly time- and resource-consuming function. Traditional non-restoring algorithms produce a mantissa half the length of the input mantissa, causing a loss of precision. This study presents a method for increasing the accuracy of this algorithm. It is shown to work for all IEEE-754R standard floating-point numbers. Error analysis shows a 57-fold (for half-precision) and 134e6-fold improvement (for double-precision) in the normalised error, equivalent to at most 1 Units of Least Precision. Resource and performance optimised variants are analysed and their throughput analysed. On an Intel Stratix V device, performance optimised implementations achieve a throughput of 717 MFLOPs. Resource optimised implementations on a low-cost device require only 127 Adaptive Logic Modules and 232 registers, with a throughput of 8.56 MFLOPs. All implementations are DSP block and memory free, saving valuable resources. The maximum throughput of the presented design is 15.5 times greater than that proposed by Pimentel et al. and two orders of magnitude greater than typical multiply-accumulate methods.

Journal Article

Share this book

Add to My Shelf

Numerics of High Performance Computers and Benchmark Evaluation of Distributed Memory Computers

by Singh, K. P. , Krishna, H. S.

2004

The internal representation of numerical data, their speed of manipulation to generate the desired result through efficient utilisation of central processing unit, memory, and communication links are essential steps of all high performance scientific computations. Machine parameters, in particular, reveal accuracy and error bounds of computation, required for performance tuning of codes. This paper reports diagnosis of machine parameters, measurement of computing power of several workstations, serial and parallel computers, and a component-wise test procedure for distributed memory computers. Hierarchical memory structure is illustrated by block copying and unrolling techniques. Locality of reference for cache reuse of data is amply demonstrated by fast Fourier transform codes. Cache and register-blocking technique results in their optimum utilisation with consequent gain in throughput during vector-matrix operations. Implementation of these memory management techniques reduces cache inefficiency loss, which is known to be proportional to the number of processors. Of the two Linux clusters-ANUP16, HPC22 and HPC64, it has been found from the measurement of intrinsic parameters and from application benchmark of multi-block Euler code test run that ANUP16 is suitable for problems that exhibit fine-grained parallelism. The delivered performance of ANUP16 is of immense utility for developing high-end PC clusters like HPC64 and customised parallel computers with added advantage of speed and high degree of parallelism.

Journal Article

Share this book

Add to My Shelf

Language Selector

MBRLGlobalSearch

Language Selector

Catalogue Search | MBRL

Search Results Heading

Explore the vast range of titles available.

MBRLSearchResults

MBRLHappinessMeter