Catalogue Search | MBRL
Explore the vast range of titles available.
187 result(s) for "Multicore processing"
BEAGLE 3
by Swofford, David L.; Huelsenbeck, John P.; Baele, Guy
in Application programming interface; Benchmarks; Classification - methods
2019
BEAGLE is a high-performance likelihood-calculation library for phylogenetic inference. The BEAGLE library defines a simple, but flexible, application programming interface (API), and includes a collection of efficient implementations for calculation under a variety of evolutionary models on different hardware devices. The library has been integrated into recent versions of popular phylogenetics software packages including BEAST and MrBayes and has been widely used across a diverse range of evolutionary studies. Here, we present BEAGLE 3 with new parallel implementations, increased performance for challenging data sets, improved scalability, and better usability. We have added new OpenCL and central processing unit-threaded implementations to the library, allowing the effective utilization of a wider range of modern hardware. Further, we have extended the API and library to support concurrent computation of independent partial likelihood arrays, for increased performance of nucleotide-model analyses with greater flexibility of data partitioning. For better scalability and usability, we have improved how phylogenetic software packages use BEAGLE in multi-GPU (graphics processing unit) and cluster environments, and introduced an automated method to select the fastest device given the data set, evolutionary model, and hardware. For application developers who wish to integrate the library, we also have developed an online tutorial. To evaluate the effect of the improvements, we ran a variety of benchmarks on state-of-the-art hardware. For a partitioned exemplar analysis, we observe run-time performance improvements as high as 5.9-fold over our previous GPU implementation. BEAGLE 3 is free, open-source software licensed under the Lesser GPL and available at https://beagle-dev.github.io.
Journal Article
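The core operation BEAGLE accelerates, computing partial likelihood arrays at internal tree nodes, can be illustrated with a minimal NumPy sketch of Felsenstein's pruning step under a Jukes-Cantor-like model. This is not BEAGLE's API; all function names, branch lengths, and the toy two-site alignment are illustrative.

```python
import numpy as np

def partial_likelihood(child_partials, transition_mats):
    """Combine child partial-likelihood arrays at an internal node
    (Felsenstein's pruning step). Each child contributes, per site,
    the probability of its subtree conditioned on the parent state."""
    # child_partials: list of (n_sites, n_states) arrays
    # transition_mats: list of (n_states, n_states) arrays, P[i, j] = P(i -> j)
    result = np.ones_like(child_partials[0])
    for L, P in zip(child_partials, transition_mats):
        result *= L @ P.T   # per site: sum_j P[i, j] * L[site, j]
    return result

def jc_matrix(t, mu=1.0):
    """Jukes-Cantor transition probability matrix for branch length t."""
    p_same = 0.25 + 0.75 * np.exp(-4.0 * mu * t / 3.0)
    p_diff = 0.25 - 0.25 * np.exp(-4.0 * mu * t / 3.0)
    return np.full((4, 4), p_diff) + np.eye(4) * (p_same - p_diff)

leaf_A = np.eye(4)[[0, 2]]   # two sites observed as states A and G
leaf_B = np.eye(4)[[0, 3]]   # two sites observed as states A and T
node = partial_likelihood([leaf_A, leaf_B], [jc_matrix(0.1), jc_matrix(0.2)])
site_lik = (node * 0.25).sum(axis=1)   # root with uniform base frequencies
```

Because each site's partial likelihood is independent, these arrays map naturally onto the GPU, OpenCL, and CPU-threaded backends the abstract describes.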
A High Performance Parallel Approach to Delay Sum Beamformer in a Homogeneous Multicore Environment
2024
A Cache-Aware Beamformer (CABF) algorithm for the delay-and-sum (DAS) beamformer in a homogeneous multicore processor environment is presented. The case for a refined multicore implementation of the beamformer is established in the context of a sonar application. The algorithm is designed, implemented, and compared to a regular pthread-based multicore implementation and a standard OpenMP-based implementation, using arithmetic intensity as the metric. FMA implementations of the algorithms are carried out, and the CABF algorithm is shown to achieve better arithmetic intensity. A 6000-element array forming 200 simultaneous beams is designed to test the efficacy of CABF on a multicore platform. The results show a 73% increase in GFLOPS for FMA operations. The performance of the beamformer for different data sizes is studied, and on average a 36% improvement in computational performance is achieved compared to the OpenMP-based implementation.
Journal Article
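The baseline delay-and-sum computation that this entry optimizes can be sketched in a few lines of NumPy: delay each element's signal so a plane wave from the steering direction aligns across the array, then sum. This is only the textbook DAS formulation, not the paper's cache-aware variant; the array geometry and sampling parameters below are illustrative.

```python
import numpy as np

def delay_and_sum(signals, fs, element_pos, c, theta):
    """Delay-and-sum (DAS) beamformer for a uniform linear array.

    signals:     (n_elements, n_samples) sensor time series
    fs:          sampling rate in Hz
    element_pos: (n_elements,) element positions along the array in metres
    c:           propagation speed in m/s
    theta:       steering angle in radians (0 = broadside)
    """
    n_elements, n_samples = signals.shape
    # Per-element time delay that aligns a plane wave from direction theta.
    delays = element_pos * np.sin(theta) / c       # seconds
    shifts = np.round(delays * fs).astype(int)     # integer-sample delays
    shifts -= shifts.min()                         # keep shifts non-negative
    out = np.zeros(n_samples)
    for sig, s in zip(signals, shifts):
        out[: n_samples - s] += sig[s:]            # shift, then accumulate
    return out / n_elements

# Toy use: a 1 kHz plane wave arriving at 30 degrees on an 8-element array.
fs, c, f0 = 48_000.0, 1500.0, 1000.0
pos = np.arange(8) * (c / f0 / 2)                  # half-wavelength spacing
t = np.arange(1024) / fs
true_delays = pos * np.sin(np.radians(30)) / c
signals = np.stack([np.sin(2 * np.pi * f0 * (t - d)) for d in true_delays])
beam = delay_and_sum(signals, fs, pos, c, np.radians(30))
```

Steering to the true arrival angle yields a near-unit-amplitude output, while a mis-steered beam largely cancels; the inner shift-and-accumulate loop is the memory-bound kernel whose access pattern a cache-aware layout targets.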
A Parallel Approach to Enhance the Performance of Supervised Machine Learning Realized in a Multicore Environment
2024
Machine learning models play a critical role in applications such as image recognition, natural language processing, and medical diagnosis, where accuracy and efficiency are paramount. As datasets grow in complexity, so too do the computational demands of classification techniques. Previous research has achieved high accuracy but required significant computational time. This paper proposes a parallel architecture for Ensemble Machine Learning Models, harnessing multicore CPUs to expedite performance. The primary objective is to enhance machine learning efficiency without compromising accuracy through parallel computing. This study focuses on benchmark ensemble models including Random Forest, XGBoost, ADABoost, and K Nearest Neighbors. These models are applied to tasks such as wine quality classification and fraud detection in credit card transactions. The results demonstrate that, compared to single-core processing, machine learning tasks run 1.7 times and 3.8 times faster for small and large datasets on quad-core CPUs, respectively.
Journal Article
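The speedups described here come from splitting independent work across cores. A minimal sketch of that data-parallel pattern, applied to one of the models the paper benchmarks (k-nearest neighbors), partitions the query set into chunks and classifies each chunk on its own worker. The dataset, chunking, and use of a thread pool (rather than processes) are illustrative simplifications, not the paper's architecture.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def knn_predict(train_X, train_y, queries, k=3):
    """Classify each query by majority vote of its k nearest training points."""
    d = np.linalg.norm(queries[:, None, :] - train_X[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = train_y[nearest]
    return np.array([np.bincount(v).argmax() for v in votes])

def parallel_knn_predict(train_X, train_y, queries, k=3, n_workers=4):
    """Data-parallel variant: split the query set into chunks and classify
    each chunk on its own worker, mirroring the multicore partitioning idea."""
    chunks = np.array_split(queries, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        parts = ex.map(lambda q: knn_predict(train_X, train_y, q, k), chunks)
    return np.concatenate(list(parts))

# Toy two-cluster dataset.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
serial = knn_predict(X, y, X)
parallel = parallel_knn_predict(X, y, X)
```

Because each chunk is classified independently, the parallel result is identical to the serial one; on CPU-bound workloads one would use processes (or a library's `n_jobs`-style parameter) to occupy multiple cores.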
ORGs for Scalable, Robust, Privacy-Friendly Client Cloud Computing
2008
The advent of multicore architecture stands to transform cloud computing in terms of scalability, robustness, and privacy. Social systems offer promising metaphors to address these issues. This column presents an approach based on organizations of restricted generality (ORGs), which are analogous to human organizations.
Journal Article
Swarm Verification Techniques
2011
The range of verification problems that can be solved with logic model checking tools has increased significantly in the last few decades. This increase in capability is based on algorithmic advances and new theoretical insights, but it has also benefitted from the steady increase in processing speeds and main memory sizes on standard computers. The steady increase in processing speeds, though, ended when chip-makers started redirecting their efforts to the development of multicore systems. For the near-term future, we can anticipate the appearance of systems with large numbers of CPU cores, but without matching increases in clock-speeds. We will describe a model checking strategy that can allow us to leverage this trend and that allows us to tackle significantly larger problem sizes than before.
Journal Article
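The swarm idea, many small, diversified searches run in parallel, each covering a different slice of the state space, can be sketched with randomized depth-first searches over a toy transition system. This is only a conceptual illustration of diversification by seed; the state space and limits are invented for the example.

```python
import random

def randomized_dfs(successors, start, seed, max_states=200):
    """Depth-limited DFS that shuffles successor order with its own seed,
    so each swarm member explores a different slice of the state space."""
    rng = random.Random(seed)
    visited, stack = {start}, [start]
    while stack and len(visited) < max_states:
        state = stack.pop()
        succ = successors(state)
        rng.shuffle(succ)                 # diversification: per-member order
        for s in succ:
            if s not in visited:
                visited.add(s)
                stack.append(s)
    return visited

def swarm_search(successors, start, n_members=8, max_states=200):
    """Union of coverage over independently seeded, resource-capped searches."""
    covered = set()
    for seed in range(n_members):
        covered |= randomized_dfs(successors, start, seed, max_states)
    return covered

# Toy state space: two interleaved bounded counters, states (a, b).
def successors(state):
    a, b = state
    return [(a + 1, b), (a, b + 1)] if a < 40 and b < 40 else []

single = randomized_dfs(successors, (0, 0), seed=0, max_states=200)
swarm = swarm_search(successors, (0, 0), n_members=8, max_states=200)
```

No single capped search covers everything, but the union over differently seeded members covers strictly more, which is the leverage swarm verification gets from many cores without needing shared memory.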
Improving Multicore Architectures by Selective Value Prediction of High-Latency Arithmetic Instructions
2024
This work implements and evaluates a selective value predictor in a multicore environment, focusing on long-latency arithmetic instructions, with the goal of breaking the dataflow bottleneck of each core and thereby increasing overall performance. The Sniper simulator was used to augment the Intel Nehalem architecture with a value predictor and to estimate computing performance, integration area, power consumption, energy efficiency, and chip temperature for the enhanced architecture. We ran simulations and studied the impact of the number of values used for prediction per instruction. By increasing the history length, we measured on average a more than 3% increase in performance (core speed-up), a reduction in chip temperature from 57.8 °C to 56.17 °C, and lower energy consumption in most cases compared with the baseline configuration. We also compared value prediction with dynamic instruction reuse under equitable conditions (exploiting the same value locality), highlighting the advantages and disadvantages of each technique in the given context.
Journal Article
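The mechanism at the heart of this entry, predicting the result of a long-latency instruction only when past behavior justifies it, can be sketched as a last-value predictor with a saturating confidence counter. This is a generic textbook scheme, not the paper's Sniper/Nehalem configuration; table sizes, thresholds, and the program counter values are illustrative.

```python
class LastValuePredictor:
    """Per-instruction last-value predictor with a saturating confidence
    counter; it predicts only above a confidence threshold (the
    'selective' part of the scheme)."""
    def __init__(self, threshold=2, max_conf=3):
        self.table = {}          # pc -> [last_value, confidence]
        self.threshold = threshold
        self.max_conf = max_conf

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry and entry[1] >= self.threshold:
            return entry[0]      # confident: speculate with the last value
        return None              # not confident: wait for the real result

    def update(self, pc, actual):
        entry = self.table.setdefault(pc, [actual, 0])
        if entry[0] == actual:
            entry[1] = min(entry[1] + 1, self.max_conf)  # reinforce
        else:
            entry[0], entry[1] = actual, 0               # mispredict: reset

p = LastValuePredictor()
for _ in range(4):               # a long-latency divide that keeps
    p.update(0x400, 7)           # producing the same result
confident = p.predict(0x400)     # high confidence: speculate with 7
p.update(0x400, 9)               # value changes: confidence resets
reset = p.predict(0x400)         # below threshold: no speculation
```

A correct prediction lets dependent instructions issue before the divide completes; the confidence gate limits costly pipeline squashes on mispredictions, which is why selectivity matters for the energy numbers the abstract reports.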
Hardware-Efficient VLSI Design for Cascade Support Vector Machine with On-Chip Training and Classification Capability
2020
Local processing of machine learning algorithms like support vector machine (SVM) is preferred over the cloud for many real-time embedded applications. However, such embedded systems often have stringent energy constraints besides throughput and accuracy requirements. Hence, hardware-efficient design to compute SVM is critical to enable these applications. In this paper, a hardware-efficient SVM learning unit is proposed using reduced number of multiplications and approximate computing techniques. These design techniques helped the learning unit to achieve 46.97% and 35.72% reductions in area and power when compared with those of the design using full multipliers. The proposed SVM learning unit supports on-chip training and classification. Energy-efficient dual-core, quad-core and octa-core cascade SVM systems were developed using the proposed SVM learning unit to expedite the on-chip training process. The runtime and energy efficiency of the cascade SVM systems improved with an increase in the number of cores. Interestingly, an average speedup of 421x in training time and a remarkable energy reduction of 24,497x were observed for the octa-core cascade SVM system when compared with the software SVM solution running on Intel Core i5-5257U processor. Moreover, the proposed octa-core cascade SVM system showed 73.75% and 65.78% lower area and power, respectively, than those of state-of-the-art cascade SVM architecture.
Journal Article
Performance Comparison of Different OpenCL Implementations of LBM Simulation on Commodity Computer Hardware
2022
Parallel programming is increasingly used to improve the performance of numerical methods for scientific computing. Numerical methods in the field of fluid dynamics require a large number of operations per second. One method that is easily parallelized and often used is the Lattice Boltzmann method (LBM). Today, simulations of numerical methods can be performed not only on high-performance computers (HPC) but also on commodity computers. This paper presents how to accelerate an LBM implementation on commodity computers using features of the OpenCL specification. The simulation is executed simultaneously on multiple heterogeneous devices. Four different approaches for several commodity computer configurations are presented. The obtained results are compared across different types of commodity computers, and advantages and disadvantages are discussed. The paper identifies which of the four LBM OpenCL implementations shows the best simulation performance and should be used when solving similar CFD problems.
Journal Article
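Why LBM parallelizes so well is visible in a single time step: the collision is purely local to each cell and the streaming is a fixed-neighbor shift. A minimal NumPy sketch of one D2Q9 BGK collide-and-stream step (plain Python rather than OpenCL, with an invented grid size and relaxation time) illustrates the structure each OpenCL kernel would implement per cell.

```python
import numpy as np

# D2Q9 lattice: 9 discrete velocities and their weights.
c = np.array([[0,0],[1,0],[0,1],[-1,0],[0,-1],[1,1],[-1,1],[-1,-1],[1,-1]])
w = np.array([4/9] + [1/9]*4 + [1/36]*4)

def equilibrium(rho, u):
    """BGK equilibrium distribution per cell (rho: (ny,nx), u: (ny,nx,2))."""
    cu = np.einsum('qd,yxd->qyx', c, u)
    usq = np.einsum('yxd,yxd->yx', u, u)
    return w[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)

def lbm_step(f, tau=0.8):
    """One collide-and-stream step with periodic boundaries."""
    rho = f.sum(axis=0)                                # density per cell
    u = np.einsum('qd,qyx->yxd', c, f) / rho[..., None]  # velocity per cell
    f = f + (equilibrium(rho, u) - f) / tau            # local BGK collision
    for q in range(9):                                 # streaming: shift only
        f[q] = np.roll(f[q], shift=(c[q, 1], c[q, 0]), axis=(0, 1))
    return f

ny, nx = 16, 16
f = equilibrium(np.ones((ny, nx)), np.zeros((ny, nx, 2)))
f[:, ny // 2, nx // 2] *= 1.1        # small density perturbation
mass0 = f.sum()
for _ in range(10):
    f = lbm_step(f)
```

Both phases conserve mass exactly, a useful sanity check, and neither phase carries a data dependence between distant cells, so the grid can be tiled across devices, which is exactly the multi-device decomposition the paper explores.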
The Design of a Multicore Extension of the SPIN Model Checker
by Holzmann, G.J.; Bosnacki, D.
in Algorithm design and analysis; Algorithms; Central Processing Unit
2007
We describe an extension of the SPIN model checker for use on multicore shared-memory systems and report on its performance. We show how, with proper load balancing, the time requirements of a verification run can, in some cases, be reduced close to N-fold when N processing cores are used. We also analyze the types of verification problems for which multicore algorithms cannot provide relief. The extensions discussed here require only relatively small changes in the SPIN source code and are compatible with most existing verification modes such as partial order reduction, the verification of temporal logic formulas, bitstate hashing, and hash-compact compression.
Journal Article
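The shared-memory parallel exploration this entry describes can be sketched as multicore reachability: workers share a visited set and a work queue, and terminate only when no queued or in-flight state remains. This toy Python version (threads and a condition variable, an invented two-counter state space) only illustrates the coordination pattern, not SPIN's actual algorithm or load balancing.

```python
import threading
from collections import deque

def parallel_reachability(successors, start, n_workers=4):
    """Explore all states reachable from `start` with several workers
    sharing one visited set and one work queue under a condition variable.
    Workers exit only when the queue is empty AND no expansion is in flight."""
    visited = {start}
    queue = deque([start])
    in_flight = 0
    cond = threading.Condition()

    def worker():
        nonlocal in_flight
        while True:
            with cond:
                while not queue and in_flight > 0:
                    cond.wait()              # others may still produce work
                if not queue:
                    cond.notify_all()        # truly done: wake sleepers
                    return
                state = queue.popleft()
                in_flight += 1
            succs = successors(state)        # expand outside the lock
            with cond:
                for s in succs:
                    if s not in visited:
                        visited.add(s)
                        queue.append(s)
                in_flight -= 1
                cond.notify_all()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return visited

# Toy state space: two interleaved bounded counters, states (a, b).
def transitions(state):
    a, b = state
    return [(a + 1, b), (a, b + 1)] if a < 10 and b < 10 else []

states = parallel_reachability(transitions, (0, 0))
```

Expanding successors outside the lock is where the parallel speedup comes from; the serialized visited-set updates are the contention point that real model checkers mitigate with lock-free or partitioned hash tables.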
Measurement-Based Power Optimization Technique for OpenCV on Heterogeneous Multicore Processor
2019
Today's embedded systems often run computer-vision applications under timing and power constraints. Since it is not simple to capture the correspondence between the application and a model, the model-based design approach is generally not applicable to the optimization of computer-vision applications. Thus, in this paper, we propose a measurement-based optimization technique for an open-source computer-vision library, OpenCV, on top of a heterogeneous multicore processor. The proposed technique consists of two sub-systems: the optimization engine running on a separate host PC, and the measurement library running on the target board. The effectiveness of the proposed optimization technique has been verified in a case study of latency-power co-optimization using two OpenCV applications, Canny edge detection and SqueezeNet. It has been shown that the proposed technique not only enables broader design-space exploration, but also improves the optimality of the result.
Journal Article
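The measurement-based loop this entry proposes, measure each candidate configuration on the target rather than predict it from a model, then keep the fastest one that meets the power constraint, can be sketched generically. Everything below (the workload, the block-size knob, the linear power proxy, the budget) is an invented stand-in, not the paper's optimization engine or OpenCV setup.

```python
import time
import itertools

def measure_latency(workload, config, repeats=3):
    """Run the workload under one configuration and return the best
    observed wall-clock latency (a measurement, not a model estimate)."""
    best = float('inf')
    for _ in range(repeats):
        t0 = time.perf_counter()
        workload(**config)
        best = min(best, time.perf_counter() - t0)
    return best

def explore(workload, space, power_model, power_budget):
    """Measure every configuration in the design space and keep the
    fastest one whose estimated power fits within the budget."""
    keys = list(space)
    best_cfg, best_lat = None, float('inf')
    for values in itertools.product(*space.values()):
        cfg = dict(zip(keys, values))
        if power_model(cfg) > power_budget:
            continue                     # violates the power constraint
        lat = measure_latency(workload, cfg)
        if lat < best_lat:
            best_cfg, best_lat = cfg, lat
    return best_cfg, best_lat

# Toy workload whose cost depends on a block-size tuning knob.
def workload(block_size):
    total = 0
    for start in range(0, 4096, block_size):
        total += sum(range(start, min(start + block_size, 4096)))
    return total

space = {'block_size': [16, 64, 256, 1024]}
power = lambda cfg: cfg['block_size'] * 0.01   # illustrative power proxy
cfg, lat = explore(workload, space, power, power_budget=5.0)
```

Running the measurement on the target board while the search logic runs on a host PC, as the paper's two sub-systems do, keeps the instrumented device free of the exploration overhead.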