Catalogue Search | MBRL

Low-level programming : C, assembly, and program execution on Intel 64 architecture

by Zhirkov, Igor, author in Computer programming, Programming languages (Electronic computers), C (Computer program language)

Book

Revisiting the performance optimization of QR factorization on Intel KNL and SKL multiprocessors

by Rizwan, Muhammad; Jung, Enoch; Choi, Jaeyoung in Algorithms, Compilers, Computer Science

2024

This study focused on the optimization of double-precision general matrix–matrix multiplication (DGEMM) routine to improve the QR factorization performance. By replacing the MKL DGEMM with our previously developed blocked matrix–matrix multiplication routine, we found that the QR factorization performance was suboptimal due to a bottleneck in the Aᵀ · B matrix–panel multiplication operation. We present an investigation of the limitations of our matrix–matrix multiplication routine. It was found that the performance of the matrix multiplication routine depends on the shape and size of the matrices. Therefore, we recommend different kernels tailored to matrix shapes involved in QR factorization and developed a new routine for the Aᵀ · B matrix–panel multiplication operation. We demonstrated the performance of the proposed kernels on the ScaLAPACK QR factorization routine by comparing them with the MKL, OPENBLAS, and BLIS libraries. Our proposed optimization demonstrates significant performance improvements in the multinode cluster environments of the Intel Xeon Phi Processor 7250 codenamed Knights Landing (KNL) and Intel Xeon Gold 6148 Scalable Skylake Processor (SKL).

Journal Article

First Impressions of the Sapphire Rapids Processor with HBM for Scientific Workloads

by Chheda, Smeet; Coskun, Firat; Siegmann, Eva in Bandwidths, Benchmarks, Computation

2024

The landscape of high performance computing (HPC) has witnessed exponential growth in processor diversity, architectural complexity, and performance scalability. With an ever-increasing demand for faster and more efficient computing solutions to address an array of scientific, engineering, and societal challenges, the selection of processors for specific applications becomes paramount. Achieving optimal performance requires a deep understanding of how diverse processors interact with diverse workloads, making benchmarking a fundamental practice in the field of HPC. Here, we present preliminary results observed over such benchmarks and applications and a comparison of Intel Sapphire Rapids and Skylake-X, AMD Milan, and Fujitsu A64FX processors in terms of runtime performance, memory bandwidth utilization, and energy consumption. The examples focus specifically on the Sapphire Rapids processor with and without high-bandwidth memory (HBM). An additional case study reports the performance gains from using Intel’s Advanced Matrix Extensions (AMX) instructions, and how they along with HBM can be leveraged to accelerate AI workloads. These initial results aim to give a rough comparison of the processors rather than a detailed analysis and should prove timely and relevant for researchers who may be interested in using Sapphire Rapids for their scientific workloads.

Journal Article

SWIMM 2.0: Enhanced Smith–Waterman on Intel’s Multicore and Manycore Architectures Based on AVX-512 Vector Extensions

by Carlos Garcia Sanchez; Naiouf, Marcelo; De Giusti, Armando in Energy consumption, Energy management, Microprocessors

2019

The well-known Smith–Waterman (SW) algorithm is the most commonly used method for local sequence alignments, but its acceptance is limited by the computational requirements for large protein databases. Although the acceleration of SW has already been studied on many parallel platforms, there are hardly any studies which take advantage of the latest Intel architectures based on AVX-512 vector extensions. This SIMD set is currently supported by Intel’s Knights Landing (KNL) accelerator and Intel’s Skylake (SKL) general purpose processors. In this paper, we present an SW version that is optimized for both architectures: the renowned SWIMM 2.0. The novelty of this vector instruction set requires the revision of previous programming and optimization techniques. SWIMM 2.0 is based on massive multi-threading and SIMD exploitation. It is competitive in terms of performance compared with other state-of-the-art implementations, reaching 511 GCUPS on a single KNL node and 734 GCUPS on a server equipped with a dual SKL processor. Moreover, these performance rates make SWIMM 2.0 the most energy-efficient implementation in this study, achieving 2.94 GCUPS/Watt on the SKL processor.

Journal Article

Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors

by Shalf, John; Datta, Kaushik; Kamil, Shoaib in Algorithmics. Computability. Computer arithmetics, Applied sciences, Architectural models

2009

Stencil-based kernels constitute the core of many important scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. In this paper, we explore the impact of trends in memory subsystems on a variety of stencil optimization techniques and develop performance models to analytically guide our optimizations. Our work targets cache reuse methodologies across single and multiple stencil sweeps, examining cache-aware algorithms as well as cache-oblivious techniques on the Intel Itanium2, AMD Opteron, and IBM Power5. Additionally, we consider stencil computations on the heterogeneous multicore design of the Cell processor, a machine with an explicitly managed memory hierarchy. Overall our work represents one of the most extensive analyses of stencil optimizations and performance modeling to date. Results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache-blocking optimizations. We also show that a cache-aware implementation is significantly faster than a cache-oblivious approach, while the explicitly managed memory on Cell enables the highest overall efficiency: Cell attains 88% of algorithmic peak while the best competing cache-based processor achieves only 54% of algorithmic peak performance.

Journal Article

Comparing quantum annealing and spiking neuromorphic computing for sampling binary sparse coding QUBO problems

by Pelofske, Elijah; Kenyon, Garrett; Hahn, Georg in Algorithms, Approximation, Binary codes

2025

We consider the problem of computing a sparse binary representation of an image. Given an image and an overcomplete, non-orthonormal basis, we aim to find a sparse binary vector indicating the minimal set of basis vectors that when added together best reconstruct the given input. We formulate this problem with an L2 loss on the reconstruction error, and an L0 loss on the binary vector enforcing sparsity. First, we solve the sparse representation QUBOs on both a D-Wave quantum annealer with Pegasus chip connectivity and the Intel Loihi 2 spiking neuromorphic processor using a stochastic Non-equilibrium Boltzmann Machine (NEBM). Second, we use Quantum Evolution Monte Carlo with Reverse Annealing and iterated warm starting on Loihi 2 to evolve the solution quality from the respective machines. We demonstrate that both quantum annealing and neuromorphic computing are suitable for solving binary sparse coding QUBOs.

Journal Article

Benchmarking Performance of a Hybrid Intel Xeon/Xeon Phi System for Parallel Computation of Similarity Measures Between Large Vectors

by Czarnul, Paweł in Benchmarks, CD burners, Central processing units

2017

The paper deals with parallelization of computing similarity measures between large vectors. Such computations are important components within many applications and consequently are of high importance. Rather than focusing on optimization of the algorithm itself, assuming specific measures, the paper assumes a general scheme for finding similarity measures for all pairs of vectors and investigates optimizations for scalability in a hybrid Intel Xeon/Xeon Phi system. Hybrid systems including multicore CPUs and many-core compute devices such as Intel Xeon Phi allow parallelization of such computations using vectorization but require proper load balancing and optimization techniques. The proposed implementation uses C/OpenMP with the offload mode to Xeon Phi cards. Several results are presented: execution times for various partitioning parameters such as batch sizes of vectors being compared, impact of dynamic adjustment of batch size, overlapping computations and communication. Execution times for comparison of all pairs of vectors are presented as well as those for which similarity measures account for a predefined threshold. The latter makes load balancing more difficult and is used as a benchmark for the proposed optimizations. Results are presented for the native mode on an Intel Xeon Phi, CPU only and the CPU + offload mode for a hybrid system with 2 Intel Xeons with 20 physical cores and 40 logical processors and 2 Intel Xeon Phis with a total of 120 physical cores and 480 logical processors.

Journal Article

Explicit Fourth-Order Runge–Kutta Method on Intel Xeon Phi Coprocessor

by Potiopa, Joanna; Bylina, Beata in CD burners, Computer architecture, Computer Science

2017

This paper concerns an Intel Xeon Phi implementation of the explicit fourth-order Runge–Kutta method (RK4) for very sparse matrices with very short rows. Such matrices arise during Markovian modeling of computer and telecommunication networks. In this work an implementation based on Intel Math Kernel Library (Intel MKL) routines and the authors’ own implementation, both using the CSR storage scheme and working on Intel Xeon Phi, were investigated. The implementation based on the Intel MKL library uses the high-performance BLAS and Sparse BLAS routines. In our application we focus on OpenMP-style programming. We implement the SpMV operation and vector addition using basic optimization techniques and vectorization. We evaluate our approach in native and offload modes for various numbers of cores and thread allocation affinities. Both implementations (based on Intel MKL and made by the authors) were compared with respect to time, speedup, and performance. The numerical experiments on Intel Xeon Phi show that the performance of the authors’ implementation is very promising and gives a gain of up to two times compared to the multithreaded implementation (based on Intel MKL) running on CPU (Intel Xeon processor) and even three times in comparison with the application which uses Intel MKL on Intel Xeon Phi.

Journal Article

High-performance workstation based on Intel architecture

by Trybus, Damian in Computer science, Electrical engineering

1998

The performance of personal computers has been increasing steadily since their introduction in the early eighties. The personal computers available today employ technologies that not long ago were used exclusively on supercomputers and mainframes. One of the most recent developments in personal computer technology is the use of multiprocessors to increase computing power. However, since the increase in computing power is known not to be directly proportional to the number of processors, the performance gain from additional processors is difficult to predict. Many researchers claim that personal computer technology, based on Intel x86 design, has scaling problems when multiple processors are used. Computer manufacturers so far have failed to provide reliable information related to scaling problems. In order to analyze the performance gain from an additional processor in a personal computer and to study the performance dependence of the computer architecture on the operating system, a dual processor computer system was designed and built from the functional blocks available on the market. The fast Fourier transform (FFT) algorithm was adapted and modified to be used as the computation-intensive performance indicator. The FFT algorithm was run on the designed two-processor system under Windows NT Workstation version 4.0 and Linux version 2.0, with and without the use of threading techniques, and the performance gain due to the second processor was determined. It was observed that the addition of the second processor resulted in a maximum performance increase of 80 percent under Windows NT and 60 percent under Linux when threading techniques were used. Furthermore, it was observed that without the use of threading, the addition of the second processor, under specific conditions, impaired the system's performance. Details of the work are presented in the thesis.

Dissertation
