Search Results Heading

MBRLSearchResults

mbrl.module.common.modules.added.book.to.shelf
Title added to your shelf!
View what I already have on My Shelf.
Oops! Something went wrong.
Oops! Something went wrong.
While trying to add the title to your shelf something went wrong :( Kindly try again later!
Are you sure you want to remove the book from the shelf?
Oops! Something went wrong.
Oops! Something went wrong.
While trying to remove the title from your shelf something went wrong :( Kindly try again later!
    Done
    Filters
    Reset
  • Discipline
      Discipline
      Clear All
      Discipline
  • Is Peer Reviewed
      Is Peer Reviewed
      Clear All
      Is Peer Reviewed
  • Reading Level
      Reading Level
      Clear All
      Reading Level
  • Content Type
      Content Type
      Clear All
      Content Type
  • Year
      Year
      Clear All
      From:
      -
      To:
  • More Filters
      More Filters
      Clear All
      More Filters
      Item Type
    • Is Full-Text Available
    • Subject
    • Publisher
    • Source
    • Donor
    • Language
    • Place of Publication
    • Contributors
    • Location
15 result(s) for "Intel microprocessors Programming."
Sort by:
Revisiting the performance optimization of QR factorization on Intel KNL and SKL multiprocessors
This study focused on the optimization of double-precision general matrix–matrix multiplication (DGEMM) routine to improve the QR factorization performance. By replacing the MKL DGEMM with our previously developed blocked matrix–matrix multiplication routine, we found that the QR factorization performance was suboptimal due to a bottleneck in the A T · B matrix–panel multiplication operation. We present an investigation of the limitations of our matrix–matrix multiplication routine. It was found that the performance of the matrix multiplication routine depends on the shape and size of the matrices. Therefore, we recommend different kernels tailored to matrix shapes involved in QR factorization and developed a new routine for the A T · B matrix–panel multiplication operation. We demonstrated the performance of the proposed kernels on the ScaLAPACK QR factorization routine by comparing them with the MKL, OPENBLAS, and BLIS libraries. Our proposed optimization demonstrates significant performance improvements in the multinode cluster environments of the Intel Xeon Phi Processor 7250 codenamed Knights Landing (KNL) and Intel Xeon Gold 6148 Scalable Skylake Processor (SKL).
First Impressions of the Sapphire Rapids Processor with HBM for Scientific Workloads
The landscape of high performance computing (HPC) has witnessed exponential growth in processor diversity, architectural complexity, and performance scalability. With an ever-increasing demand for faster and more efficient computing solutions to address an array of scientific, engineering, and societal challenges, the selection of processors for specific applications becomes paramount. Achieving optimal performance requires a deep understanding of how diverse processors interact with diverse workloads, making benchmarking a fundamental practice in the field of HPC. Here, we present preliminary results observed over such benchmarks and applications and a comparison of Intel Sapphire Rapids and Skylake-X, AMD Milan, and Fujitsu A64FX processors in terms of runtime performance, memory bandwidth utilization, and energy consumption. The examples focus specifically on the Sapphire Rapids processor with and without high-bandwidth memory (HBM). An additional case study reports the performance gains from using Intel’s Advanced Matrix Extensions (AMX) instructions, and how they along with HBM can be leveraged to accelerate AI workloads. These initial results aim to give a rough comparison of the processors rather than a detailed analysis and should prove timely and relevant for researchers who may be interested in using Sapphire Rapids for their scientific workloads.
SWIMM 2.0: Enhanced Smith–Waterman on Intel’s Multicore and Manycore Architectures Based on AVX-512 Vector Extensions
The well-known Smith–Waterman (SW) algorithm is the most commonly used method for local sequence alignments, but its acceptance is limited by the computational requirements for large protein databases. Although the acceleration of SW has already been studied on many parallel platforms, there are hardly any studies which take advantage of the latest Intel architectures based on AVX-512 vector extensions. This SIMD set is currently supported by Intel’s Knights Landing (KNL) accelerator and Intel’s Skylake (SKL) general purpose processors. In this paper, we present an SW version that is optimized for both architectures: the renowned SWIMM 2.0. The novelty of this vector instruction set requires the revision of previous programming and optimization techniques. SWIMM 2.0 is based on a massive multi-threading and SIMD exploitation. It is competitive in terms of performance compared with other state-of-the-art implementations, reaching 511 GCUPS on a single KNL node and 734 GCUPS on a server equipped with a dual SKL processor. Moreover, these successful performance rates make SWIMM 2.0 the most efficient energy footprint implementation in this study achieving 2.94 GCUPS/Watts on the SKL processor.
Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors
Stencil-based kernels constitute the core of many important scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. In this paper, we explore the impact of trends in memory subsystems on a variety of stencil optimization techniques and develop performance models to analytically guide our optimizations. Our work targets cache reuse methodologies across single and multiple stencil sweeps, examining cache-aware algorithms as well as cache-oblivious techniques on the Intel Itanium2, AMD Opteron, and IBM Power5. Additionally, we consider stencil computations on the heterogeneous multicore design of the Cell processor, a machine with an explicitly managed memory hierarchy. Overall our work represents one of the most extensive analyses of stencil optimizations and performance modeling to date. Results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache-blocking optimizations. We also show that a cache-aware implementation is significantly faster than a cache-oblivious approach, while the explicitly managed memory on Cell enables the highest overall efficiency: Cell attains 88% of algorithmic peak while the best competing cache-based processor achieves only 54% of algorithmic peak performance.
Auto-tuning GEMM kernels on the Intel KNL and Intel Skylake-SP processors
The general matrix–matrix multiplication is a core building block for implementing Basic Linear Algebra Subprograms. This paper presents a methodology for automatically producing the matrix–matrix multiplication kernels tuned for the Intel Xeon Phi Processor code-named Knights Landing and the Intel Skylake-SP processors with AVX-512 intrinsic functions. The architecture of the latest manycore processors has been complicated in the levels of parallelism and cache hierarchies; it is not easy to find the best combination of optimization techniques for a given application. Our approach produces matrix multiplication kernels through a process of heuristic auto-tuning based on generating multiple kernels and selecting the fastest ones through performance tests. The tuning parameters include the size of block matrices for registers and caches, prefetch distances, and loop unrolling depth. Parameters for multithreaded execution, such as identifying loops to parallelize and the optimal number of threads for such loops are also investigated. We also present a method to reduce the parameter search space based on our previous research results.
Comparing quantum annealing and spiking neuromorphic computing for sampling binary sparse coding QUBO problems
We consider the problem of computing a sparse binary representation of an image. Given an image and an overcomplete, non-orthonormal basis, we aim to find a sparse binary vector indicating the minimal set of basis vectors that when added together best reconstruct the given input. We formulate this problem with an L 2 loss on the reconstruction error, and an L 0 loss on the binary vector enforcing sparsity. First, we solve the sparse representation QUBOs by solving them both on a D-Wave quantum annealer with Pegasus chip connectivity, as well as on the Intel Loihi 2 spiking neuromorphic processor using a stochastic Non-equilibrium Boltzmann Machine (NEBM). Second, using Quantum Evolution Monte Carlo with Reverse Annealing and iterated warm starting on Loihi 2 to evolve the solution quality from the respective machines. We demonstrate that both quantum annealing and neuromorphic computing are suitable for solving binary sparse coding QUBOs.
Out-of-core implementation for accelerator kernels on heterogeneous clouds
Cloud environments today are increasingly featuring hybrid nodes containing multicore CPU processors and a diverse mix of accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors, and Field-Programmable Gate Arrays (FPGAs) to facilitate easier migration to them of HPC workloads. While virtualization of accelerators in clouds is a leading research challenge, we address the programming challenges that assail execution of large instances of data-parallel applications using these accelerators in this paper. In a typical hybrid node in a cloud, the tight integration of accelerators with multicore CPUs via PCI-E communication links contains inherent limitations such as limited main memory of accelerators and limited bandwidth of the PCI-E communication links. These limitations poses formidable programming challenges to execution of large problem sizes on these accelerators. In this paper, we describe a library containing interfaces (HCLOOC) that addresses these challenges. It employs optimal software pipelines to overlap data transfers between host CPU and the accelerator and computations on the accelerator. It is designed using the fundamental building blocks, which are OpenCL command queues for FPGAs, Intel offload streams for Intel Xeon Phis, and CUDA streams and events that allow concurrent utilization of the copy and execution engines provided in NVidia GPUs. We elucidate the key features of our library using an out-of-core implementation of matrix multiplication of large dense matrices on a hybrid node, an Intel Haswell multicore CPU server hosting three accelerators that includes NVidia K40c GPU, Intel Xeon Phi 3120P, and a Xilinx FPGA. Based on experiments with the GPU, we show that our out-of-core implementation achieves 82% of peak double-precision floating performance of the GPU and a speedup of 2.7 times over the NVidia’s out-of-core matrix multiplication implementation (CUBLAS-XT). We also demonstrate that our implementation exhibits 0% drop in performance when the problem size exceeds the main memory of the GPU. We observe this 0% drop also for our implementation for Intel Xeon Phi and Xilinx FPGA.
Accelerating time series motif discovery in the Intel Xeon Phi KNL processor
Time series analysis is an important research topic of great interest in many fields. Recently, the Matrix Profile method, and particularly one of its implementations—the SCRIMP algorithm—has become a state-of-the-art approach in this field. This is a technique that brings the possibility of obtaining exact motifs from a time series, which can be used to infer events, predict outcomes, detect anomalies and more. However, the memory-bound nature of the SCRIMP algorithm limits the execution performance in some processor architectures. In this paper, we analyze the SCRIMP algorithm from the performance viewpoint in the context of the Intel Xeon Phi Knights Landing architecture (KNL), which integrates high-bandwidth memory (HBM) modules, and we combine several techniques aimed at exploiting the potential of this architecture. On the one hand, we exploit the multi-threading and vector capabilities of the architecture. On the other hand, we explore how to allocate data in order to take advantage of the available hybrid memory architecture that conjugates both the high-bandwidth 3D-stacked HBM and the DDR4 memory modules. The experimental evaluation shows a performance improvement up to 190× with respect to the sequential execution and that the use of the HBM memory improves performance in a factor up to 5× with respect to the DDR4 memory.
Design and implementation of an elastic processor with hyperthreading technology and virtualization for elastic server models
The server models functioning in the industry are required to be more elastic in nature. They are constantly scaling-up and scaling-down on required computation power depending on different conditions. These elastic cloud platforms use accelerators like DSP’s, TPU’s, GPU’s, FPGA’s, and multi-core processors to provide exponential computing power and outsource their services. This process is not only costly and non-efficient but is also responsible for damaging the server’s hardware architecture. Furthermore, these additions degrade the level of threading and symmetric parallel processing capability of the architecture. Intel uses hyperthreading technology (HTT) to split the workload between hardware and operating system to avoid additions, but that too is only possible up until a certain limit. This paper presents design methodology and implementation of an elastic-natured 32-bit RISC-pipelined processor inspired from Intel Xeon and MIPS to function as a standard integrated platform for server models. It implements concepts of hyperthreading technology (HTT) and virtualization on hardware basis. It will allow to derive multiple outputs from units on hardware basis to enhance security and performance without compromising compatibility. The designed elastic core uses a probabilistic node-based closed-queuing network model for server analysis and implementation. Hence, elastic behavior from individual core microarchitecture to server model architecture enables a generic automated scaling self-aware optimization architecture.