40 result(s) for "Loop tiling (Computer science)"
Looping with Disney Pixar Finding Dory
A simple, low-level, unplugged introduction to looping designed for young readers not yet ready for coding on computers. Beloved characters Dory and Nemo, from the world-famous Disney movie Finding Dory, draw in readers new to coding concepts. (Provided by publisher.)
Revisiting split tiling for stencil computations in polyhedral compilation
Complex tile shapes maximize parallelism and locality of stencil computations by enabling tile-wise concurrent start, i.e., all tiles along a particular tiling direction of the iteration space can be started concurrently. We study split tiling—a tiling technique exploiting tile-wise concurrent start at the expense of additional synchronizations, in the context of polyhedral compilation. Derived from classical parallelogram tiling, our approach first splits a parallelogram tile into multiple phases that can be executed simultaneously with those of the neighboring tiles. The technique then minimizes the amount of synchronizations by merging boundary phases of consecutive tiles along the time-tiled direction. We implement our approach on top of a well-defined polyhedral representation, generating code for both CPUs and GPUs. The experimental results on a 16-core Intel Xeon Silver show that our approach can achieve an average improvement of 2
A Methodology for Efficient Tile Size Selection for Affine Loop Kernels
Reducing the number of data accesses in memory hierarchy is of paramount importance on modern computer systems. One of the key optimizations addressing this problem is loop tiling, a well-known loop transformation that enhances data locality in memory hierarchy. The selection of an appropriate tile size is tackled by using both static (analytical) and dynamic empirical (auto-tuning) methods. Current analytical models are not accurate enough to effectively model the complex modern memory hierarchies and loop kernels with diverse characteristics, while auto-tuning methods are either too time-consuming (due to the huge search space) or less accurate (when heuristics are used to reduce the search space). In this paper, we reveal two important inefficiencies of current analytical loop tiling methods and we provide the theoretical background on how current methods can address these inefficiencies. To this end, we propose a new loop tiling method for affine loop kernels where the cache size, cache line size and cache associativity are better utilized, compared to the existing methods. Our evaluation results prove the efficiency of the proposed method in terms of cache misses and execution time, against related works, icc/gcc compilers and Pluto tool, on x86 and ARM based platforms.
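For readers unfamiliar with the transformation named in this abstract, here is a minimal sketch of loop tiling applied to matrix multiplication. It is not the tile-size selection method the paper proposes. The function names and the default `tile` value are hypothetical, chosen only for illustration; in practice the tile size would be derived from cache parameters such as cache size, line size and associativity.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A * B for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            for j in range(p):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation with all three loops tiled.

    `tile` is an illustrative parameter, not a tuned value. The outer
    three loops walk over tile origins; the inner three loops stay
    inside one tile so the touched blocks of A, B and C are reused
    before moving on.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # min(...) handles partial tiles at the matrix edges.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c
```

Both functions perform the same multiply-add operations. Only the order of iterations changes, which is why tiling affects data locality but not the result.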
Time and Energy Benefits of Using Automatic Optimization Compilers for NPDP Tasks
In this article, we analyze the program codes generated automatically using three advanced optimizers: Pluto, Traco, and Dapt, which are specifically tailored for the NPDP benchmark set. This benchmark set comprises ten program loops, predominantly from the field of bioinformatics. The codes exemplify dynamic programming, a challenging task for well-known tools used in program loop optimization. Given the intricacy involved, we opted for three automatic compilers based on the polyhedral model and various loop-tiling strategies. During our evaluation of the code’s performance, we meticulously considered locality and concurrency to accurately estimate time and energy efficiency. Notably, we dedicated significant attention to the latest Dapt compiler, which applies space–time loop tiling to generate highly efficient code for the NPDP benchmark suite loops. By employing the aforementioned optimizers and conducting an in-depth analysis, we aim to demonstrate the effectiveness and potential of automatic transformation techniques in enhancing the performance and energy efficiency of dynamic programming codes.
A methodology correlating code optimizations with data memory accesses, execution time and energy consumption
The proliferation of data and electronic devices has put software with low execution time and low energy consumption in the spotlight. The key to optimizing software is the correct choice, order, and parameters of optimization transformations, which has remained an open problem in compilation research for decades for several reasons. First, most of the transformations are interdependent and thus addressing them separately is not effective. Second, it is very hard to couple the transformation parameters to the processor architecture (e.g., cache size) and algorithm characteristics (e.g., data reuse); therefore, compiler designers and researchers either do not take them into account at all or do so only partly. Third, the exploration space, i.e., the set of all optimization configurations that have to be explored, is huge and thus searching it is impractical. In this paper, the above problems are addressed for data-dominant affine loop kernels. A novel methodology is presented that reduces the exploration space of six code optimizations by many orders of magnitude. The objective can be execution time (ET), energy consumption (E) or the number of L1, L2 and main memory accesses. The exploration space is reduced in two phases: first, by applying a novel register blocking algorithm and a novel loop tiling algorithm, and second, by computing the maximum and minimum ET/E values for each optimization set. The proposed methodology has been evaluated for both embedded and general-purpose CPUs and for seven well-known algorithms, achieving high memory access, speedup and energy consumption gain values (from 1.17 up to 40) over the gcc compiler, hand-written optimized code and Polly. The exploration space from which the near-optimum parameters are selected is reduced by 17 to 30 orders of magnitude.
Generating Loop Patterns with a Genetic Algorithm and a Probabilistic Cellular Automata Rule
The objective is to find a Cellular Automata (CA) rule that can generate “loop patterns”. A loop pattern is given by ones on a zero background showing loops. In order to find out how loop patterns can be locally defined, tentative loop patterns are generated by a genetic algorithm in a preliminary stage. A set of local matching tiles is designed and checked whether they can produce the aimed loop patterns by the genetic algorithm. After having approved a certain set of tiles, a probabilistic CA rule is designed in a methodical way. Templates are derived from the tiles, which then are used in the CA rule for matching. In order to drive the evolution to the desired patterns, noise is injected if the templates do not match or other constraints are not fulfilled. Simulations illustrate that loops and connected loops can be evolved by the CA rule.
Space-Time Loop Tiling for Dynamic Programming Codes
We present a new space-time loop tiling approach and demonstrate its application for the generation of parallel tiled code of enhanced locality for three dynamic programming algorithms. The technique envisages that, for each loop nest statement, sub-spaces are first generated so that the intersection of them results in space tiles. Space tiles can be enumerated in lexicographical order or in parallel by using the wave-front technique. Then, within each space tile, time slices are formed, which are enumerated in lexicographical order. Target tiles are represented with multiple time slices within each space tile. We explain the basic idea of space-time loop tiling and then illustrate it by means of an example. Then, we present a formal algorithm and prove its correctness. The algorithm is implemented in the publicly available TRACO compiler. Experimental results demonstrate that parallel codes generated by means of the presented approach outperform closely related manually generated ones or those generated by using affine transformations. The main advantage of code generated by means of the presented approach is its enhanced locality due to splitting each larger space tile into multiple smaller tiles represented with time slices.
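The wave-front enumeration of tiles mentioned in this abstract can be illustrated with a generic sketch. It is not the TRACO algorithm. It assumes a hypothetical 2D grid of tiles in which each tile depends only on its upper and left neighbours, so all tiles on the same anti-diagonal are independent and could run in parallel. The function name `wavefronts` is an assumption made for this example.

```python
def wavefronts(rows, cols):
    """Group tile coordinates (i, j) of a rows x cols grid by anti-diagonal.

    Assumption: tile (i, j) depends only on (i-1, j) and (i, j-1), so
    every tile with the same i + j can be executed concurrently once
    the previous front has finished.
    """
    fronts = []
    for d in range(rows + cols - 1):
        lo = max(0, d - cols + 1)   # smallest valid row index on diagonal d
        hi = min(rows, d + 1)       # one past the largest valid row index
        fronts.append([(i, d - i) for i in range(lo, hi)])
    return fronts


if __name__ == "__main__":
    for step, front in enumerate(wavefronts(2, 3)):
        print(step, front)
```

Fronts must be processed in order, while the tiles within one front have no dependences on each other under the stated assumption.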
A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures
Current compilers cannot generate code that can compete with hand-tuned code in efficiency, even for a simple kernel like matrix–matrix multiplication (MMM). A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and the number of levels of tiling. Selecting the scheduling parameter values is a very difficult and time-consuming task, since the parameter values depend on each other; this is why they are found by using search methods and empirical techniques. To overcome this problem, the scheduling sub-problems must be optimized together, as one problem and not separately. In this paper, an MMM methodology is presented where the optimum scheduling parameters are found by decreasing the search space theoretically, while the major scheduling sub-problems are addressed together as one problem and not separately, according to the hardware architecture parameters and input size; for different hardware architecture parameters and/or input sizes, a different implementation is produced. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and hardware architecture parameters (e.g., data cache sizes and associativities), giving high-quality solutions and a smaller search space. This methodology applies to a wide range of CPU and GPU architectures.
A methodology pruning the search space of six compiler transformations by addressing them together as one problem and by exploiting the hardware architecture details
Today’s compilers have a plethora of optimizations and transformations to choose from, and the correct choice, order, and parameters of these transformations have a large impact on performance. Choosing the correct order and parameters of optimizations is a long-standing problem in compilation research that remains unsolved: optimizing the sub-problems separately gives a different schedule/binary for each sub-problem, and these schedules cannot coexist, since refining one degrades the other. Researchers try to solve this problem by using iterative compilation techniques, but the search space is so big that it cannot be searched even by using modern supercomputers. Moreover, compiler transformations do not take into account the hardware architecture details and data reuse in an efficient way. In this paper, a new iterative compilation methodology is presented which reduces the search space of six compiler transformations by addressing the above problems; the search space is reduced by many orders of magnitude, and thus an efficient solution can now be found. The transformations are the following: loop tiling (including the number of levels of tiling), loop unrolling, register allocation, scalar replacement, loop interchange and data array layouts. The search space is reduced (a) by addressing the aforementioned transformations together as one problem and not separately, and (b) by taking into account the custom hardware architecture details (e.g., cache size and associativity) and algorithm characteristics (e.g., data reuse). The proposed methodology has been evaluated against iterative compilation and gcc/icc compilers, on both embedded and general-purpose processors; it achieves significant performance gains at many orders of magnitude lower compilation time.