Search Results Heading

MBRLSearchResults

mbrl.module.common.modules.added.book.to.shelf
Title added to your shelf!
View what I already have on My Shelf.
Oops! Something went wrong.
Oops! Something went wrong.
While trying to add the title to your shelf something went wrong :( Kindly try again later!
Are you sure you want to remove the book from the shelf?
Oops! Something went wrong.
Oops! Something went wrong.
While trying to remove the title from your shelf something went wrong :( Kindly try again later!
    Done
    Filters
    Reset
  • Discipline
      Discipline
      Clear All
      Discipline
  • Is Peer Reviewed
      Is Peer Reviewed
      Clear All
      Is Peer Reviewed
  • Series Title
      Series Title
      Clear All
      Series Title
  • Reading Level
      Reading Level
      Clear All
      Reading Level
  • Year
      Year
      Clear All
      From:
      -
      To:
  • More Filters
      More Filters
      Clear All
      More Filters
      Content Type
    • Item Type
    • Is Full-Text Available
    • Subject
    • Country Of Publication
    • Publisher
    • Source
    • Target Audience
    • Donor
    • Language
    • Place of Publication
    • Contributors
    • Location
29 result(s) for "Loop tiling (Computer science)"
Sort by:
Looping with Disney Pixar Finding Dory
A simple, low-level, unplugged introduction to looping designed for young readers not yet ready for coding on computers. Beloved characters Dory and Nemo, from the world-famous Disney movie Finding Dory, draw in readers new to coding concepts-- Provided by publisher.
A Methodology for Efficient Tile Size Selection for Affine Loop Kernels
Reducing the number of data accesses in memory hierarchy is of paramount importance on modern computer systems. One of the key optimizations addressing this problem is loop tiling, a well-known loop transformation that enhances data locality in memory hierarchy. The selection of an appropriate tile size is tackled by using both static (analytical) and dynamic empirical (auto-tuning) methods. Current analytical models are not accurate enough to effectively model the complex modern memory hierarchies and loop kernels with diverse characteristics, while auto-tuning methods are either too time-consuming (due to the huge search space) or less accurate (when heuristics are used to reduce the search space). In this paper, we reveal two important inefficiencies of current analytical loop tiling methods and we provide the theoretical background on how current methods can address these inefficiencies. To this end, we propose a new loop tiling method for affine loop kernels where the cache size, cache line size and cache associativity are better utilized, compared to the existing methods. Our evaluation results prove the efficiency of the proposed method in terms of cache misses and execution time, against related works, icc/gcc compilers and Pluto tool, on x86 and ARM based platforms.
Time and Energy Benefits of Using Automatic Optimization Compilers for NPDP Tasks
In this article, we analyze the program codes generated automatically using three advanced optimizers: Pluto, Traco, and Dapt, which are specifically tailored for the NPDP benchmark set. This benchmark set comprises ten program loops, predominantly from the field of bioinformatics. The codes exemplify dynamic programming, a challenging task for well-known tools used in program loop optimization. Given the intricacy involved, we opted for three automatic compilers based on the polyhedral model and various loop-tiling strategies. During our evaluation of the code’s performance, we meticulously considered locality and concurrency to accurately estimate time and energy efficiency. Notably, we dedicated significant attention to the latest Dapt compiler, which applies space–time loop tiling to generate highly efficient code for the NPDP benchmark suite loops. By employing the aforementioned optimizers and conducting an in-depth analysis, we aim to demonstrate the effectiveness and potential of automatic transformation techniques in enhancing the performance and energy efficiency of dynamic programming codes.
Generating Loop Patterns with a Genetic Algorithm and a Probabilistic Cellular Automata Rule
The objective is to find a Cellular Automata (CA) rule that can generate “loop patterns”. A loop pattern is given by ones on a zero background showing loops. In order to find out how loop patterns can be locally defined, tentative loop patterns are generated by a genetic algorithm in a preliminary stage. A set of local matching tiles is designed and checked whether they can produce the aimed loop patterns by the genetic algorithm. After having approved a certain set of tiles, a probabilistic CA rule is designed in a methodical way. Templates are derived from the tiles, which then are used in the CA rule for matching. In order to drive the evolution to the desired patterns, noise is injected if the templates do not match or other constraints are not fulfilled. Simulations illustrate that loops and connected loops can be evolved by the CA rule.
Space-Time Loop Tiling for Dynamic Programming Codes
We present a new space-time loop tiling approach and demonstrate its application for the generation of parallel tiled code of enhanced locality for three dynamic programming algorithms. The technique envisages that, for each loop nest statement, sub-spaces are first generated so that the intersection of them results in space tiles. Space tiles can be enumerated in lexicographical order or in parallel by using the wave-front technique. Then, within each space tile, time slices are formed, which are enumerated in lexicographical order. Target tiles are represented with multiple time slices within each space tile. We explain the basic idea of space-time loop tiling and then illustrate it by means of an example. Then, we present a formal algorithm and prove its correctness. The algorithm is implemented in the publicly available TRACO compiler. Experimental results demonstrate that parallel codes generated by means of the presented approach outperform closely related manually generated ones or those generated by using affine transformations. The main advantage of code generated by means of the presented approach is its enhanced locality due to splitting each larger space tile into multiple smaller tiles represented with time slices.
A methodology pruning the search space of six compiler transformations by addressing them together as one problem and by exploiting the hardware architecture details
Today’s compilers have a plethora of optimizations-transformations to choose from, and the correct choice, order as well parameters of transformations have a significant/large impact on performance; choosing the correct order and parameters of optimizations has been a long standing problem in compilation research, which until now remains unsolved; the separate sub-problems optimization gives a different schedule/binary for each sub-problem and these schedules cannot coexist, as by refining one degrades the other. Researchers try to solve this problem by using iterative compilation techniques but the search space is so big that it cannot be searched even by using modern supercomputers. Moreover, compiler transformations do not take into account the hardware architecture details and data reuse in an efficient way. In this paper, a new iterative compilation methodology is presented which reduces the search space of six compiler transformations by addressing the above problems; the search space is reduced by many orders of magnitude and thus an efficient solution is now capable to be found. The transformations are the following: loop tiling (including the number of the levels of tiling), loop unroll, register allocation, scalar replacement, loop interchange and data array layouts. The search space is reduced (a) by addressing the aforementioned transformations together as one problem and not separately, (b) by taking into account the custom hardware architecture details (e.g., cache size and associativity) and algorithm characteristics (e.g., data reuse). The proposed methodology has been evaluated over iterative compilation and gcc/icc compilers, on both embedded and general purpose processors; it achieves significant performance gains at many orders of magnitude lower compilation time.
Memory-Optimized Wavefront Parallelism on GPUs
Wavefront parallelism is a well-known technique for exploiting the concurrency of applications that execute nested loops with uniform data dependencies. Recent research of such applications, which range from sequence alignment tools to partial differential equation solvers, has used GPUs to benefit from the massively parallel computing resources. To achieve optimal performance, tiling has been introduced as a popular solution to achieve a balanced workload. However, the use of hyperplane tiles increases the cost of synchronization and leads to poor data locality. In this paper, we present a highly optimized implementation of the wavefront parallelism technique that harnesses the GPU architecture. A balanced workload and maximum resource utilization are achieved with an extremely low synchronization overhead. We design the kernel configuration to significantly reduce the minimum number of synchronizations required and also introduce an inter-block lock to minimize the overhead of each synchronization. In addition, shared memory is used in place of the L1 cache. The well-tailored mapping of the operations to the shared memory improves both spatial and temporal locality. We evaluate the performance of our proposed technique for four different applications: sequence alignment, edit distance, summed-area table, and 2D-SOR. The performance results demonstrate that our method achieves speedups of up to six times compared to the previous best-known hyperplane tiling-based GPU implementation.
Parallel Tiled Code for Computing General Linear Recurrence Equations
In this article, we present a technique that allows us to generate parallel tiled code to calculate general linear recursion equations (GLRE). That code deals with multidimensional data and it is computing-intensive. We demonstrate that data dependencies available in an original code computing GLREs do not allow us to generate any parallel code because there is only one solution to the time partition constraints built for that program. We show how to transform the original code to another one that exposes dependencies such that there are two linear distinct solutions to the time partition restrictions derived from these dependencies. This allows us to generate parallel 2D tiled code computing GLREs. The wavefront technique is used to achieve parallelism, and the generated code conforms to the OpenMP C/C++ standard. The experiments that we conducted with the resulting parallel 2D tiled code show that this code is much more efficient than the original serial code computing GLREs. Code performance improvement is achieved by allowing parallelism and better locality of the target code.
Revisiting the Parallel Strategy for DOACROSS Loops
DOACROSS loops are significant parts in many important scientific and engineering applications, which are generally exploited pipeline/wave-front parallelism by loop transformations. However, previous work almost statically performs iterations in parallel threads, thus causing a waste of computing resources in thread synchronization. This paper proposes a brand-new parallel strategy for DOACROSS loops that provides a dynamic task assignment with reduced dependences to achieve wave-front parallelism through loop tiling. The proposed strategy uses a master-slave parallel mode and some customized structures to realize dynamic and flexible parallelization, which effectively avoids threads from waiting in communication. An efficient tile size selection (TSS) approach is also proposed to preserve data reuse in cache for tiled codes. The experimental results show that the proposed parallel strategy obtains good and stable speedups over six typical benchmarks with different problem sizes and different numbers of threads on an Intel ® Xeon ® 32-core server. And it outperforms two static strategies, a barrier-based strategy and a post/wait-based strategy, by 32% and 20% in average performance, respectively. This strategy also yields a better performance than a mutex-based dynamic strategy. Besides, it has been demonstrated that the proposed TSS approach can achieve a near-optimal performance and is comparable with a state-of-the-art TSS approach.