43 result(s) for "Castrillon, Jeronimo"
Compile- and run-time approaches for the selection of efficient data structures for dynamic graph analysis
Graphs are used to model a wide range of systems from different disciplines including social network analysis, biology, and big data processing. When analyzing these constantly changing dynamic graphs at a high frequency, performance is the main concern. Depending on the graph size and structure, update frequency, and read accesses of the analysis, the use of different data structures can yield great performance variations. Even for expert programmers, it is not always obvious which data structure is the best choice for a given scenario. In previous work, we presented an approach for handling the selection of the most efficient data structures automatically using a compile-time approach well-suited for constant workloads. We extend this work with a measurement study of seven data structures and use the results to fit actual cost estimation functions. In addition, we evaluate our approach for the computations of seven different graph metrics. In analyses of real-world dynamic graphs with a constant workload, our approach achieves a speedup of up to 5.4× compared to basic data structure configurations. Such a compile-time approach cannot yield optimal results when the behavior of the system changes later and the workload becomes non-constant. To close this gap, we present a run-time approach that provides live profiling and facilitates automatic exchanges of data structures during execution. We analyze the performance of this approach using an artificial, non-constant workload where our approach achieves speedups of up to 7.3× compared to basic configurations.
Component-based waveform development: the Nucleus tool flow for efficient and portable software defined radio
With the advent of multi-processor systems on chip (MPSoCs) and due to the complexity and variety of modern wireless standards, academia and industry are moving towards software defined radio (SDR) solutions. It is the goal of the SDR approach to allow designers to describe a radio standard or waveform by means of a high level language. This allows faster waveform development cycles and makes it easier to migrate waveforms across different platforms. Out of many software paradigms, component-based software engineering (CBSE) is an attractive match for SDR, especially for baseband applications. It abstracts waveforms in the traditional way algorithm designers think of their applications and guarantees a high degree of portability. However, existing CBSE approaches for SDR have not been able to close the gap between specification and implementation so as to achieve the computational performance and the energy efficiency of handcrafted solutions. The main reason for this gap is that these flows rely on traditional compilers to lower the high level specification to the platform. The work presented in this paper builds on the Nucleus Concept (Ramakrishnan et al., IEEE Military Communications Conference (MILCOM 2009) [28]) in which computationally intensive kernels and their implementation characteristics on the target platform are known. This information allows a tool to close the performance gap, and thus enables efficient component-based SDR development. In this paper we present such a flow and its supporting environment, which includes state-of-the-art tools for system level design. The flow is demonstrated on a MIMO OFDM transceiver.
MING: An Automated CNN-to-Edge MLIR HLS framework
Driven by the increasing demand for low-latency and real-time processing, machine learning applications are steadily migrating toward edge computing platforms, where Field-Programmable Gate Arrays (FPGAs) are widely adopted for their energy efficiency compared to CPUs and GPUs. To generate high-performance and low-power FPGA designs, several frameworks built upon High Level Synthesis (HLS) vendor tools have been proposed, among which MLIR-based frameworks are gaining significant traction due to their extensibility and ease of use. However, existing state-of-the-art frameworks often overlook the stringent resource constraints of edge devices. To address this limitation, we propose MING, a Multi-Level Intermediate Representation (MLIR)-based framework that abstracts and automates the HLS design process. Within this framework, we adopt a streaming architecture with carefully managed buffers, specifically designed to handle resource constraints while ensuring low latency. In comparison with recent frameworks, our approach achieves on average 15x speedup for standard Convolutional Neural Network (CNN) kernels with up to four layers, and up to 200x for single-layer kernels. For kernels with larger input sizes, MING is capable of generating efficient designs that respect hardware resource constraints, which state-of-the-art frameworks struggle to meet.
CoMoNM: A Cost Modeling Framework for Compute-Near-Memory Systems
Compute-Near-Memory (CNM) systems offer a promising approach to mitigate the von Neumann bottleneck by bringing computational units closer to data. However, optimizing for these architectures remains challenging due to their unique hardware and programming models. Existing CNM compilers often rely on manual programmer annotations for offloading and optimizations. Automating these decisions by exploring the optimization space, common in CPU/GPU systems, is difficult for CNMs as constructing and navigating the transformation space is tedious and time consuming. This is particularly the case during system-level design, where evaluation requires time-consuming simulations. To address this, we present CoMoNM, a generic cost modeling framework for CNM systems for execution time estimation in milliseconds. It takes a high-level, hardware-agnostic application representation, target system specifications, and a mapping specification as input and estimates the execution time for the given application on the target CNM system. We show how CoMoNM can be seamlessly integrated into state-of-the-art CNM compilers, providing improved offloading decisions. Evaluation on established benchmarks for CNM shows estimation errors within 7.80% and 2.99%, when compared to the real UPMEM CNM system and Samsung's HBM-PIM simulator. Notably, CoMoNM delivers estimates seven orders of magnitude faster compared to the UPMEM and HBM-PIM simulators.
Efficient Implementation of Application-Aware Spinlock Control in MPSoCs
Spinlocks are a common technique in Multi-Processor Systems-on-Chip (MPSoCs) to protect shared resources and prevent data corruption. Without a priori application knowledge, the control of spinlocks is often highly random, which can degrade system performance significantly. To improve this, a centralized control mechanism for spinlocks is proposed in this paper, which utilizes application-specific information during spinlock control. The complete control flow is presented, from the integration of high-level user-defined information down to a low-level realization of the control. An Application-Specific Instruction-set Processor (ASIP) called OSIP, which was originally designed for task scheduling and mapping, is extended to support this mechanism. The case studies demonstrate the high efficiency of the proposed approach and at the same time highlight the efficiency and flexibility advantages of using an ASIP as the system controller in MPSoCs.
Spiking and Event-driven Neuromorphic Mamba Models for Efficient Speech Recognition
Deep learning has greatly advanced automatic speech recognition (ASR), enabling widespread deployment on edge devices such as smartphones and smart home systems. However, the computational and energy demands of deep neural networks pose significant challenges for such resource-constrained deployments, introducing latency and limiting real-time interaction. Neuromorphic computing offers a promising solution by introducing activation sparsity through spiking neural networks (SNNs) and event-driven neural networks, converting dense operations into sparse computations. However, a study that evaluates the hardware benefits of different neuromorphic strategies remains lacking for ASR. This paper explores spiking and event-driven neuromorphic neural networks to improve activation sparsity in the state-of-the-art SpeechMamba model for ASR. We introduce an event-driven SpeechMamba with FATReLU activation, achieving over 60% activation sparsity with less than 1% accuracy degradation on LibriSpeech. We also propose a spiking SpeechMamba that attains over 70% sparsity while using 30% fewer parameters than comparable SNNs. Finally, we develop a cycle-accurate event-driven simulator enabling flexible algorithm-hardware co-exploration, which helps us identify computational bottlenecks and yields over 10% additional efficiency improvements.
Demonstrating a Future for MLIR-native DSL Compilers on a NumPy-like Example
Compilers for general-purpose languages have been shown to be at a disadvantage when it comes to specialized application domains as opposed to their Domain-Specific Language (DSL) counterparts. However, the field of DSL compilers features little consolidation in terms of compiler frameworks and adjacent software ecosystems. As a result, considerable work is duplicated, lost to maintenance issues, or remains undiscovered, and most DSLs are never considered "production-ready". One notable development is the introduction of the Multi-Level Intermediate Representation (MLIR), which promises a similar impact on DSL compilers as LLVM had on general-purpose tooling. In this work, we present a NumPy-like DSL made for offloading numeric tensor kernels that is entirely MLIR-native. In a first for open-source, it implements all frontend actions and semantic analyses directly within MLIR. Most notably, this is made possible by our new dialect-agnostic MLIR type checker, created for the future of DSLs in MLIR. We implement a simple, yet effective, parallel-first lowering scheme that connects our language to another MLIR dataflow dialect for seamless offloading. We show that our approach performs well in real-world use cases from the domain of weather modeling and Computational Fluid Dynamics (CFD) in Fortran.
Leveraging Stochastic Depth Training for Adaptive Inference
Dynamic DNN optimization techniques such as layer-skipping offer increased adaptability and efficiency gains but can lead to i) a larger memory footprint as in decision gates, ii) increased training complexity (e.g., with non-differentiable operations), and iii) less control over performance-quality trade-offs due to their inherent input-dependent execution. To address these issues, we propose a simpler yet effective alternative for adaptive inference that is zero-overhead, single-model, and time-predictable. Central to our approach is the observation that models trained with Stochastic Depth -- a method for faster training of residual networks -- become more resilient to arbitrary layer-skipping at inference time. We propose a method to first select near Pareto-optimal skipping configurations from a stochastically-trained model to adapt the inference at runtime later. Compared to original ResNets, our method shows improvements of up to 2X in power efficiency at accuracy drops as low as 0.71%.
Optimized Communication Architecture of MPSoCs with a Hardware Scheduler: A System-Level Analysis
Efficient runtime resource management in multi-processor systems-on-chip (MPSoCs) for achieving high performance and low energy consumption is one of the key challenges for system designers. OSIP, an operating system application-specific instruction-set processor, together with its well-defined programming model, provides a promising solution. It delivers high computational performance to deal with dynamic task scheduling and mapping. Being programmable, it can easily be adapted to different systems. However, the distributed computation among the different processing elements introduces complexity to the communication architecture, which tends to become the bottleneck of such systems. In this work, the authors highlight the vital importance of the communication architecture for OSIP-based systems and optimize the communication architecture. Furthermore, the effects of OSIP and the communication architecture are investigated jointly from the system point of view, based on a broad case study for a real life application (H.264) and a synthetic benchmark application.
Energy-efficient Runtime Resource Management for Adaptable Multi-application Mapping
Modern embedded computing platforms consist of a large number of heterogeneous resources, which allows executing multiple applications on a single device. The number of running applications on the system varies with time and so does the amount of available resources. This has considerably increased the complexity of analysis and optimization algorithms for runtime mapping of firm real-time applications. To reduce the runtime overhead, researchers have proposed to pre-compute partial mappings at compile time and have the runtime efficiently compute the final mapping. However, most existing solutions only compute a fixed mapping for a given set of running applications, and the mapping is defined for the entire duration of the workload execution. In this work, we allow applications to adapt to the amount of available resources by using mapping segments. This way, applications may switch between different configurations with varied degrees of parallelism. We present a runtime manager for firm real-time applications that generates such mapping segments based on partial solutions and aims at minimizing the overall energy consumption without deadline violations. The proposed algorithm outperforms the state-of-the-art approaches on the overall energy consumption by up to 13% while incurring an order of magnitude less scheduling overhead.