Catalogue Search | MBRL
804 result(s) for "CUDA"
CUDA programming : a developer's guide to parallel computing with GPUs
2013,2012
If you need to learn CUDA but don't have experience with parallel computing, CUDA Programming: A Developer's Introduction offers a detailed guide to CUDA with a grounding in parallel fundamentals. It starts by introducing CUDA and bringing you up to speed on GPU parallelism and hardware, then delving into CUDA installation. Chapters on core concepts including threads, blocks, grids, and memory focus on both parallel and CUDA-specific issues. Later, the book demonstrates CUDA in practice for optimizing applications, adjusting to new hardware, and solving common problems.
Comprehensive introduction to parallel programming with CUDA, for readers new to both
Detailed instructions help readers optimize the CUDA software development kit
Practical techniques illustrate working with memory, threads, algorithms, resources, and more
Covers CUDA on multiple hardware platforms: Mac, Linux and Windows with several NVIDIA chipsets
Each chapter includes exercises to test reader knowledge
Development of a 3D Hybrid Finite-Discrete Element Simulator Based on GPGPU-Parallelized Computation for Modelling Rock Fracturing Under Quasi-Static and Dynamic Loading Conditions
by Fukuda Daisuke; Liu, Hongyuan; Fujii Yoshiaki
in Agreements, Algorithms, Compressive strength
2020
As a state-of-the-art computational method for simulating rock fracturing and fragmentation, the combined finite-discrete element method (FDEM) has become widely accepted since Munjiza (2004) published his comprehensive book on FDEM. This study developed a general-purpose graphic-processing-unit (GPGPU)-parallelized FDEM using Compute Unified Device Architecture (CUDA) C/C++ based on the authors’ former sequential two-dimensional (2D) and three-dimensional (3D) Y-HFDEM IDE (integrated development environment) code. The theory and algorithm of the GPGPU-parallelized 3D Y-HFDEM IDE code are first introduced by focusing on the implementation of the contact detection algorithm, which is different from that in the sequential code, contact damping and contact friction. 3D modelling of the failure process of limestone under quasi-static loading conditions in uniaxial compressive strength (UCS) tests and Brazilian tensile strength (BTS) tests is then conducted using the GPGPU-parallelized 3D Y-HFDEM IDE code. The 3D FDEM modelling results show that mixed-mode I–II failures are the dominant failure mechanisms along the shear and splitting failure planes in the UCS and BTS models, respectively, with unstructured meshes. Pure mode I splitting failure planes and pure mode II shear failure planes are only possible in the UCS and BTS models, respectively, with structured meshes. Subsequently, 3D modelling of the dynamic fracturing of marble in dynamic Brazilian tests with a split Hopkinson pressure bar (SHPB) apparatus is conducted using the GPGPU-parallelized 3D HFDEM IDE code considering the entire SHPB testing system. The modelled failure process, final fracture pattern and time histories of the dynamic compressive wave, reflective tensile wave and transmitted compressive wave are compared quantitatively and qualitatively with those from experiments, and good agreement is achieved between them.
The computing performance analysis shows the GPGPU-parallelized 3D HFDEM IDE code is 284 times faster than its sequential version and can achieve the computational complexity of O(N). The results demonstrate that the GPGPU-parallelized 3D Y-HFDEM IDE code is a valuable and powerful numerical tool for investigating rock fracturing under quasi-static and dynamic loading conditions in rock engineering applications although very fine elements with maximum element size no bigger than the length of the fracture process zone must be used in the area where fracturing process is modelled.
Journal Article
GPGPU Programming for Dipolar Field Calculation
2024
Accelerating computational processes is paramount in numerical infrastructure development, particularly in applications such as the finite element method (FEM) and extensive calculations for simulating 3D processes in materials. In this work, we introduce a novel technique for computing the magnetostatic field of an ellipsoid particle, leveraging CUDA on a graphics card for parallel processing. The implementation on a GPU resulted in a remarkable 20-fold improvement in calculation speed. This achievement not only expedites research tasks, but also enables the exploration of larger and more intricate simulations, facilitating quicker model refinements and deeper insights into material behaviours under various conditions. The utilization of GPU computing aligns with the broader trend in scientific research and engineering, offering a versatile solution for diverse computational challenges beyond this specific task of magnetism. Overall, our work contributes to the ongoing effort to harness high-performance computing (HPC) technologies for accelerated and more efficient simulations in materials science and related fields.
Journal Article
Accelerating k-Means on GPU with CUDA Programming
2020
We accelerate the basic k-Means algorithm on a GPU using CUDA, a programming model by NVIDIA. Experimental data show a maximum speedup of 67.752, while other teams report 20 to 40. We also find that the basic k-Means algorithm is most sensitive to the cluster size k, less sensitive to the dataset size b, and least sensitive to the dimension d. In addition, we find that CUDA shared memory improves performance, but the gain depends on which factor is scaled.
Journal Article
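As an illustration of the k-Means record above, the following is a minimal CPU sketch of one iteration of the basic algorithm. It is not the authors' CUDA code: the function names `assign_points` and `update_centroids` are hypothetical, and the comment about GPU threads describes a common mapping, not one confirmed by the abstract. In this naive form, the assignment step does work proportional to the product of the dataset size b, the cluster count k, and the dimension d.

```python
def assign_points(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    # Each point is handled independently of the others; a GPU version
    # could (as an assumption) give each point to its own thread.
    for p in points:
        best_index, best_dist = 0, float("inf")
        for j, c in enumerate(centroids):
            # Squared Euclidean distance over d dimensions.
            dist = sum((a - b) ** 2 for a, b in zip(p, c))
            if dist < best_dist:
                best_index, best_dist = j, dist
        labels.append(best_index)
    return labels


def update_centroids(points, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p, label in zip(points, labels):
        counts[label] += 1
        for i in range(dim):
            sums[label][i] += p[i]
    # An empty cluster keeps its previous centroid.
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(len(centroids))
    ]


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
labels = assign_points(points, centroids)
new_centroids = update_centroids(points, labels, centroids)
```

The two steps would be repeated until the labels stop changing.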
RECENT CAPABILITY AND PERFORMANCE ENHANCEMENTS OF THE WHOLE-CORE TRANSPORT CODE nTRACER
2021
The whole-core transport code nTRACER has made many advances in recent years. Several innovative cross section treatment methods were developed, a new axial transport solver was introduced for stabilizing the 2D/1D scheme, and substantial computational enhancements were achieved using NVIDIA CUDA and Intel Math Kernel Library (MKL). In addition, a gamma transport solver was implemented to predict the power distributions more physically, and the flexibility of the restart calculation was improved using an offline processing code, nTIG (nTRACER Input Generator). This paper is a compilation of recent progress in nTRACER development.
Journal Article
A CUDA-based parallel optimization method for SM3 hash algorithm
2024
Hash algorithms are among the most crucial algorithms in cryptography. The SM3 algorithm is a hash cryptographic standard of China. Because of the strong collision resistance and irreversibility of hash algorithms, they are widely used as a basic function in various fields such as digital signatures and random number generation. With the increasing real-time applications of automation in the fields of finance and office, the network puts forward higher demands for the implementation efficiency of the SM3 algorithm. We present a CUDA-based parallel optimized method for the SM3 algorithm in four different ways: Single data stream with Single thread (SS), Multiple data streams with Single thread (MS), Single data stream with Multi-thread (SM), and Multiple data streams with Multi-thread (MM). The experimental results show that MM is the best of the four. When considering the data transmission between CPU and GPU, the proposed optimized algorithm achieves a peak performance of 166.42 Gb/s, which is 1.96 times that of the best-known implementation of the SM3 algorithm on GPU platforms. Without counting transmission time, the peak performance is near 8500 Gb/s. Compared with other SM3 GPU algorithms, the algorithm proposed in this paper significantly enhances the efficiency of digest generation. Furthermore, the results show a new conclusion that the optimization of logical operations in the SM3 algorithm has reached a very high extent and the data transmission of PCIE becomes the bottleneck in the CPU+GPU data processing mode. Therefore, future work on the optimization of the SM3 algorithm should pay more attention to the PCIE data transfer efficiency.
Journal Article
Accelerating the Lagrangian Particle Tracking in Hydrologic Modeling to Continental‐Scale
2023
Unprecedented climate change and anthropogenic activities have induced increasing ecohydrological problems, motivating the development of large‐scale hydrologic modeling for solutions. Water age/quality is as important as water quantity for understanding the terrestrial water cycle. However, scientific progress in tracking water parcels at large‐scale with high spatiotemporal resolutions is far behind that in simulating water balance/quantity owing to the lack of powerful modeling tools. EcoSLIM is a particle tracking model working with ParFlow‐CLM that couples integrated surface‐subsurface hydrology with land surface processes. Here, we demonstrate a parallel framework on distributed, multi‐Graphics Processing Unit platforms with Compute Unified Device Architecture‐Aware Message Passing Interface for accelerating EcoSLIM to continental‐scale. In tests from catchment‐, to regional‐, and then to continental‐scale using 25‐million to 1.6‐billion particles, EcoSLIM shows significant speedup and excellent parallel performance. The parallel framework is portable to atmospheric and oceanic particle tracking models, where the parallelization is inadequate, and a standard parallel framework is also absent. The parallelized EcoSLIM is a promising tool to accelerate our understanding of the terrestrial water cycle and the upscaling of subsurface hydrology to Earth System Models.
Plain Language Summary
Studies of water ages at multiple spatiotemporal scales are urgent to better understand the connections between different hydrologic compartments. Climate change and anthropogenic activities make this requirement more pressing. Lagrangian particle tracking is a powerful tool to simulate water ages. However, it is computationally demanding, which hampers its wide application. In this study, we provide a Lagrangian particle tracking model, EcoSLIM, with a novel parallel framework that enables it to handle large‐scale water age simulations with high spatiotemporal resolutions.
This work combined the efforts of engineers and scientists from multiple disciplines and could not have been achieved with the knowledge of a single discipline. To the best of our knowledge, such a modeling tool is absent in communities of hydrology and Earth Surface Processes. In tests from catchment‐, to regional‐, and then to continental‐scale using 25‐million to 1.6‐billion particles, EcoSLIM shows significant speedup and excellent parallel performance. Although we take EcoSLIM as an example here, the parallel framework is portable to other particle tracking models in Earth System Science, such as those in atmospheric and oceanic disciplines. The parallelized EcoSLIM is a promising tool for the hydrologic community and Earth System Model developers for scientific exploration.
Key Points
Numerical models for large‐scale water age/quality simulations are absent in communities of hydrology and Earth Surface Processes
A parallel framework for accelerating Lagrangian particle tracking to continental‐scale on distributed, multi‐Graphics Processing Unit platforms is established
The parallelized particle tracking model, EcoSLIM, is a promising tool to accelerate our understanding of the terrestrial water cycle
Journal Article
CUDA application design and development
by Farber, Rob
in Application software, Application software -- Development, Computer architecture
2011
As the computer industry retools to leverage massively parallel graphics processing units (GPUs), this book is designed to meet the needs of working software developers who need to understand GPU programming with CUDA and increase efficiency in their projects. CUDA Application Design and Development starts with an introduction to parallel computing concepts for readers with no previous parallel experience, and focuses on issues of immediate importance to working software developers: achieving high performance, maintaining competitiveness, analyzing CUDA benefits versus costs, and determining application lifespan. The book then details the thought behind CUDA and teaches how to create, analyze, and debug CUDA applications. Throughout, the focus is on software engineering issues: how to use CUDA in the context of existing application code, with existing compilers, languages, software tools, and industry-standard API libraries. Using an approach refined in a series of well-received articles at Dr. Dobb's Journal, author Rob Farber takes the reader step-by-step from fundamentals to implementation, moving from language theory to practical coding.
Includes multiple examples building from simple to more complex applications in four key areas: machine learning, visualization, vision recognition, and mobile computing
Addresses the foundational issues for CUDA development: multi-threaded programming and the different memory hierarchy
Includes teaching chapters designed to give a full understanding of CUDA tools, techniques and structure
Presents CUDA techniques in the context of the hardware they are implemented on as well as other styles of programming that will help readers bridge into the new material
Efficient GPU Parallel Implementation and Optimization of ARIA for Counter and Exhaustive Key-Search Modes
2025
This paper proposes an optimized shared memory access technique to enhance parallel processing performance and reduce memory accesses for the ARIA block cipher in GPU environments. To overcome the limited size of GPU shared memory, we merged ARIA’s four separate S-box tables into a single unified 32-bit table, effectively reducing the total memory usage from 4 KB to 1 KB. This allowed the consolidated table to be replicated 32 times within the limited shared memory, efficiently resolving the memory-bank conflict issues frequently encountered during parallel execution. Additionally, we utilized CUDA’s built-in function __byte_perm() to efficiently reconstruct the desired outputs from the reduced unified table, without imposing additional computational overhead. In exhaustive key-search scenarios, we implemented an on-the-fly key-expansion method, significantly reducing the memory usage per thread and enhancing parallel processing efficiency. In the RTX 3060 environment, profiling was performed to accurately analyze shared memory efficiency and the performance degradation caused by bank conflicts, yielding detailed profiling results. The results of experiments conducted on the RTX 3060 Mobile and RTX 4080 GPUs demonstrated significant performance improvements over conventional methods. Notably, the RTX 4080 GPU achieved a maximum throughput of 1532.42 Gbps in ARIA-CTR mode, clearly validating the effectiveness and practical applicability of the proposed optimization techniques. On the RTX 3060, the performance of 128-bit ARIA-CTR was improved by 2.34× compared to previous state-of-the-art implementations. Furthermore, for exhaustive key searches on the 128-bit ARIA block cipher, a throughput of 1365.84 Gbps was achieved on the RTX 4080 GPU.
Journal Article
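The table-merging idea in the ARIA record above can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the paper's implementation: `byte_perm` is a Python emulation of the selector behaviour of CUDA's `__byte_perm(x, y, s)` intrinsic, the four 8-bit tables are hypothetical placeholder permutations (not the real ARIA S-boxes), and the packing order of the unified table is assumed. With 256 entries of 32 bits, four separate tables occupy 4 × 1024 bytes, while one packed table occupies 1024 bytes, which is consistent with the 4 KB to 1 KB reduction the abstract reports.

```python
def byte_perm(x, y, s):
    """Emulate __byte_perm: pick 4 bytes from the 8 bytes of x (0-3) and y (4-7)."""
    source = ((y & 0xFFFFFFFF) << 32) | (x & 0xFFFFFFFF)
    result = 0
    for n in range(4):
        selector = (s >> (4 * n)) & 0x7           # low 3 bits of nibble n
        byte = (source >> (8 * selector)) & 0xFF  # selected source byte
        result |= byte << (8 * n)                 # placed at output byte n
    return result


# Hypothetical 8-bit tables standing in for four S-boxes (NOT ARIA's values).
# Odd multipliers make each table a permutation of 0..255.
tables = [[(i * m + k) & 0xFF for i in range(256)]
          for k, m in enumerate((5, 7, 11, 13))]

# Assumed packing: table k supplies byte k of each 32-bit unified entry.
unified = [
    tables[0][i] | (tables[1][i] << 8) | (tables[2][i] << 16) | (tables[3][i] << 24)
    for i in range(256)
]


def lookup(k, i):
    """Extract table k's output for index i from the unified table.

    Selector nibble 4 reads byte 0 of y, which is zero here, so the
    upper three output bytes are cleared.
    """
    return byte_perm(unified[i], 0, 0x4440 | k)


def lookup_in_top_byte(k, i):
    """Same lookup, but with the result placed in output byte 3."""
    return byte_perm(unified[i], 0, 0x0444 | (k << 12))
```

Changing the selector moves the chosen byte to a different output position without a separate shift or mask, which is the kind of reconstruction the abstract attributes to `__byte_perm()`.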
Load-Swing Attenuation in a Quadcopter–Payload System Through Trajectory Optimisation
by Khatamianfar, Arash; Feng, Barry
in CUDA-accelerated tag detection, Drones, load-swing attenuation
2025
Advancements in multi-rotor quadcopter technology and sensing capabilities have led to their increased utilisation for last-mile delivery. However, battery capacity constraints limit their use in extended-distance delivery scenarios. A visual servoing implementation is first proposed that leverages a CUDA-accelerated tag detection algorithm for real-time pose estimation of the target. A new approach is then developed to enhance quadcopter package collection by implementing a control scheme to attenuate aggressive load-swing in a payload arm that shifts from horizontal to vertical after obtaining a vertically mounted payload. The motion of the payload arm imposes a shift in the system’s centre of mass, leading to a possible instability. A non-linear control scheme is then introduced to address this problem through attenuation of the residual energy from payload oscillation. The performance of the visual servoing approach is validated through both numerical simulations and a physical quadcopter implementation, along with the performance of the load-swing attenuation through numerical simulations.
Journal Article