Fine-Grained Paging Mechanism for Offloading-Reloading Tensor for LLM
Dissertation

2025
Overview
The rapid growth in the size and complexity of large language models has imposed severe challenges on memory management, particularly when these models are deployed on GPUs with limited memory. This thesis introduces a fine-grained paging mechanism that dynamically offloads and reloads tensors at the granularity of individual operations, thereby mitigating out-of-memory (OOM) errors during the prefill phase of inference in transformer-based models. In place of traditional static, layer-based offloading methods, the proposed approach uses compile-time, simulation-based memory allocation to optimize GPU memory usage, making execution possible under severe memory constraints.

This work is based on the Einsummable system, a framework that represents tensor computations using Einstein summation notation. Einsummable transforms high-level mathematical specifications into an optimized execution pipeline through a series of intermediate representations, notably the TASKGRAPH and the MEMGRAPH. The TASKGRAPH captures the data dependencies and operational flow of tensor computations, while the MEMGRAPH extends this representation by incorporating detailed memory-location information and managing offload-reload operations. The transformation from TASKGRAPH to MEMGRAPH is achieved through a simulated execution process, the core of this thesis, that relies on two key components: an allocation horizon, which pre-allocates memory for future operations, and an execution horizon, which tracks the simulated execution progress of the computation.

A key contribution of this thesis is the design and implementation of specialized memory-allocation routines: simMalloc, simMallocForceReld, and simMallocOffld. These routines not only allocate memory for tensor outputs but also manage dependencies by inserting offload and reload nodes into the MEMGRAPH whenever GPU memory is depleted.
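As a rough illustration of how such a simulated allocation step might insert offload nodes, the sketch below mimics a simMalloc-style routine: when a simulated allocation does not fit, resident tensors are evicted (recording offload nodes in a MEMGRAPH-like list) until it does. The routine name comes from the abstract; the data structures and eviction policy here are illustrative assumptions, not the actual Einsummable implementation.

```python
def sim_malloc(memgraph, resident, capacity, tensor_id, size, pick_victim):
    """Simulate allocating `size` bytes for `tensor_id` on the GPU.

    `resident` maps tensor id -> size currently held in simulated GPU
    memory. When the allocation does not fit, victims are offloaded
    (appending offload nodes to `memgraph`) until there is room; an
    alloc node for the new tensor is then recorded.
    """
    used = sum(resident.values())
    while used + size > capacity and resident:
        victim = pick_victim(resident)        # choose a tensor to evict
        used -= resident.pop(victim)
        memgraph.append(("offload", victim))  # record the transfer node
    if used + size > capacity:
        raise MemoryError("tensor larger than simulated GPU memory")
    resident[tensor_id] = size
    memgraph.append(("alloc", tensor_id))

# Tiny usage example with a 100-byte simulated GPU and a trivial
# (alphabetical) victim policy:
memgraph, resident = [], {}
sim_malloc(memgraph, resident, 100, "w0", 60, min)
sim_malloc(memgraph, resident, 100, "w1", 30, min)
sim_malloc(memgraph, resident, 100, "w2", 50, min)  # forces an offload of "w0"
```

Because the whole process runs at compile time over a simulated schedule, the recorded offload and reload nodes become ordinary dependencies in the MEMGRAPH rather than runtime surprises.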
By leveraging full knowledge of the simulated execution order, our offload-reload heuristic selects tensors for offloading based on their computed reuse distance, thereby deferring memory transfers until they are most convenient. This future-aware strategy reduces the frequency and impact of memory transfers compared to reactive approaches, enabling finer control over GPU memory usage.

Extensive experimental evaluations were conducted on two NVIDIA GPU configurations, Tesla P100 and V100, to benchmark the performance of the proposed system against state-of-the-art techniques such as ZeRO-Inference. The evaluation focused on the prefill stage of inference in LLaMA models with 7B and 65B parameters, a phase known to be particularly memory-bound. The results demonstrate that the fine-grained paging mechanism supports a broader range of configurations, successfully executing inference tasks across varying batch sizes and sequence lengths. While the finer granularity of tensor-level management introduces some communication overhead due to more frequent offloading and reloading, the overall improvements in memory utilization and the reduction in OOM errors outweigh these costs.

In summary, this thesis contributes to the field of deep learning by addressing the critical challenge of GPU memory constraints through a fine-grained paging mechanism. Future work will explore further optimizations to reduce communication overhead, overall computation latency, and GPU RAM utilization.
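The future-aware victim choice described above can be sketched as a Belady-style policy: since the simulated execution order is fully known, the resident tensor whose next use lies farthest in the future is offloaded first. The function names and schedule representation below are illustrative assumptions, not the thesis's actual code.

```python
def next_use(tensor_id, schedule, current_step):
    """Step index of the tensor's next use, or infinity if never reused."""
    for step in range(current_step + 1, len(schedule)):
        if tensor_id in schedule[step]:
            return step
    return float("inf")

def pick_offload_victim(resident, schedule, current_step):
    """Among resident tensors, offload the one reused farthest in the future."""
    return max(sorted(resident),
               key=lambda t: next_use(t, schedule, current_step))

# Example: each simulated step consumes a set of tensor ids.
schedule = [{"a", "b"}, {"b"}, {"a"}, {"c"}]
victim = pick_offload_victim({"a", "b", "c"}, schedule, current_step=0)
# "b" is reused at step 1 and "a" at step 2, while "c" is not needed
# until step 3, so "c" is chosen for offloading.
```

The reuse distance is what lets the mechanism defer transfers: a tensor with a large distance can be offloaded now and reloaded well before its next use, overlapping the transfer with other work.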
Publisher
ProQuest Dissertations & Theses
ISBN
9798290947051