Asset Details
Fine-Grained Paging Mechanism for Offloading-Reloading Tensor for LLM
by Yao, Jiawen
in Artificial intelligence / Computer science / Engineering
2025
Dissertation
Overview
The rapid growth in the size and complexity of large language models has imposed severe challenges on memory management, particularly when these models are deployed on GPUs with limited memory. This thesis introduces a fine-grained paging mechanism that dynamically offloads and reloads tensors at the granularity of individual operations, thereby mitigating out-of-memory (OOM) issues during the inference and prefill phases of transformer-based models. Instead of traditional static, layer-based offloading methods, the proposed approach uses compile-time, simulation-based memory allocation to optimize GPU memory usage, making execution possible under severe memory constraints.

This work is based on the Einsummable system, a framework that represents tensor computations using Einstein summation notation. Einsummable transforms high-level mathematical specifications into an optimized execution pipeline through a series of intermediate representations, notably the TASKGRAPH and the MEMGRAPH. The TASKGRAPH captures the data dependencies and operational flow of tensor computations, while the MEMGRAPH extends this representation by incorporating detailed memory location information and managing offload-reload operations. The transformation from TASKGRAPH to MEMGRAPH is achieved through a simulated execution process, the core of this thesis, that relies on two key components: an allocation horizon, which pre-allocates memory for future operations, and an execution horizon, which tracks the simulated execution progress of the computation.

A key contribution of this thesis is the design and implementation of specialized memory allocation routines: simMalloc, simMallocForceReld, and simMallocOffld. These routines not only allocate memory for tensor outputs but also manage dependencies by inserting offload and reload nodes into the MEMGRAPH whenever GPU memory is depleted.
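The simulated allocation described above can be illustrated with a minimal sketch. This is a hypothetical Python model, not Einsummable's actual API: the class name `SimAllocator`, the method names, and the `(op, tensor)` node tuples are all illustrative, and the eviction order here is a placeholder rather than the thesis's reuse-distance heuristic.

```python
class SimAllocator:
    """Toy model of simulated allocation during TASKGRAPH -> MEMGRAPH lowering."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.resident = {}   # tensor id -> size currently resident on the GPU
        self.memgraph = []   # ordered MEMGRAPH nodes as (op, tensor) tuples

    def sim_malloc(self, tensor, size):
        """Allocate output memory, inserting offload nodes when memory runs out."""
        # Offload resident tensors until the request fits. A real system would
        # pick victims by reuse distance; we evict in insertion order here.
        while self.used + size > self.capacity:
            victim, vsize = next(iter(self.resident.items()))
            self.memgraph.append(("offload", victim))
            self.used -= vsize
            del self.resident[victim]
        self.memgraph.append(("alloc", tensor))
        self.resident[tensor] = size
        self.used += size

    def sim_malloc_force_reload(self, tensor, size):
        """Bring a previously offloaded tensor back before its consumer runs."""
        if tensor not in self.resident:
            self.sim_malloc(tensor, size)
            self.memgraph.append(("reload", tensor))
```

The key property the sketch preserves is that offload and reload operations become explicit nodes in the MEMGRAPH, so the downstream executor sees memory traffic as ordinary graph dependencies rather than runtime surprises.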
By leveraging full knowledge of the simulated execution order, our offload-reload heuristic selects tensors for offloading based on their computed reuse distance, thereby deferring memory transfers until they are most convenient. This future-aware strategy reduces the frequency and impact of memory transfers compared to reactive approaches, enabling finer control over GPU memory usage.

Extensive experimental evaluations were conducted on two NVIDIA GPU configurations, Tesla P100 and V100, to benchmark the performance of the proposed system against state-of-the-art techniques such as ZERO-Inference. The evaluation focused on the prefill stage of inference in LLAMA models with 7B and 65B parameters, a phase known to be particularly memory-bound. The results demonstrate that the fine-grained paging mechanism supports a broader range of configurations, successfully executing inference tasks across varying batch sizes and sequence lengths. While the finer granularity of tensor-level management introduces some communication overhead due to more frequent offloading and reloading, the overall improvements in memory utilization and the reduction in OOM errors outweigh these costs.

In summary, this thesis contributes to the field of deep learning by addressing the critical challenge of GPU memory constraints through a fine-grained paging mechanism. Future work will explore further optimizations to reduce communication overhead, overall computation latency, and GPU RAM utilization.
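The reuse-distance idea can be sketched as follows: with the full simulated execution order in hand, the allocator can evict the resident tensor whose next use lies farthest in the future (a Belady-style choice). This is an illustrative sketch only; the function names and the set-per-step `schedule` representation are assumptions, not the thesis's implementation.

```python
def next_use(tensor, schedule, now):
    """Steps until `tensor` is next consumed after position `now` (inf if never).

    `schedule` is the simulated execution order: a list where each entry is the
    set of tensor ids consumed at that step.
    """
    for step in range(now + 1, len(schedule)):
        if tensor in schedule[step]:
            return step - now
    return float("inf")

def pick_offload_victim(resident, schedule, now):
    """Choose the resident tensor with the largest reuse distance to offload."""
    return max(resident, key=lambda t: next_use(t, schedule, now))
```

Evicting the tensor with the largest reuse distance defers its reload as long as possible, which is what lets the compile-time simulation batch transfers at convenient points instead of reacting to pressure at runtime.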
Publisher
ProQuest Dissertations & Theses
ISBN
9798290947051