Towards Efficient Transformer Scaling
Dissertation
2024
Overview
In recent years, Transformer-based deep learning models have exhibited remarkable performance across a myriad of tasks. A pivotal advantage of the Transformer architecture lies in its scalability, spanning dimensions such as dataset size, parameter count, and computational budget. This scaling capability empowers Transformers to attain substantial improvements and even unlock novel capabilities, enabling the accomplishment of tasks previously deemed impossible. However, the pursuit of scaling comes at a considerable cost, limiting the progress of deep learning due to resource constraints. This thesis addresses this challenge by exploring a series of strategies to enhance the efficiency of Transformer scaling.

Firstly, the introduction of more trainable parameters can significantly enhance performance but demands increased memory usage. To address this trade-off, we present WideNet, a model that optimizes parameter efficiency by leveraging parameter sharing and Mixture-of-Experts, achieving superior results in both computer vision and natural language tasks.

Secondly, when training different Transformer models with distinct objectives at the same scale, we often adopt uniform configurations, such as width and depth. Our investigation into the relationship between Transformer configuration and training objective reveals that token-level training aligns better with deeper and narrower configurations, while sequence-level training encounters challenges in scaling depth due to over-smoothing.

Motivated by real-world applications that require processing lengthy input sequences (e.g., document understanding and medical image processing), we focus on scaling the Transformer along the sequence length from a training-system perspective. Our sequence parallelism approach achieves a 27× increase in maximum sequence length compared to previous methodologies.

Transformers also face limitations in handling fixed computation budgets at each scale, necessitating the deployment of multiple models at different scales to cater to diverse service levels. To address this, we introduce AdaTape, which enables adaptive computation with elastic input sequences, offering an improved cost-effectiveness trade-off and greater flexibility in utilizing foundation models.

Lastly, recent insights from the Transformer scaling community highlight the underestimated significance of dataset size. Rather than scaling trainable parameters faster than the dataset, achieving compute-optimal results requires proportional scaling of model parameters and training tokens. Our exploration of dataset scaling reveals potential limitations in further scaling up large language models, prompting ongoing research into this emerging challenge.
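To make the parameter-sharing idea described above more concrete, the following is a minimal PyTorch sketch of the general recipe: a single Transformer block with a Mixture-of-Experts feed-forward sublayer is reused at every depth, while each depth keeps its own layer normalization. The module names, sizes, and top-1 routing here are illustrative assumptions for exposition, not the thesis's actual WideNet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Mixture-of-Experts feed-forward: a router sends each token to one expert."""
    def __init__(self, d_model, d_ff, num_experts):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                          # x: (batch, seq, d_model)
        gates = F.softmax(self.router(x), dim=-1)  # routing probabilities per token
        top1 = gates.argmax(dim=-1)                # top-1 expert index per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (top1 == i).unsqueeze(-1)       # tokens routed to expert i
            out = out + mask * expert(x) * gates[..., i:i + 1]
        return out

class SharedMoETransformer(nn.Module):
    """One attention + one MoE sublayer shared across depth; per-depth LayerNorms."""
    def __init__(self, d_model=256, n_heads=4, d_ff=512, num_experts=4, depth=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.moe = MoEFeedForward(d_model, d_ff, num_experts)
        # Normalization layers are not shared, so each "virtual layer" can specialise.
        self.norms1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(depth)])
        self.norms2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(depth)])

    def forward(self, x):
        for ln1, ln2 in zip(self.norms1, self.norms2):
            h = ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]  # shared attention
            x = x + self.moe(ln2(x))                           # shared MoE FFN
        return x

tokens = torch.randn(2, 16, 256)                 # (batch, seq, d_model)
print(SharedMoETransformer()(tokens).shape)      # torch.Size([2, 16, 256])
```

Because only one block's weights are reused across depth, the parameter count grows with the number of experts rather than with depth, which is the trade-off the overview refers to.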
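The "proportional scaling of model parameters and training tokens" point can be illustrated with a small back-of-the-envelope calculation using the widely cited compute-optimal rule of thumb of roughly 20 training tokens per parameter and about 6 FLOPs per parameter per token. The figures below are illustrative and are not drawn from the thesis.

```python
# Illustrative compute-optimal scaling arithmetic: tokens grow in proportion
# to parameters (assumed ratio ~20 tokens per parameter), and training cost is
# approximated by the standard 6 * params * tokens FLOPs estimate.

def compute_optimal_tokens(params, tokens_per_param=20):
    """Training tokens suggested for a given parameter count."""
    return params * tokens_per_param

def training_flops(params, tokens):
    """Rough training cost: ~6 FLOPs per parameter per training token."""
    return 6 * params * tokens

for params in (1e9, 10e9, 70e9):                 # 1B, 10B, 70B parameters
    tokens = compute_optimal_tokens(params)
    flops = training_flops(params, tokens)
    print(f"{params / 1e9:>5.0f}B params -> {tokens / 1e9:>6.0f}B tokens, "
          f"{flops:.2e} training FLOPs")
```

Under this rule, a tenfold increase in parameters also calls for a tenfold increase in training tokens, which is why dataset size can become the binding constraint the overview highlights.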
Publisher
ProQuest Dissertations & Theses
ISBN
9798291511992