Towards Efficient Transformer Scaling
Dissertation
2024
Overview
In recent years, Transformer-based deep learning models have exhibited remarkable performance across a myriad of tasks. A pivotal advantage of the Transformer architecture lies in its scalability, spanning dimensions such as dataset size, parameter count, and computational budget. This scaling capability empowers Transformers to attain substantial improvements and even unlock novel capabilities, enabling the accomplishment of tasks previously deemed impossible. However, the pursuit of scaling comes at a considerable cost, limiting the progress of deep learning due to resource constraints. This thesis addresses this challenge by exploring a series of strategies to enhance the efficiency of Transformer scaling.

Firstly, the introduction of more trainable parameters can significantly enhance performance but demands increased memory usage. To address this trade-off, we present WideNet, a model that optimizes parameter efficiency by leveraging parameter sharing and Mixture-of-Experts, achieving superior results in both computer vision and natural language tasks.

Secondly, when training different Transformer models with distinct objectives at the same scale, we often adopt uniform configurations, such as width and depth. Our investigation into the relationship between Transformer configuration and training objective reveals that token-level training aligns better with deeper and narrower configurations, while sequence-level training encounters challenges in scaling depth due to over-smoothing.

Motivated by real-world applications that require processing lengthy input sequences (e.g., document understanding and medical image processing), we focus on scaling the Transformer along the sequence length from a training-system perspective. Our sequence parallelism approach achieves a 27× increase in maximum sequence length compared to previous methodologies.

Transformers also face limitations in handling fixed computation budgets at each scale, necessitating the deployment of multiple models at different scales to cater to diverse service levels. To address this, we introduce AdaTape, which enables adaptive computation with elastic input sequences, offering an improved cost-effectiveness trade-off and greater flexibility in utilizing foundation models.

Lastly, recent insights from the Transformer scaling community highlight the underestimated significance of dataset size. Rather than scaling trainable parameters faster than the dataset, achieving compute-optimal results requires proportional scaling of model parameters and training tokens. Our exploration of dataset scaling reveals potential limitations in further scaling up large language models, prompting ongoing research into this emerging challenge.
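To make the parameter-sharing idea described above more concrete, the following is a minimal PyTorch sketch of the general recipe: a single Transformer block with a Mixture-of-Experts feed-forward sublayer is reused at every depth, while each depth keeps its own layer normalization. The module names, sizes, and top-1 routing here are illustrative assumptions for exposition, not the thesis's actual WideNet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Mixture-of-Experts feed-forward: a router sends each token to one expert."""
    def __init__(self, d_model, d_ff, num_experts):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                          # x: (batch, seq, d_model)
        gates = F.softmax(self.router(x), dim=-1)  # routing probabilities per token
        top1 = gates.argmax(dim=-1)                # top-1 expert index per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (top1 == i).unsqueeze(-1)       # tokens routed to expert i
            out = out + mask * expert(x) * gates[..., i:i + 1]
        return out

class SharedMoETransformer(nn.Module):
    """One attention + one MoE sublayer shared across depth; per-depth LayerNorms."""
    def __init__(self, d_model=256, n_heads=4, d_ff=512, num_experts=4, depth=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.moe = MoEFeedForward(d_model, d_ff, num_experts)
        # Normalization layers are not shared, so each "virtual layer" can specialise.
        self.norms1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(depth)])
        self.norms2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(depth)])

    def forward(self, x):
        for ln1, ln2 in zip(self.norms1, self.norms2):
            h = ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]  # shared attention
            x = x + self.moe(ln2(x))                           # shared MoE FFN
        return x

tokens = torch.randn(2, 16, 256)                 # (batch, seq, d_model)
print(SharedMoETransformer()(tokens).shape)      # torch.Size([2, 16, 256])
```

Because only one block's weights are reused across depth, the parameter count grows with the number of experts rather than with depth, which is the trade-off the overview refers to.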
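The "proportional scaling of model parameters and training tokens" point can be illustrated with a small back-of-the-envelope calculation using the widely cited compute-optimal rule of thumb of roughly 20 training tokens per parameter and about 6 FLOPs per parameter per token. The figures below are illustrative and are not drawn from the thesis.

```python
# Illustrative compute-optimal scaling arithmetic: tokens grow in proportion
# to parameters (assumed ratio ~20 tokens per parameter), and training cost is
# approximated by the standard 6 * params * tokens FLOPs estimate.

def compute_optimal_tokens(params, tokens_per_param=20):
    """Training tokens suggested for a given parameter count."""
    return params * tokens_per_param

def training_flops(params, tokens):
    """Rough training cost: ~6 FLOPs per parameter per training token."""
    return 6 * params * tokens

for params in (1e9, 10e9, 70e9):                 # 1B, 10B, 70B parameters
    tokens = compute_optimal_tokens(params)
    flops = training_flops(params, tokens)
    print(f"{params / 1e9:>5.0f}B params -> {tokens / 1e9:>6.0f}B tokens, "
          f"{flops:.2e} training FLOPs")
```

Under this rule, a tenfold increase in parameters also calls for a tenfold increase in training tokens, which is why dataset size can become the binding constraint the overview highlights.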
Publisher
ProQuest Dissertations & Theses
ISBN
9798291511992