Asset Details
Medium LMs of Code in the Era of LLMs: Lessons From StackOverflow
by Mukherjee, Manisha; Hellendoorn, Vincent J
in Large language models / Mathematical models / Natural language processing / Parameters / Questions / Software engineering / Training
2024
Paper
Overview
Large pre-trained neural language models have brought immense progress to both NLP and software engineering. Models in OpenAI's GPT series now dwarf Google's BERT and Meta's RoBERTa, which previously set new benchmarks on a wide range of NLP applications. These models are trained on massive corpora of heterogeneous data from web crawls, which enables them to learn general language patterns and semantic relationships. However, the largest models are both expensive to train and deploy and are often closed-source, so we lack access to their data and design decisions. We argue that this trend towards large, general-purpose models should be complemented with single-purpose, more modestly sized pre-trained models. In this work, we take StackOverflow (SO) as a domain example in which large volumes of rich, aligned code and text data are available. We adopt standard practices for pre-training large language models, including a very large context size (2,048 tokens), batch size (0.5M tokens), and training set (27B tokens), coupled with a powerful toolkit (Megatron-LM), to train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $187 and $800 each. We compare the performance of our models with both the previous SOTA model trained exclusively on SO data as well as general-purpose BERT models and OpenAI's ChatGPT on four SO-specific downstream tasks: question quality prediction, closed question prediction, named entity recognition, and obsoletion prediction (a new task we introduce). Not only do our models consistently outperform all baselines, but the smaller model is often sufficient for strong results. Both models are released to the public. These results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
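Since the abstract notes that both checkpoints (SOBertBase, 109M parameters; SOBertLarge, 762M parameters) are publicly released and pre-trained with a 2,048-token context, a natural way to try them is as drop-in encoders for an SO-specific classifier such as question quality prediction. The sketch below is hypothetical: the checkpoint identifier, label count, and Hugging Face availability are assumptions for illustration, not details stated in this record.

```python
# Minimal, hypothetical sketch: using a released SOBert checkpoint as a
# sequence classifier for a StackOverflow task (e.g. question quality).
# "SOBertBase" below is an assumed hub ID; substitute the authors' actual
# published checkpoint name.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "SOBertBase"  # assumed; SOBertLarge is the 762M-parameter variant

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

question = "How do I parse JSON in Python without external libraries?"
# The models were pre-trained with a 2,048-token context, so long SO posts
# need far less truncation than with a standard 512-token BERT.
inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=2048)

with torch.no_grad():
    logits = model(**inputs).logits
print("predicted class:", torch.argmax(logits, dim=-1).item())
```

In practice the classification head would be fine-tuned on labeled SO data before the predictions are meaningful; the point of the sketch is only the loading and tokenization flow.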
Publisher
Cornell University Library, arXiv.org
Related Items