Asset Details
Medium LMs of Code in the Era of LLMs: Lessons From StackOverflow
by Mukherjee, Manisha; Hellendoorn, Vincent J
in Large language models / Mathematical models / Natural language processing / Parameters / Questions / Software engineering / Training
2024
Paper
Overview
Large pre-trained neural language models have brought immense progress to both NLP and software engineering. Models in OpenAI's GPT series now dwarf Google's BERT and Meta's RoBERTa, which previously set new benchmarks on a wide range of NLP applications. These models are trained on massive corpora of heterogeneous data from web crawls, which enables them to learn general language patterns and semantic relationships. However, the largest models are both expensive to train and deploy and are often closed-source, so we lack access to their data and design decisions. We argue that this trend towards large, general-purpose models should be complemented with single-purpose, more modestly sized pre-trained models. In this work, we take StackOverflow (SO) as a domain example in which large volumes of rich, aligned code and text data are available. We adopt standard practices for pre-training large language models, including a very large context size (2,048 tokens), batch size (0.5M tokens), and training set (27B tokens), coupled with a powerful toolkit (Megatron-LM), to train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $187 and $800 each. We compare the performance of our models with both the previous SOTA model trained exclusively on SO data as well as general-purpose BERT models and OpenAI's ChatGPT on four SO-specific downstream tasks: question quality prediction, closed question prediction, named entity recognition, and obsoletion prediction (a new task we introduce). Not only do our models consistently outperform all baselines, but the smaller model is often sufficient for strong results. Both models are released to the public. These results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
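Since the abstract notes that both checkpoints (SOBertBase, 109M parameters; SOBertLarge, 762M parameters) are publicly released and pre-trained with a 2,048-token context, a natural way to try them is as drop-in encoders for an SO-specific classifier such as question quality prediction. The sketch below is hypothetical: the checkpoint identifier, label count, and Hugging Face availability are assumptions for illustration, not details stated in this record.

```python
# Minimal, hypothetical sketch: using a released SOBert checkpoint as a
# sequence classifier for a StackOverflow task (e.g. question quality).
# "SOBertBase" below is an assumed hub ID; substitute the authors' actual
# published checkpoint name.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "SOBertBase"  # assumed; SOBertLarge is the 762M-parameter variant

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

question = "How do I parse JSON in Python without external libraries?"
# The models were pre-trained with a 2,048-token context, so long SO posts
# need far less truncation than with a standard 512-token BERT.
inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=2048)

with torch.no_grad():
    logits = model(**inputs).logits
print("predicted class:", torch.argmax(logits, dim=-1).item())
```

In practice the classification head would be fine-tuned on labeled SO data before the predictions are meaningful; the point of the sketch is only the loading and tokenization flow.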
Publisher
Cornell University Library, arXiv.org
Related Items