Paper

WARP: On the Benefits of Weight Averaged Rewarded Policies

Ramé, Alexandre; Sessa, Pier Giuseppe; Dadashi, Robert; Girgin, Sertan; Hussenot, Léonard; Douillard, Arthur; Ferret, Johan; Bachem, Olivier; Vieillard, Nino; Cedoz, Pierre-Louis

2024

Overview

Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by encouraging their generations to have high rewards, using a reward model trained on human preferences. To prevent the forgetting of pre-trained knowledge, RLHF usually incorporates a KL regularization; this forces the policy to remain close to its supervised fine-tuned initialization, though it hinders the reward optimization. To tackle the trade-off between KL and reward, in this paper we introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP). WARP merges policies in the weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new enhanced one. Third, it linearly interpolates between this merged model and the initialization, to recover features from pre-training. This procedure is then applied iteratively, with each iteration's final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front, achieving superior rewards at fixed KL. Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
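The three weight-space operations named in the overview can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the flat-vector representation of the weights, the default values of `mu`, `lam` and `eta`, and the choice to apply spherical interpolation to differences from the initialization are all assumptions made for illustration only.

```python
import numpy as np


def ema_update(anchor, policy, mu=0.01):
    """Stage 1 (sketch): move the EMA anchor a small step toward the policy.

    `mu` is a hypothetical update rate, not a value taken from the paper.
    """
    return (1.0 - mu) * anchor + mu * policy


def slerp_merge(init, theta_a, theta_b, lam=0.5):
    """Stage 2 (sketch): spherically interpolate two fine-tuned policies.

    Assumption: the interpolation acts on the differences from `init`,
    and the result is added back to `init`.
    """
    delta_a = theta_a - init
    delta_b = theta_b - init
    norm_a = np.linalg.norm(delta_a)
    norm_b = np.linalg.norm(delta_b)
    linear = init + (1.0 - lam) * delta_a + lam * delta_b
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle is undefined for a zero difference: fall back to linear.
        return linear
    cos_omega = np.dot(delta_a, delta_b) / (norm_a * norm_b)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    sin_omega = np.sin(omega)
    if np.isclose(sin_omega, 0.0):
        # (Anti)parallel differences: the spherical formula degenerates.
        return linear
    w_a = np.sin((1.0 - lam) * omega) / sin_omega
    w_b = np.sin(lam * omega) / sin_omega
    return init + w_a * delta_a + w_b * delta_b


def interpolate_towards_init(init, merged, eta=0.5):
    """Stage 3 (sketch): linear interpolation between init and merged weights.

    eta = 0 returns the initialization, eta = 1 returns the merged model.
    """
    return (1.0 - eta) * init + eta * merged


if __name__ == "__main__":
    init = np.zeros(2)
    theta_a = np.array([1.0, 0.0])
    theta_b = np.array([0.0, 1.0])
    merged = slerp_merge(init, theta_a, theta_b, lam=0.5)
    print(np.linalg.norm(merged))  # norm of the merged difference from init
    print(interpolate_towards_init(init, merged, eta=0.5))
```

In this toy case the two differences from the initialization are orthogonal unit vectors, so the spherical midpoint keeps unit norm, whereas a plain average would shrink it. In the iterative procedure described above, the output of `interpolate_towards_init` would serve as the initialization for the next round.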

Publisher

Cornell University Library, arXiv.org

Subject

Alignment / Large language models / Policies / Regularization / Weight