Catalogue Search | MBRL

On Generative Spoken Language Modeling from Raw Audio

by Bolte, Benjamin , Kharitonov, Eugene , Baevski, Alexei in Acoustics , Automatic text generation , Computation and Language

2021

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.

Journal Article

Generative Spoken Dialogue Language Modeling

by Nguyen, Tu Anh , Kharitonov, Eugene , Tomasello, Paden in Computation and Language , Computer Science , Conversation

2023

We introduce dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn taking compared to a text-based cascaded model.

Journal Article

Audio Conditioning for Music Generation via Discrete Bottleneck Features

by Rouard, Simon , Défossez, Alexandre , Copet, Jade in Conditioning , Music

2024

While most music generation models use textual or parametric conditioning (e.g. tempo, harmony, musical genre), we propose to condition a language-model-based music generation system with audio input. Our exploration involves two distinct strategies. The first strategy, termed textual inversion, leverages a pre-trained text-to-music model to map audio input to corresponding "pseudowords" in the textual embedding space. For the second model we train a music language model from scratch jointly with a text conditioner and a quantized audio feature extractor. At inference time, we can mix textual and audio conditioning and balance them thanks to a novel double classifier free guidance method. We conduct automatic and human studies that validate our approach. We will release the code and we provide music samples on https://musicgenstyle.github.io in order to show the quality of our model.

Paper

Short window attention enables long-term memorization

by Jégou, Hervé , Szilvasy, Gergely , Lomeli, Maria in Attention , Sliding

2026

Recent works show that hybrid architectures combining local sliding window attention layers and global attention layers outperform either of these architectures taken separately. However, the impact of the window length and the interplay between local layers and global layers remain under-studied. In this work, we first analyze the interaction between short and long term memory by considering SWAX: a hybrid architecture consisting of sliding-window attention and xLSTM linear RNN layers. A counter-intuitive finding is that larger sliding windows hurt the long-context performance. In fact, short window attention encourages the model to better train the long-term memory of the xLSTM as it cannot rely on the local softmax attention mechanism for long-context retrieval. We also validate our findings on local-global architectures alternating short window and full attention layers: the short layers should be small in order not to hinder the usefulness of the long layers. However, employing too small sliding windows is detrimental even for short-context tasks, which could be solved with information from moderately larger sliding windows otherwise. Therefore, we train hybrid architectures by stochastically changing the sliding window size, forcing the model to leverage both the short term window and the long-term memory. Training with stochastic window sizes significantly outperforms regular window attention both on short and long-context problems.

Paper
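
As a rough illustration of the stochastic window-size training described in the abstract above, the sketch below builds a causal sliding-window attention mask and draws a new window size at each training step. The abstract does not say how window sizes are sampled; the uniform draw, the candidate set `WINDOW_CHOICES`, and the function names are assumptions made for this example, and no attention or xLSTM layer is implemented.

```python
import random

# Hypothetical candidate window sizes; the abstract gives no values.
WINDOW_CHOICES = (2, 4, 8)


def sliding_window_mask(seq_len, window):
    """Return mask[i][j], True when query i may attend to key j.

    Attention is causal (j <= i) and limited to the `window` most
    recent positions, counting position i itself.
    """
    return [
        [j <= i and i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def sample_window(rng, choices=WINDOW_CHOICES):
    """Draw one window size uniformly at random (an assumption)."""
    return rng.choice(choices)


if __name__ == "__main__":
    rng = random.Random(0)
    for step in range(3):
        window = sample_window(rng)
        mask = sliding_window_mask(8, window)
        # The last query attends to min(window, seq_len) keys.
        print(step, window, sum(mask[-1]))
```

Because the window changes from step to step in this sketch, a local attention layer cannot count on a fixed span, which is the kind of pressure toward the long-term memory that the abstract describes.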

ASR4REAL: An extended benchmark for speech models

by Synnaeve, Gabriel , Copet, Jade , Riviere, Morgane in Benchmarks , Economic models , Human bias

2021

Popular ASR benchmarks such as Librispeech and Switchboard are limited in the diversity of settings and speakers they represent. We introduce a set of benchmarks matching real-life conditions, aimed at spotting possible biases and weaknesses in models. We find that even though recent models do not seem to exhibit a gender bias, they usually show important performance discrepancies by accent, and even more important ones depending on the socio-economic status of the speakers. Finally, all tested models show a strong performance drop when tested on conversational speech, and in this precise context even a language model trained on a dataset as big as Common Crawl does not seem to have a significant positive effect, which reiterates the importance of developing conversational language models.

Paper

RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

by Cohen, Taco , Carbonneaux, Quentin , Synnaeve, Gabriel in Feedback , Large language models

2025

Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve the desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B) models while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.

Paper
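
The multi-step use of execution feedback described in the abstract above can be pictured with the toy loop below. It is only a sketch of the inference-time interaction, not the reinforcement learning method itself: the `propose` callable stands in for an LLM, and the `solve` entry point, the test format, and the turn limit are all hypothetical choices made for this example.

```python
def run_public_tests(source, tests):
    """Execute candidate code and return (passed, feedback message)."""
    namespace = {}
    try:
        exec(source, namespace)  # toy setting only; no sandboxing here
        for args, expected in tests:
            got = namespace["solve"](*args)
            if got != expected:
                return False, f"solve{args} returned {got!r}, expected {expected!r}"
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "all public tests passed"


def solve_with_feedback(propose, tests, max_turns=3):
    """Ask for a candidate, run it, and pass the feedback to the next turn.

    `propose(feedback)` is a hypothetical stand-in for the model; it
    receives None on the first turn and the last feedback afterwards.
    """
    feedback = None
    for turn in range(1, max_turns + 1):
        source = propose(feedback)
        passed, feedback = run_public_tests(source, tests)
        if passed:
            return source, turn
    return None, max_turns


if __name__ == "__main__":
    # Scripted "model": a wrong first attempt, then a corrected one.
    attempts = iter([
        "def solve(a, b):\n    return a - b\n",
        "def solve(a, b):\n    return a + b\n",
    ])
    print(solve_with_feedback(lambda fb: next(attempts), [((1, 2), 3)]))
```

In a real system the feedback string would be appended to the model's prompt; here it is simply handed to the stand-in callable.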

Short window attention enables long-term memorization

by Jégou, Hervé , Szilvasy, Gergely , Lomeli, Maria in Attention , Recurrent neural networks , Sliding

2025

Recent works show that hybrid architectures combining sliding window softmax attention layers with linear recurrent neural network (RNN) layers outperform both of these architectures taken separately. However, the impact of the window length and the interplay between softmax attention and linear RNN layers remain under-studied. In this work, we introduce SWAX, a hybrid architecture consisting of sliding-window attention and xLSTM linear RNN layers. A counter-intuitive finding with SWAX is that larger sliding windows do not improve the long-context performance. In fact, short window attention encourages the model to better train the long-term memory of the xLSTM, by relying less on the softmax attention mechanism for long-context retrieval. The issue with small sliding windows is that they are detrimental for short-context tasks, which could be solved with information from moderately larger sliding windows otherwise. Therefore, we train SWAX by stochastically changing the sliding window size, forcing the model to leverage both a longer context window and the xLSTM memory. SWAX trained with stochastic window sizes significantly outperforms regular window attention both on short and long-context problems.

Paper

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

by Singh, Rishabh , Wei, Yuxiang , Wang, Sida I in Coding , Evolution , Large language models

2025

The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data -- the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even emerged with generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely, function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline even leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.

Paper
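
The rule-based reward mentioned in the abstract above, a similarity score between the ground-truth and LLM-generated solutions, could look like the sketch below. The abstract does not name a similarity measure; using Python's `difflib.SequenceMatcher` on raw patch text, and the function name `similarity_reward`, are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity_reward(reference_patch: str, generated_patch: str) -> float:
    """Score a generated patch against the reference, in [0.0, 1.0].

    Assumption: character-level sequence similarity stands in for the
    "similarity score" that the abstract mentions without detail.
    """
    return SequenceMatcher(None, reference_patch, generated_patch).ratio()


if __name__ == "__main__":
    reference = "-    return a - b\n+    return a + b\n"
    candidate = "-    return a - b\n+    return b + a\n"
    print(round(similarity_reward(reference, candidate), 3))
```

A reward of this kind needs no test execution, which fits the abstract's description of it as lightweight.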

RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

by Cohen, Taco , Synnaeve, Gabriel , Gehring, Jonas in Feedback , Large language models

2024

Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B) models while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.

Paper

Pushing the performances of ASR models on English and Spanish accents

by Copet, Jade , Chitkara, Pooja , Zhang, Frank in English language , Speech recognition

2022

Speech-to-text models tend to be trained and evaluated against a single target accent. This is especially true for English, for which native speakers from the United States became the main benchmark. In this work, we show how two simple methods, pre-trained embeddings and auxiliary classification losses, can improve the performance of ASR systems. We look for upgrades that are as universal as possible and therefore explore their impact on several model architectures and several languages.

Paper
