Catalogue Search | MBRL

On Generative Spoken Language Modeling from Raw Audio

by Bolte, Benjamin; Kharitonov, Eugene; Baevski, Alexei in Acoustics, Automatic text generation, Computation and Language

2021

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.

Journal Article

A neural speech decoding framework leveraging deep learning and speech synthesis

by Wang, Yao; Friedman, Daniel; Wang, Ran in 631/378/116/2394, 631/378/2619/2618, Accuracy

2024

Decoding human speech from neural signals is essential for brain–computer interface (BCI) technologies that aim to restore speech in populations with neurological deficits. However, it remains a highly challenging task, compounded by the scarce availability of neural signals with corresponding speech, data complexity and high dimensionality. Here we present a novel deep learning-based neural speech decoding framework that includes an ECoG decoder that translates electrocorticographic (ECoG) signals from the cortex into interpretable speech parameters and a novel differentiable speech synthesizer that maps speech parameters to spectrograms. We have developed a companion speech-to-speech auto-encoder consisting of a speech encoder and the same speech synthesizer to generate reference speech parameters to facilitate the ECoG decoder training. This framework generates natural-sounding speech and is highly reproducible across a cohort of 48 participants. Our experimental results show that our models can decode speech with high correlation, even when limited to only causal operations, which is necessary for adoption by real-time neural prostheses. Finally, we successfully decode speech in participants with either left or right hemisphere coverage, which could lead to speech prostheses in patients with deficits resulting from left hemisphere damage. Recent research has focused on restoring speech in populations with neurological deficits. Chen, Wang et al. develop a framework for decoding speech from neural signals, which could lead to innovative speech prostheses.

Journal Article

DESpeech: a dual-pass encoder approach for efficient speech-to-speech translation

by Ke, Dengfeng; Su, Kaile; Xu, Yanyan in Acoustics, Automatic speech recognition, Balancing

2026

Direct speech-to-speech translation (S2ST) systems have emerged as a promising approach for real-time cross-lingual communication. However, these systems face significant challenges in balancing translation quality with decoding efficiency. In this paper, we present DESpeech, a novel direct S2ST model that effectively addresses this challenge through a dual-pass encoder architecture. Our architecture decomposes translation into two specialized stages: acoustic feature extraction via a speech encoder and semantic understanding via a text encoder. This modular design enables optimal resource allocation while maintaining cross-modal information flow. To enhance performance, DESpeech employs discrete units as intermediate representations and adopts a multi-task learning framework that integrates automatic speech recognition and speech-to-text translation as auxiliary tasks. The dual-pass architecture allows for efficient pre-training integration and provides a natural framework for balancing computational efficiency with translation accuracy. Experiments on the CVSS-C and GigaS2S datasets show that DESpeech consistently outperforms or matches existing methods in terms of translation quality while achieving clear improvements in inference speed, indicating a promising approach for efficient S2ST with minimal quality degradation.

Journal Article

A New Speech Encoder Based on Dynamic Framing Approach

by Yue, Xiaoguang; Liu, Renyuan; Zhou, Xiaobing in Alignment, Coding, Embedding

2023

In speech synthesis, latent information is difficult to obtain from text alone. Studies show that features extracted from speech carry additional information that can help text encoding. Work on speech encoding has followed two main directions: encoding speech frame by frame, and encoding the whole speech signal into a single vector. In both cases the scale is fixed, so encoding speech at an adjustable scale to capture more latent information is worth investigating. However, current alignment approaches support only frame-by-frame encoding and speech-to-vector encoding, and designing an alignment approach that supports adjustable-scale speech encoding remains a challenge. This paper presents a dynamic speech encoder with a new alignment approach that works in conjunction with frame-by-frame encoding and speech-to-vector encoding. The speech feature produced by our model serves three functions. First, it can reconstruct the original speech while its length equals the text length. Second, our model can obtain a text embedding from speech, and the encoded speech feature is similar to the text embedding result. Finally, it can transfer the style of synthesized speech and make it more similar to the given reference speech.

Journal Article

Fusion-s2igan: an efficient and effective single-stage framework for speech-to-image generation

by Schomaker, Lambert; Zhang, Zhenxing in Artificial Intelligence, Computational Biology/Bioinformatics, Computational Science and Engineering

2024

The goal of a speech-to-image transform is to produce a photo-realistic picture directly from a speech signal. Current approaches are based on a stacked modular framework that suffers from three vital issues: (1) training separate networks is time-consuming and inefficient, and the convergence of the final generative model depends on the previous generators; (2) the quality of precursor images is ignored; (3) multiple discriminator networks need to be trained. We propose an efficient and effective single-stage framework called Fusion-S2iGan to yield perceptually plausible and semantically consistent image samples on the basis of spoken descriptions. Fusion-S2iGan introduces a visual+speech fusion module (VSFM), with a pixel-attention module (PAM), a speech-modulation module (SMM) and a weighted-fusion module (WFM), to inject the speech embedding from a speech encoder into the generator while improving the quality of synthesized pictures. The PAM models the semantic affinities between pixel regions by assigning larger weights to significant locations. The VSFM adopts the SMM to modulate visual feature maps using fine-grained linguistic cues present in the speech vector. Subsequently, the weighted-fusion module (WFM) captures the semantic importance of the image-attention mask and the speech-modulation module at the level of the channels, in an adaptive manner. Fusion-S2iGan spreads the bimodal information over all layers of the generator network to reinforce the visual feature maps at various hierarchical levels in the architecture. A series of experiments is conducted on four benchmark data sets: CUB birds, Oxford-102, Flickr8k and Places-subset. Results demonstrate the superiority of Fusion-S2iGan compared to the state-of-the-art models with a multi-stage architecture and a performance level that is close to traditional text-to-image approaches.

Journal Article

Speech-driven facial animation with spectral gathering and temporal attention

by CHAI, Yujin; ZHOU, Kun; WENG, Yanlin in Algorithms, Animation, Audio data

2022

In this paper, we present an efficient algorithm that generates lip-synchronized facial animation from a given vocal audio clip. By combining spectral-dimensional bidirectional long short-term memory and temporal attention mechanism, we design a light-weight speech encoder that learns useful and robust vocal features from the input audio without resorting to pre-trained speech recognition modules or large training data. To learn subject-independent facial motion, we use deformation gradients as the internal representation, which allows nuanced local motions to be better synthesized than using vertex offsets. Compared with state-of-the-art automatic-speech-recognition-based methods, our model is much smaller but achieves similar robustness and quality most of the time, and noticeably better results in certain challenging cases.

Journal Article

Video-driven speaker-listener generation based on Transformer and neural renderer

by Shao, Zhengxi; Chen, Jifeng; Liu, Qiong in Computer Communication Networks, Computer Science, Data Structures and Information Theory

2024

The traditional speaker-centric synthesis methods prioritize language accuracy but overlook emotional connection and feedback mechanisms with the audience. This paper is dedicated to an in-depth exploration of responsive speaker-listener generation, aiming to enhance communication by providing real-time non-verbal feedback such as head movements and facial expressions. Driven by video, we extract 3DMM coefficients to model facial features and head poses. Combining this with a Transformer speech encoder extracting 45-dimensional acoustic features, we achieve speaker generation at the sentence level. For responsive listener generation, we introduce two attention mechanisms in the Transformer decoder: cross-modal multi-head attention aligning audio-motion modalities and biased causal self-attention suitable for longer audio sequences. Finally, by aligning audio with a behavioral model and optimizing an enhanced neural renderer for facial images, we successfully achieve precise control over facial movements. Extensive experiments demonstrate the superiority of our approach compared to existing technologies.

Journal Article

CMU's IWSLT 2024 Simultaneous Speech Translation System

by Yan, Brian; Fernandes, Patrick; Neubig, Graham in Speech encoders

2024

This paper describes CMU's submission to the IWSLT 2024 Simultaneous Speech Translation (SST) task for translating English speech to German text in a streaming manner. Our end-to-end speech-to-text (ST) system integrates the WavLM speech encoder, a modality adapter, and the Llama2-7B-Base model as the decoder. We employ a two-stage training approach: initially, we align the representations of speech and text, followed by full fine-tuning. Both stages are trained on MuST-C v2 data with cross-entropy loss. We adapt our offline ST model for SST using a simple fixed hold-n policy. Experiments show that our model obtains an offline BLEU score of 31.1 and a BLEU score of 29.5 under 2 seconds latency on the MuST-C v2 tst-COMMON set.
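
The fixed hold-n policy mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common formulation of hold-n: after each incoming source chunk the system re-decodes a full hypothesis, withholds its last n tokens as unstable, and commits only the rest. The names `hold_n_stream` and `toy_translate` are hypothetical, and the toy translator (upper-casing each word) merely stands in for a real model; none of this is taken from the CMU system.

```python
from typing import Callable, Iterator, List, Sequence

# Hypothetical signature: (source seen so far, tokens already committed) -> hypothesis
Translate = Callable[[List[str], List[str]], List[str]]


def hold_n_stream(
    chunks: Sequence[Sequence[str]], translate: Translate, n: int
) -> Iterator[List[str]]:
    """Yield the newly committed target tokens after each source chunk."""
    seen: List[str] = []
    committed: List[str] = []
    for i, chunk in enumerate(chunks):
        seen.extend(chunk)
        hyp = translate(seen, committed)
        is_last = i == len(chunks) - 1
        # Withhold the last n tokens unless the source has ended.
        stable = hyp if is_last else hyp[: max(len(hyp) - n, 0)]
        # Emit only what extends the already committed output.
        new_tokens = stable[len(committed):]
        committed.extend(new_tokens)
        yield new_tokens


def toy_translate(source: List[str], prefix: List[str]) -> List[str]:
    """Stand-in for a real model: 'translates' by upper-casing each word."""
    return [word.upper() for word in source]


if __name__ == "__main__":
    stream = [["a", "b"], ["c"], ["d", "e"]]
    for step, emitted in enumerate(hold_n_stream(stream, toy_translate, n=2)):
        print(step, emitted)
```

A larger n delays output but reduces the chance of committing tokens that a later, better-informed hypothesis would have changed.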

Paper

Seal: Advancing Speech Language Models to be Few-Shot Learners

by Lei, Shuyu; Liu, Lingen; Guo, Xiang in Speech encoders

2024

Existing auto-regressive language models have demonstrated a remarkable capability to perform a new task with just a few examples in the prompt, without requiring any additional training. In order to extend this capability to a multi-modal setting (i.e. speech and language), this paper introduces the Seal model, an abbreviation for speech language model. It incorporates a novel alignment method, in which a Kullback-Leibler divergence loss is used to train a projector that bridges a frozen speech encoder with a frozen language model decoder. The resulting Seal model exhibits robust performance as a few-shot learner on two speech understanding tasks. Additionally, consistency experiments are conducted to validate its robustness on different pre-trained language models.
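
For reference, the Kullback-Leibler divergence between two discrete distributions $P$ and $Q$ over the same set of outcomes is defined as follows. Which pair of distributions Seal compares is not stated in this record, so $P$ and $Q$ are generic placeholders rather than the paper's notation.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
```

The quantity is non-negative and equals zero only when $P$ and $Q$ coincide, which is why minimizing it pulls one distribution toward the other.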

Paper

Thaka at KSAA-2026 Task 2: Regularized Fine-Tuning for Arabic Speech Diacritization

by Alamr, Meshal; Aldahlawi, Abdullah; Alqaeri, Hassan in Regularization, Speech encoders

2026

We describe the winning system for Task 2 of the KSAA-2026 Shared Task on Arabic Speech Dictation with Automatic Diacritization. The task requires producing fully diacritized Arabic text from speech audio and undiacritized transcripts, with only 2,327 training samples available and no external data permitted. Our system fine-tunes CATT-Whisper, a character-level multimodal model combining a pretrained CATT text encoder with a frozen Whisper speech encoder. The key to our approach is training regularization: R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, and Focal Loss. At inference, we average 200 stochastic forward passes across four model checkpoints using Monte Carlo Dropout at the softmax probability level. The system achieves 23.26% WER on the primary leaderboard metric (with case endings, including no-diacritic positions), placing 1st among all participants.
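
Averaging "at the softmax probability level" can be sketched as follows: each stochastic forward pass produces logits, these are converted to probabilities, and the probabilities (not the logits) are averaged across all passes and checkpoints before the argmax is taken. This is a toy illustration only. The noisy functions stand in for dropout-enabled checkpoints, the names `make_noisy_model` and `average_probabilities` are hypothetical, and splitting 200 passes as 50 per checkpoint is an assumption, since the abstract does not say how the passes are distributed.

```python
import math
import random
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_noisy_model(
    base_logits: Sequence[float], rng: random.Random, noise: float = 0.3
) -> Callable[[], List[float]]:
    """Stand-in for a checkpoint with dropout left on: each call perturbs the logits."""
    def forward() -> List[float]:
        return [x + rng.gauss(0.0, noise) for x in base_logits]
    return forward


def average_probabilities(
    models: Sequence[Callable[[], List[float]]], passes_per_model: int
) -> List[float]:
    """Average softmax probabilities over every pass of every model."""
    sums: List[float] = []
    count = 0
    for model in models:
        for _ in range(passes_per_model):
            probs = softmax(model())
            sums = probs if not sums else [s + p for s, p in zip(sums, probs)]
            count += 1
    return [s / count for s in sums]


if __name__ == "__main__":
    rng = random.Random(0)
    checkpoints = [make_noisy_model([2.0, 0.5, -1.0], rng) for _ in range(4)]
    avg = average_probabilities(checkpoints, passes_per_model=50)  # 200 passes in total
    print(avg, "predicted class:", avg.index(max(avg)))
```

Averaging probabilities rather than logits keeps each pass on the same normalized scale, so no single pass with unusually large logits dominates the ensemble.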

Paper
