Catalogue Search | MBRL
111 result(s) for "Wei-Ning, Hsu"
On Generative Spoken Language Modeling from Raw Audio
by Bolte, Benjamin; Kharitonov, Eugene; Baevski, Alexei
in Acoustics; Automatic text generation; Computation and Language
2021
We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
Journal Article
Generative Spoken Dialogue Language Modeling
by Nguyen, Tu Anh; Kharitonov, Eugene; Tomasello, Paden
in Computation and Language; Computer Science; Conversation
2023
We introduce dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn taking compared to a text-based cascaded model.
Journal Article
Continual Learning for On-Device Speech Recognition using Disentangled Conformers
by Diwan, Anuj; Ching-Feng Yeh; Paden Tomasello
in Algorithms; Automatic speech recognition; Initiatives
2022
Automatic speech recognition research focuses on training and evaluating on static datasets. Yet, as speech models are increasingly deployed on personal devices, such models encounter user-specific distributional shifts. To simulate this real-world scenario, we introduce LibriContinual, a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks, with data corresponding to 118 individual speakers and 6 train splits per speaker of different sizes. Additionally, current speech recognition models and continual learning algorithms are not optimized to be compute-efficient. We adapt a general-purpose training algorithm, NetAug, for ASR and create a novel Conformer variant called the DisConformer (Disentangled Conformer). This algorithm produces ASR models consisting of a frozen 'core' network for general-purpose use and several tunable 'augment' networks for speaker-specific tuning. Using such models, we propose a novel compute-efficient continual learning algorithm called DisentangledCL. Our experiments show that the DisConformer models significantly outperform baselines on general ASR, i.e., LibriSpeech (15.58% rel. WER on test-other). On speaker-specific LibriContinual they significantly outperform trainable-parameter-matched baselines (by 20.65% rel. WER on test) and even match fully finetuned baselines in some settings.
Speech Processing with Less Supervision: Learning from Weak Labels and Multiple Modalities
2020
In recent years, supervised learning has achieved great success in speech processing with powerful neural network models and vast quantities of in-domain labeled data. However, collecting a labeled dataset covering all domains can be either expensive due to the diversity of speech or almost impossible for some tasks such as speech-to-speech translation. Such a paradigm limits the applicability of speech technologies to high-resource settings. In sharp contrast, humans are good at reading the training signals from indirect supervision, such as from a small amount of explicit labels and from different modalities. This capability enables humans to learn from a wider variety of resources, including better domain coverage. In light of this observation, this thesis focuses on learning algorithms for speech processing that can utilize weak and indirect supervision to overcome the restrictions imposed by the supervised paradigm and make the most out of the data at hand for learning.
In the first part of the thesis, we devise a self-training algorithm for speech recognition that distills knowledge from a trained language model, a compact form of external non-speech prior knowledge. The algorithm is inspired by how humans use contextual and prior information to bias speech recognition and produce confident predictions. To distill knowledge within the language model, we implement a beam-search-based objective to align the prediction probability with the likelihood of the language model among candidate hypotheses. Experimental results demonstrate state-of-the-art performance that recovers word error rates by up to 90% relative to using the same data with ground truth transcripts. Moreover, we show that the proposed algorithm can scale to 60,000 hours of unlabeled speech and yield further reduction in word error rates.
In the second part of the thesis, we present several text-to-speech synthesis models that enable fine-grained control of unlabeled non-textual attributes, including voice, prosody, acoustic environment properties and microphone channel effects. We achieve controllability of unlabeled attributes by formulating a text-to-speech system as a generative model with structured latent variables, and learn this generative process along with an efficient approximate inference model by adopting the variational autoencoder framework. We demonstrate that those latent variables can then be used to control the unlabeled variations in speech, making it possible to build a high-quality speech synthesis model using weakly-labeled mixed-quality speech data as the model learns to control the hidden factors.
In the last part of the thesis, we extend a cross-modal semantic embedding learning framework proposed in Harwath et al. (2019) to learn hierarchical discrete linguistic units from visually grounded speech, a form of multimodal sensory data. By utilizing a discriminative, multimodal grounding objective, the proposed framework forces the learned units to be useful for semantic image retrieval. In contrast, most of the previous work on linguistic unit discovery does not use multimodal data; it considers a reconstruction objective that encourages the learned units to be useful for reconstructing the speech, and hence those units may also encode non-linguistic factors. Experimental results show that the proposed framework outperforms state-of-the-art phonetic unit discovery frameworks by almost 50% on the ZeroSpeech 2019 ABX phone discriminative task, and learns word detectors that discover over 270 words with an F1 score of greater than 0.5. In addition, the learned units from the proposed framework are also more robust to nuisance variation compared to frameworks that learn from only speech.
Dissertation
Unsupervised Learning of Disentangled Representations for Speech with Neural Variational Inference Models
2018
Despite recent successes in machine learning, artificial intelligence is still far from matching human intelligence in many ways. Two important aspects are transferability and amount of supervision required. Take speech recognition for example: while humans can easily adapt to a new accent without explicit supervision (i.e., ground truth transcripts for speech of a new accent), current machine learning techniques still struggle with such a scenario. We argue that an essential component of human learning is unsupervised or weakly supervised representation learning, which transforms input signals to low dimensional representations that facilitate subsequent structured learning and knowledge acquisition.
In this thesis, we develop unsupervised representation learning frameworks for speech data. We start with investigating an existing variational autoencoder (VAE) model for learning latent representations, and derive novel latent space operations for speech transformation. The transformation method is applied to unsupervised domain adaptation problems, which addresses the transferability issues of supervised machine learning frameworks. We then extend the VAE models, and propose a novel factorized hierarchical variational autoencoder (FHVAE), which better models a generative process of sequential data, and learns not only disentangled, but also interpretable latent representations without any supervision. By leveraging the interpretability, we demonstrate that such representations can be applied to a wide range of tasks, including but not limited to: voice conversion, denoising, speaker verification, speaker invariant phonetic feature extraction, and noise invariant phonetic feature extraction. In the last part of this thesis, we examine scalability issues regarding the original FHVAE training algorithm in terms of runtime, memory, and optimization stability. Based on our analysis, we propose a hierarchical sampling algorithm for training, which enables training of FHVAE models on arbitrarily large datasets.
Dissertation
Who killed Cock Robin
2018
In this delicately scripted gut-wrenching psycho-thriller from award-winning filmmaker Cheng Wei-Hao (The Tag-Along), an ambitious journalist who witnessed a hit-and-run years ago reboots his investigation led by newly emerged clues. As he beats the clock to save the only survivor after her sudden disappearance, layers of unimaginable dark truths around a corrupted system start peeling.
Streaming Video
u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality
2022
While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled audio-visual data and the cost to deploy one model per modality. In this paper, we present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech with a unified masked cluster prediction objective. By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par with or better than the state-of-the-art modality-specific models. Moreover, our model fine-tuned only on audio can perform well with audio-visual and visual speech input, achieving zero-shot modality generalization for multiple speech processing tasks. In particular, our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input. Code and models are available at https://github.com/facebookresearch/av_hubert
Cocktail HuBERT: Generalized Self-Supervised Pre-training for Mixture and Single-Source Speech
2023
Self-supervised learning leverages unlabeled data effectively, improving label efficiency and generalization to domains without labeled data. While recent work has studied generalization to more acoustic/linguistic domains, languages, and modalities, these investigations are limited to single-source speech with one primary speaker in the recording. This paper presents Cocktail HuBERT, a self-supervised learning framework that generalizes to mixture speech using a masked pseudo source separation objective. This objective encourages the model to identify the number of sources, separate and understand the context, and infer the content of masked regions represented as discovered units. Cocktail HuBERT outperforms state-of-the-art results with 69% lower WER on multi-speaker ASR, 31% lower DER on diarization, and is competitive on single- and multi-speaker tasks from SUPERB.
Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT
by Bowen, Shi; Wei-Ning, Hsu; Abdelrahman, Mohamed
in Audio data; Audio equipment; Audio visual equipment
2022
This paper investigates self-supervised pre-training for audio-visual speaker representation learning where a visual stream showing the speaker's mouth area is used alongside speech as inputs. Our study focuses on the Audio-Visual Hidden Unit BERT (AV-HuBERT) approach, a recently developed general-purpose audio-visual speech pre-training framework. We conducted extensive experiments probing the effectiveness of pre-training and visual modality. Experimental results suggest that AV-HuBERT generalizes decently to speaker-related downstream tasks, improving label efficiency by roughly tenfold for both audio-only and audio-visual speaker verification. We also show that incorporating visual information, even just the lip area, greatly improves the performance and noise robustness, reducing EER by 38% in the clean condition and 75% in noisy conditions.
Robust Self-Supervised Audio-Visual Speech Recognition
by Bowen, Shi; Wei-Ning, Hsu; Abdelrahman, Mohamed
in Audio data; Audio equipment; Audio visual equipment
2022
Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with the visual information that is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence the progress was hindered by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest available AVSR benchmark dataset LRS3, our approach outperforms prior state-of-the-art by ~50% (28.0% vs. 14.1%) using less than 10% of labeled data (433hr vs. 30hr) in the presence of babble noise, while reducing the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
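The relative improvements quoted in the abstract above can be checked from the WER pairs it gives. The following is a minimal sketch, assuming that "relative" means (baseline - improved) / baseline; the function name `relative_reduction` is hypothetical and does not come from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction from baseline to improved, in percent.

    Assumes the convention (baseline - improved) / baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# WER pairs (in percent) as quoted in the abstract.
babble = relative_reduction(28.0, 14.1)      # prior state of the art vs. proposed
audio_only = relative_reduction(25.8, 5.8)   # audio-based model vs. proposed

print(f"babble noise: {babble:.1f}% relative WER reduction")
print(f"audio-only baseline: {audio_only:.1f}% relative WER reduction")
```

Under this convention the two pairs work out to about 49.6% and 77.5%, consistent with the "~50%" and "over 75%" stated in the abstract.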