Catalogue Search | MBRL

Biologically-Inspired Spike-Based Automatic Speech Recognition of Isolated Digits Over a Reproducing Kernel Hilbert Space

by Li, Kan; Príncipe, José C. in Acoustics, Algorithms, Firing pattern

2018

This paper presents a novel real-time dynamic framework for quantifying time-series structure in spoken words using spikes. Audio signals are converted into multi-channel spike trains using a biologically-inspired leaky integrate-and-fire (LIF) spike generator. These spike trains are mapped into a function space of infinite dimension, i.e., a Reproducing Kernel Hilbert Space (RKHS), using point-process kernels, where a state-space model learns the dynamics of the multidimensional spike input using gradient descent learning. This kernelized recurrent system is very parsimonious and achieves the necessary memory depth via feedback of its internal states when trained discriminatively, utilizing the full context of the phoneme sequence. A main advantage of modeling nonlinear dynamics using state-space trajectories in the RKHS is that it imposes no restriction on the relationship between the exogenous input and its internal state. We are free to choose the input representation with an appropriate kernel, and changing the kernel affects neither the system nor the learning algorithm. Moreover, we show that this novel framework can outperform both traditional hidden Markov model (HMM) speech processing and neuromorphic implementations based on spiking neural networks (SNNs), yielding accurate and ultra-low power word spotters. As a proof of concept, we demonstrate its capabilities using the benchmark TI-46 digit corpus for isolated-word automatic speech recognition (ASR) or keyword spotting. Compared to an HMM using a Mel-frequency cepstral coefficient (MFCC) front-end without time-derivatives, our MFCC-KAARMA offered improved performance. For the spike-train front-end, spike-KAARMA also outperformed state-of-the-art SNN solutions. Furthermore, compared to MFCCs, spike trains provided enhanced noise robustness in certain low signal-to-noise ratio (SNR) regimes.
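As a rough illustration of the leaky integrate-and-fire step mentioned in the abstract above, the following minimal sketch converts one channel of a sampled signal into spike times. It is not the authors' generator: the function name `lif_spike_times`, the forward-Euler update, and the parameters `tau`, `threshold`, and `gain` are all assumptions chosen for illustration, and a multi-channel front-end would presumably apply something like this to each band of a filterbank.

```python
import numpy as np


def lif_spike_times(signal, fs, tau=0.01, threshold=1.0, gain=1.0):
    """Return spike times (seconds) from a single-channel LIF neuron.

    Hypothetical sketch: membrane potential v follows
    tau * dv/dt = -v + gain * x(t), integrated with forward Euler.
    A spike is emitted and v is reset to 0 when v reaches the threshold.
    """
    dt = 1.0 / fs
    v = 0.0
    spikes = []
    for n, x in enumerate(signal):
        # Leak toward zero while integrating the scaled input.
        v += (dt / tau) * (-v + gain * x)
        if v >= threshold:
            spikes.append(n * dt)
            v = 0.0  # reset after firing
    return np.array(spikes)


if __name__ == "__main__":
    fs = 16000
    # A constant drive above threshold produces a regular spike train.
    drive = np.full(fs // 10, 2.0)
    print(len(lif_spike_times(drive, fs)), "spikes in 0.1 s")
```

With these assumed settings, an input that settles below the threshold never fires, while a stronger input fires at a rate that grows with its amplitude, which is the property that lets spike timing carry signal intensity.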

Journal Article

Enhanced Automatic Speech Recognition System Based on Enhancing Power-Normalized Cepstral Coefficients

by Gouda, Ahmed; Khedr, Mohamed; Tamazin, Mohamed in Bias, feature extraction, Fourier transforms

2019

Many new consumer applications are based on the use of automatic speech recognition (ASR) systems, such as voice command interfaces, speech-to-text applications, and data entry processes. Although ASR systems have remarkably improved in recent decades, the speech recognition system performance still significantly degrades in the presence of noisy environments. Developing a robust ASR system that can work in real-world noise and other acoustic distorting conditions is an attractive research topic. Many advanced algorithms have been developed in the literature to deal with this problem; most of these algorithms are based on modeling the behavior of the human auditory system with perceived noisy speech. In this research, the power-normalized cepstral coefficient (PNCC) system is modified to increase robustness against the different types of environmental noises, where a new technique based on gammatone channel filtering combined with channel bias minimization is used to suppress the noise effects. The TIDIGITS database is utilized to evaluate the performance of the proposed system in comparison to the state-of-the-art techniques in the presence of additive white Gaussian noise (AWGN) and seven different types of environmental noises. In this research, one word is recognized from a set containing 11 possibilities only. The experimental results showed that the proposed method provides significant improvements in the recognition accuracy at low signal to noise ratios (SNR). In the case of subway noise at SNR = 5 dB, the proposed method outperforms the mel-frequency cepstral coefficient (MFCC) and relative spectral (RASTA)–perceptual linear predictive (PLP) methods by 55% and 47%, respectively. Moreover, the recognition rate of the proposed method is higher than the gammatone frequency cepstral coefficient (GFCC) and PNCC methods in the case of car noise. 
The recognition rate is improved by 40% in comparison to the GFCC method at SNR = 0 dB, and by 20% in comparison to the PNCC method at SNR = −5 dB.

Journal Article

Modeling State-Conditional Observation Distribution Using Weighted Stereo Samples for Factorial Speech Processing Models

by Homayounpour, Mohammad Mehdi; Khademian, Mahdi in Auroras, Circuits and Systems, Compensation

2017

This paper investigates the effectiveness of factorial speech processing models in noise-robust automatic speech recognition tasks. For this purpose, the paper proposes an idealistic approach for modeling the state-conditional observation distribution of factorial models based on weighted stereo samples. This approach extends the earlier single-pass retraining method for ideal model compensation to support multiple audio sources. Non-stationary noises can be considered as one of these audio sources with multiple states. Experiments on set A of the Aurora 2 dataset show that recognition performance can be improved by this consideration. The improvement is significant in low signal-to-noise conditions, up to 4% absolute word recognition accuracy. In addition to the power of the proposed method in accurately representing the state-conditional observation distribution, it has an important advantage over previous methods: it provides the opportunity to select feature spaces independently for the source and corrupted features. This opens a new window for seeking better feature spaces appropriate for noisy speech, independent of clean speech features.

Journal Article

Deep Learning Based Array Processing for Speech Separation, Localization, and Recognition

by Wang, Zhong-Qiu in Computer Engineering, Computer science

2020

Microphone arrays are widely deployed in modern speech communication systems. With multiple microphones, spatial information is available in addition to spectral cues to improve speech enhancement, speaker separation and robust automatic speech recognition (ASR) in noisy-reverberant environments. Conventionally, multi-microphone beamforming followed by monaural post-filtering is the dominant approach for multi-channel speech enhancement. This approach requires an accurate estimate of target direction, and power spectral density and covariance matrices of speech and noise. Such estimation algorithms usually cannot achieve satisfactory accuracy in noisy and reverberant conditions. Recently, riding on the development of deep neural networks (DNN), time-frequency (T-F) masking and spectral mapping based approaches have been established as the mainstream methodology for monaural (single-channel) speech separation, including speech enhancement and speaker separation. This dissertation investigates deep learning based microphone array processing and its application to speech separation and localization, and robust ASR. We start our work by exploring various ways of integrating speech enhancement and acoustic modeling for single-channel robust ASR. We propose a training framework that jointly trains enhancement frontends, filterbanks and backend acoustic models. We also apply sequence-discriminative training for sequence modeling and run-time unsupervised adaptation to deal with training and testing mismatches. One essential aspect of multi-channel processing is sound localization. We utilize deep learning based T-F masking to identify T-F units dominated by target speaker and only use these T-F units for speaker localization, as they contain much cleaner phases that are informative for localization. 
This approach dramatically improves the robustness of conventional cross-correlation, beamforming and subspace based approaches for speaker localization in noisy-reverberant environments. Building upon speaker localization, we next tightly integrate complementary spectral and spatial cues for deep learning based multi-channel speaker separation in reverberant environments. The key idea is to localize individual speakers and use the localization results to design spatial features that can indicate whether each T-F unit is dominated by the speech arriving from the estimated speaker direction. The spatial features are combined with spectral features in an enhancement network to extract the speaker from an estimated direction and with trained spectral structure. Strong separation performance has been observed on reverberant talker-independent speaker separation tasks. Before addressing multi-channel speech enhancement, we explore various magnitude based phase reconstruction algorithms for monaural speaker separation. We also study complex spectral mapping based phase estimation, where we directly predict the real and imaginary components of target speech. We find that deep learning based magnitude estimates clearly benefit phase reconstruction, and complex spectral mapping leads to better phase estimation. We then apply complex spectral mapping to multi-channel speech dereverberation and enhancement, where phase estimation is used to improve sound localization, time-invariant and time-varying beamforming, and post-filtering. State-of-the-art performance has been obtained on the enhancement and recognition tasks of the REVERB corpus and the CHiME-4 dataset. Finally, for fixed-geometry arrays, we propose multi-microphone complex spectral mapping for speech dereverberation, where DNNs are used for time-varying non-linear beamforming. 
We find that concatenating multiple microphone signals for complex spectral mapping is a simple and effective way of integrating spectral and spatial information for fixed-geometry arrays.

Dissertation

A Bayesian view on acoustic model-based techniques for robust speech recognition

by Sehr, Armin; Maas, Roland; Huemmer, Christian in Approximation, Bayesian analysis, Decoding

2015

This article provides a unifying Bayesian view on various approaches for acoustic model adaptation, missing feature, and uncertainty decoding that are well-known in the literature of robust automatic speech recognition. The representatives of these classes can often be deduced from a Bayesian network that extends the conventional hidden Markov models used in speech recognition. These extensions, in turn, can in many cases be motivated from an underlying observation model that relates clean and distorted feature vectors. By identifying and converting the observation models into a Bayesian network representation, we formulate the corresponding compensation rules. We thus summarize the various approaches as approximations or modifications of the same Bayesian decoding rule leading to a unified view on known derivations as well as to new formulations for certain approaches.

Journal Article

Noise robust automatic speech recognition: review and analysis

by Dua, Mohit; Akanksha; Dua, Shelza in Acknowledgment, Acoustics, Artificial Intelligence

2023

Automatic speech recognition (ASR) is an emerging technology used in various fields such as robotics, traffic control, and healthcare. The leading cause of ASR performance degradation is mismatch between the training and testing environments. The main reason for this mismatch is the presence of noise during the testing phase of an ASR system. Researchers have used various techniques in the front-end and backend phases of ASR to detect and handle the noise. However, very few review papers have used noise as a criterion for comparing existing research works. Hence, the objective of this survey is to analyze and review all the effective methods proposed by different scientists and researchers to boost the noise robustness of an ASR system. First, the paper discusses the basic architecture of an ASR system, the factors affecting its performance, and the noise problem formulation. Second, the work analyzes existing state-of-the-art noise-robust ASR methods in terms of front-end feature extraction techniques and backend classification models. Then, a detailed review of the various speech databases used by these methods is given. Finally, an analysis of the performance metrics of all these noise-resistant ASR techniques is presented. While presenting the analysis, the paper also discusses various feature extraction techniques, backend classification methods, speech databases, and performance metrics in detail. The paper also discusses the existing challenges and describes future research directions in the area of building noise-resistant ASR systems.

Journal Article

An ensemble of optimal smoothing and minima controlled through iterative averaging for speech enhancement under uncontrolled environment

by B, Dinesh Rao; C B, Chandrakala; G P, Raghudathesh in Acknowledgment, Algorithms, Automatic speech recognition

2025

Although substantial progress has been made in the area of speech enhancement, significant performance degradation still occurs under highly non-stationary noisy conditions. These conditions have a detrimental impact on the performance of speech processing applications such as automatic speech recognition, speech encoding, speaker verification, speaker identification, and speaker recognition. Therefore, in this work, a robust noise estimation technique is proposed for speech enhancement under highly non-stationary noisy scenarios. The proposed work introduces an optimal smoothing and minima controlled (OSMC) method, based on iterative averaging, for noise estimation. First, the smoothed power spectrum of the degraded speech is computed, and its minima are tracked by continuously averaging past spectral values. Then, to determine speech activity in each frequency bin, the ratio of the degraded speech spectrum to its local minimum is considered, and a Bayes minimum-cost rule is applied for the decision. Finally, the noise spectrum is estimated using time-frequency dependent smoothing factors, which depend mainly on the estimated probability of speech presence. The experiments are conducted on the NOIZEUS and Kannada speech databases. The results demonstrate that the proposed OSMC technique yields better speech quality and intelligibility than existing algorithms under highly non-stationary noisy conditions.
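The three steps in the abstract above (smoothing, minimum tracking with a ratio test, and presence-dependent noise updating) follow the general shape of minima-controlled recursive averaging. The sketch below shows that generic scheme only, not the OSMC method itself: the function name `minima_controlled_noise`, the fixed threshold `delta` standing in for the Bayes minimum-cost decision, and all smoothing constants are assumptions made for illustration.

```python
import numpy as np


def minima_controlled_noise(power_spec, alpha_s=0.8, alpha_d=0.95,
                            alpha_p=0.2, delta=5.0, window=50):
    """Track a noise power estimate for each frame and frequency bin.

    Hypothetical sketch. power_spec has shape (frames, bins) and holds
    the power spectrum |Y|^2 of the degraded speech.
    """
    power_spec = np.asarray(power_spec, dtype=float)
    n_frames, n_bins = power_spec.shape
    smoothed = power_spec[0].copy()
    local_min = power_spec[0].copy()
    tmp_min = power_spec[0].copy()
    presence = np.zeros(n_bins)
    noise = power_spec[0].copy()
    out = np.empty_like(power_spec)
    out[0] = noise
    for t in range(1, n_frames):
        # Step 1: recursively smooth the power spectrum over time.
        smoothed = alpha_s * smoothed + (1 - alpha_s) * power_spec[t]
        # Step 2: track the local minimum, restarting every `window` frames.
        if t % window == 0:
            local_min = np.minimum(tmp_min, smoothed)
            tmp_min = smoothed.copy()
        else:
            local_min = np.minimum(local_min, smoothed)
            tmp_min = np.minimum(tmp_min, smoothed)
        # Ratio test: a large ratio to the local minimum suggests speech.
        ratio = smoothed / np.maximum(local_min, 1e-12)
        speech_flag = (ratio > delta).astype(float)
        presence = alpha_p * presence + (1 - alpha_p) * speech_flag
        # Step 3: update the noise slowly where speech is likely present.
        alpha_tf = alpha_d + (1 - alpha_d) * presence
        noise = alpha_tf * noise + (1 - alpha_tf) * power_spec[t]
        out[t] = noise
    return out
```

The design point is that the smoothing factor approaches 1 where speech is likely, so a loud burst barely moves the noise estimate, while speech-free bins keep adapting to changes in the background.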

Journal Article

PRL-DAS: Robust Heliox Speech Recognition for Unaligned Low-Resource Data

by Zhang, Shibing; Chen, Yonghong; Wen, Wanzhi in Acoustics, Adaptation, Analysis

2026

Speech produced in helium–oxygen (heliox) environments in deep saturation diving exhibits pronounced spectral shifts and temporal distortions, which severely degrade automatic speech recognition (ASR) systems trained on normal-air corpora. Existing studies often adopt a restoration-then-recognition paradigm by training waveform mapping networks on paired heliox/air recordings. However, in realistic low-resource data collection, paired recordings are typically obtained by independent re-reading and are therefore not strictly time-aligned, which makes regression-style restoration more sensitive to pairing errors and increases the risk of front-end distortions. This paper proposes a robust recognition framework for heliox speech, termed PRL-DAS (Physics-informed Resampling and LoRA with Duration-Adaptive Speed). The framework consists of a physics-inspired linear resampling warm start (PhysSpeed), parameter-efficient Low-Rank Adaptation (LoRA), and duration-adaptive speed (DAS) inference enhancement. Specifically, we first apply physics-motivated linear resampling as a coarse warm start, and then perform mixed-domain LoRA fine-tuning of a Whisper foundation model to absorb residual non-linear differences. On a corpus of 1048 paired Chinese heliox utterances under leave-one-speaker-out (LOSO) evaluation, using Whisper-Medium as the base model, PhysSpeed followed by mixed-domain LoRA reduces the overall character error rate (CER) from 49.33% with PhysSpeed preprocessing only to 25.79%, while also improving performance on the normal domain. Furthermore, the full PRL-DAS framework applies Soft-DAS, a lightweight smooth schedule motivated by duration-dependent variation in the optimal resampling factor, and further reduces the overall CER to 24.37% without additional training cost.

Journal Article

Adversarial joint training with self-attention mechanism for robust end-to-end speech recognition

by Shi Yuchen; Rigoll Gerhard; Watzel Tobias in Acknowledgment, Attention, Automatic speech recognition

2021

Lately, the self-attention mechanism has marked a new milestone in the field of automatic speech recognition (ASR). Nevertheless, its performance is susceptible to environmental intrusions as the system predicts the next output symbol depending on the full input sequence and the previous predictions. A popular solution for this problem is adding an independent speech enhancement module as the front-end. Nonetheless, due to being trained separately from the ASR module, the independent enhancement front-end falls into the sub-optimum easily. Besides, the handcrafted loss function of the enhancement module tends to introduce unseen distortions, which even degrade the ASR performance. Inspired by the extensive applications of the generative adversarial networks (GANs) in speech enhancement and ASR tasks, we propose an adversarial joint training framework with the self-attention mechanism to boost the noise robustness of the ASR system. Generally, it consists of a self-attention speech enhancement GAN and a self-attention end-to-end ASR model. There are two advantages which are worth noting in this proposed framework. One is that it benefits from the advancement of both self-attention mechanism and GANs, while the other is that the discriminator of GAN plays the role of the global discriminant network in the stage of the adversarial joint training, which guides the enhancement front-end to capture more compatible structures for the subsequent ASR module and thereby offsets the limitation of the separate training and handcrafted loss functions. With the adversarial joint optimization, the proposed framework is expected to learn more robust representations suitable for the ASR task. 
We conduct systematic experiments on the AISHELL-1 corpus, and the experimental results show that on the artificial noisy test set, the proposed framework achieves relative improvements of 66% compared to the ASR model trained solely on clean data, 35.1% compared to the speech enhancement and ASR scheme without joint training, and 5.3% compared to multi-condition training.

Journal Article

Systematic Annotation Framework for Robust Speech Recognition

by Guo, Yuanbo; Chen, Yongqing; Xie, Xia in Acoustics, Annotations, automatic speech recognition

2026

This study proposes a systematic annotation framework to improve the robustness of end-to-end automatic speech recognition (ASR) in a complex low-resource dialect setting, using Hainan Lingao dialect as a case study. The framework consists of three components: semantically complete utterance segmentation instead of fixed-duration clipping; structured annotation at the lexical, sentence, and pragmatic-behavior levels, including explicit tags for dialectal variation, environmental noise, and unintelligible speech as well as rules for handling overlapping speech; and a three-stage quality-assurance workflow with iterative guideline refinement. The framework was implemented in the construction of a Hainan Lingao dialect corpus from 16 speakers and evaluated using 80 h/10 h/10 h training, validation, and test splits under an identical Conformer-based ASR configuration. Compared with a plain-transcription baseline using no special tags and fixed 3 s segmentation, the full specification reduced character error rate (CER) from 8.7% to 7.9%, 24.3% to 18.5%, 19.5% to 15.2%, and 15.2% to 13.1% on clean, noisy, dialogue, and dialect-variation test sets, respectively. The corresponding sentence error rate (SER) decreased from 17.5% to 15.2%, 39.6% to 32.1%, 34.2% to 27.8%, and 28.3% to 24.5%. Ablation experiments further examined the individual contributions of pragmatic-behavior tags, noise tags, semantic segmentation, and dialect-feature annotation. Paired bootstrap testing with 10,000 resamples showed that all baseline-to-full-specification improvements were statistically significant (p < 0.01). These results indicate that systematic annotation can improve ASR robustness in this Lingao low-resource dialect setting, with the largest relative CER reductions observed in the noisy (23.7%) and dialogue (22.1%) scenarios.
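The abstract above reports both absolute error rates and relative reductions. For readers unfamiliar with the distinction, this small sketch shows how a relative reduction is usually computed from a before/after pair. The helper name `relative_reduction` is hypothetical, and percentages recomputed from the rounded rates quoted in the abstract may differ slightly from the published ones if the authors worked from unrounded values.

```python
def relative_reduction(before, after):
    """Relative reduction in percent when an error rate drops from before to after."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


if __name__ == "__main__":
    # Dialogue test set: CER of 19.5% (baseline) versus 15.2% (full specification).
    print(round(relative_reduction(19.5, 15.2), 1))
```

An absolute drop of 4.3 points on a 19.5% baseline is therefore a relative reduction of roughly 22%, which is why the relative figures look much larger than the absolute differences.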

Journal Article
