Catalogue Search | MBRL
Explore the vast range of titles available.
378 result(s) for "Vocoder"
Research on Speech Synthesis Technology Based on Rhythm Embedding
2020
In recent years, Text-To-Speech (TTS) technology has developed rapidly. Attention has increasingly turned to narrowing the gap between synthetic and real speech, in the hope that synthesized speech can carry real rhythm. This thesis proposes a rhythmic feature embedding method for TTS built on the Tacotron2 model, which has risen to prominence in the field in recent years. First, rhythmic features are extracted with the WORLD vocoder, reducing redundant information in the rhythmic representation. Then, rhythmic feature fusion based on a Variational Auto-Encoder (VAE) network enhances the rhythmic information. Experiments are carried out on the LJSpeech-1.0 dataset, and the synthesized speech is evaluated both subjectively and objectively. Compared with the reference literature, the subjective blind listening test (ABX) score increased by 25%, while the objective Mel-Cepstral Distortion (MCD) value declined to 12.77.
Journal Article
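The rhythmic-feature extraction step described above can be sketched with the WORLD analysis pipeline. A minimal example, assuming the pyworld, soundfile, and numpy packages; the file name and the choice of per-frame prosody features are illustrative, not the thesis's exact setup:

```python
import numpy as np
import pyworld as pw
import soundfile as sf

# Load a mono recording (LJSpeech-1.0 clips are 22.05 kHz WAVs).
x, fs = sf.read("LJ001-0001.wav")  # hypothetical file name
x = np.ascontiguousarray(x, dtype=np.float64)

# WORLD analysis: F0 trajectory, spectral envelope, band aperiodicity.
f0, t = pw.harvest(x, fs)         # F0 contour (Hz), frame times (s)
sp = pw.cheaptrick(x, f0, t, fs)  # smoothed spectral envelope
ap = pw.d4c(x, f0, t, fs)         # aperiodicity

# A compact per-frame rhythm vector: log-F0, log-energy, voicing flag.
voiced = (f0 > 0).astype(np.float64)
log_f0 = np.where(voiced > 0, np.log(np.maximum(f0, 1e-8)), 0.0)
log_energy = np.log(sp.mean(axis=1) + 1e-8)
prosody = np.stack([log_f0, log_energy, voiced], axis=1)  # (frames, 3)
```

In the thesis's design, a VAE would then compress such frame-level features into the rhythm embedding fused with the Tacotron2 encoder.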
Pseudo-Cepstrum: Pitch Modification for Mel-Based Neural Vocoders
2025
This paper introduces a cepstrum-based pitch modification method that can be applied to any mel-spectrogram representation. As a result, this method is compatible with any mel-based vocoder without requiring any additional training or changes to the model. This is achieved by directly modifying the cepstrum feature space in order to shift the harmonic structure to the desired target. The spectrogram magnitude is computed via the pseudo-inverse mel transform, then converted to the cepstrum by applying DCT. In this domain, the cepstral peak is shifted without having to estimate its position and the modified mel is recomputed by applying IDCT and mel-filterbank. These pitch-shifted mel-spectrogram features can be converted to speech with any compatible vocoder. The proposed method is validated experimentally with objective and subjective metrics on various state-of-the-art neural vocoders as well as in comparison with traditional pitch modification methods.
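The pipeline in this abstract (pseudo-inverse mel transform, log-spectrum, DCT, cepstral shift, IDCT, mel filterbank) can be sketched directly. A minimal Python version, assuming librosa, scipy, and numpy; the STFT/mel parameters are assumptions, and rolling the whole cepstrum is a crude stand-in for the paper's targeted shift of the cepstral peak:

```python
import numpy as np
import librosa
from scipy.fft import dct, idct

def shift_pitch_mel(mel_spec, shift_bins, sr=22050, n_fft=1024, n_mels=80):
    """Shift the harmonic structure of a magnitude mel-spectrogram
    (n_mels x frames) through the cepstral domain. Sketch only;
    constants are assumptions."""
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    # Pseudo-inverse mel transform: approximate linear magnitude spectrum.
    spec = np.maximum(np.linalg.pinv(mel_fb) @ mel_spec, 1e-10)
    # Cepstrum of the log-magnitude spectrum, per frame.
    ceps = dct(np.log(spec), axis=0, norm="ortho")
    # Shift along the cepstral axis to move the harmonic (pitch) peak;
    # the paper shifts only the peak region, rolling everything is cruder.
    ceps = np.roll(ceps, shift_bins, axis=0)
    # Back to log-spectrum, then re-apply the mel filterbank.
    spec_shifted = np.exp(idct(ceps, axis=0, norm="ortho"))
    return mel_fb @ spec_shifted
```

The returned mel-spectrogram could then be passed to any compatible mel-based vocoder, as the abstract notes.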
Synthetic speech detection through short-term and long-term prediction traces
by
Tubaro, Stefano
,
Bestagini, Paolo
,
Borrelli, Clara
in
Algorithms
,
Deep learning
,
Feature extraction
2021
Several methods for synthetic audio speech generation have been developed in the literature over the years. With the great technological advances brought by deep learning, many novel synthetic speech techniques achieving remarkably realistic results have recently been proposed. As these methods generate convincing fake human voices, they can be used maliciously to harm society (e.g., people impersonation, fake news spreading, opinion formation). For this reason, the ability to detect whether a speech recording is synthetic or pristine is becoming an urgent necessity. In this work, we develop a synthetic speech detector. It takes an audio recording as input, extracts a series of hand-crafted features motivated by the speech-processing literature, and classifies them in either a closed-set or an open-set setup. The proposed detector is validated on a publicly available dataset consisting of 17 synthetic speech generation algorithms, ranging from old-fashioned vocoders to modern deep learning solutions. Results show that the proposed method outperforms recently proposed detectors in the forensics literature.
Journal Article
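The paper's exact prediction-trace features are not reproduced here, but the idea of a short-term prediction residual can be sketched with LPC analysis. A toy Python example, assuming librosa, scipy, scikit-learn, and numpy; frame sizes, LPC order, and the summary statistics are all assumptions, not the paper's feature set:

```python
import numpy as np
import librosa
from scipy.signal import lfilter
from sklearn.svm import SVC

def short_term_residual_energy(y, order=12, frame=1024, hop=512):
    """Frame-wise log-energy of the LPC prediction residual: one plausible
    short-term prediction trace (illustrative stand-in)."""
    feats = []
    for start in range(0, len(y) - frame, hop):
        w = y[start:start + frame]
        a = librosa.lpc(w, order=order)  # LPC coefficients, a[0] == 1
        resid = lfilter(a, [1.0], w)     # prediction error e[n] = A(z) s[n]
        feats.append(np.log(np.mean(resid ** 2) + 1e-12))
    return np.asarray(feats)

def summarize(y):
    e = short_term_residual_energy(y)
    return [e.mean(), e.std(), e.min(), e.max()]

# Closed-set use (toy): X_train is a list of recordings, y_train their
# labels (0 = pristine, 1 = synthetic).
# clf = SVC().fit([summarize(y) for y in X_train], y_train)
```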
Neural adaptations to temporal cues degradation in early blind: insights from envelope and fine structure vocoding
by
Kyong, Jeong-Sug
,
Shim, Hyun Joon
,
Won, Jong Ho
in
N2 and P3b
,
speech intelligibility
,
temporal degradation
2025
In our previous study, early-blind individuals had better speech recognition than sighted individuals, even when spectral cues were degraded using noise-vocoders. This study therefore investigated the impact of temporal envelope degradation and temporal fine structure (TFS) degradation on vocoded speech recognition and cortical auditory responses in early-blind individuals compared to sighted individuals. The study included 20 early-blind subjects (31.20 ± 42.5 years, M:F = 11:9) and 20 age- and sex-matched sighted subjects. Monosyllabic words were processed using the Hilbert transform to separate the envelope and TFS, generating vocoders that included only one of these components. The amplitude modulation (AM) vocoder, which contained only the envelope component, had the low-pass filter's cutoff frequency for AM extraction set at 16, 50, and 500 Hz to control the amount of AM cue. The frequency modulation (FM) vocoders, which contained only the TFS component, were adjusted to include FM cues at 50%, 75%, and 100% by modulating the noise level. A two-way repeated measures ANOVA revealed that early-blind subjects outperformed sighted subjects across almost all AM- or FM-vocoded conditions (p < 0.01). Speech recognition in early-blind subjects declined more with increasing TFS degradation, as evidenced by a significant interaction between group and the degree of TFS degradation (p = 0.016). We also analyzed neural responses based on the semantic oddball paradigm using the N2 and P3b components, which occur 200–300 ms and 250–800 ms after stimulus onset, respectively. Significant correlations were observed between N2 and P3b amplitude/latency and behavioral accuracy (p < 0.05). This suggests that early-blind subjects may develop enhanced neural processing strategies for temporal cues. In particular, preserving TFS cues is considered important for the auditory rehabilitation of individuals with visual or auditory impairments.
Journal Article
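The Hilbert-transform separation of envelope and TFS that the study describes can be sketched as follows, assuming scipy and numpy; the 16/50/500 Hz AM cutoffs come from the abstract, while filter order and the noise carrier are assumptions:

```python
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

def envelope_tfs(x, fs, am_cutoff_hz=50.0):
    """Split a signal into temporal envelope and temporal fine structure
    via the Hilbert transform (sketch)."""
    analytic = hilbert(x)
    env = np.abs(analytic)                        # Hilbert envelope
    tfs = np.cos(np.unwrap(np.angle(analytic)))   # fine-structure carrier
    # Low-pass the envelope to limit AM cues (16/50/500 Hz in the study).
    sos = butter(4, am_cutoff_hz, btype="lowpass", fs=fs, output="sos")
    return sosfiltfilt(sos, env), tfs

def am_vocode(x, fs, cutoff):
    """Envelope-only (AM) stimulus: envelope re-imposed on a noise
    carrier (an assumption about the carrier, not the paper's spec)."""
    env, _ = envelope_tfs(x, fs, cutoff)
    noise = np.random.default_rng(0).standard_normal(len(x))
    return env * noise
```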
Effect of spectral degradation on speech intelligibility and cortical representation
by
Kyong, Jeong-Sug
,
Shim, Hyun Joon
,
Won, Jong Ho
in
event-related potential
,
N2 and P3b
,
spectral degradation
2024
Noise-vocoded speech has long been used to investigate how acoustic cues affect speech understanding. Studies indicate that reducing the number of spectral channel bands diminishes speech intelligibility. Although previous studies examined the channel-band effect using earlier event-related potential (ERP) components, such as P1, N1, and P2, a clear consensus remains elusive. Given our hypothesis that spectral degradation affects higher-order processing of speech understanding beyond mere perception, we aimed to objectively measure differences in higher-order abilities to discriminate or interpret meaning. Using an oddball paradigm with speech stimuli, we examined how neural signals, measured through the N2 and P3b components, correlate with the evaluation of speech stimuli as a function of the number of channel bands. In 20 young participants with normal hearing, we measured speech intelligibility and N2 and P3b responses using a one-syllable task paradigm with animal and non-animal stimuli across four vocoder conditions with 4, 8, 16, or 32 channel bands. Behavioral data from word repetition were clearly affected by the number of channel bands, and all pairs differed significantly (p < 0.001). We also observed significant effects of the number of channels on the peak amplitude [F(2.006, 38.117) = 9.077, p < 0.001] and peak latency [F(3, 57) = 26.642, p < 0.001] of the N2 component. Similarly, the P3b component showed significant main effects of the number of channel bands on the peak amplitude [F(2.231, 42.391) = 13.045, p < 0.001] and peak latency [F(3, 57) = 2.968, p = 0.039]. In summary, our findings provide compelling evidence that spectral channel bands profoundly influence cortical speech processing, as reflected in the N2 and P3b components, a higher-order cognitive process. We conclude that spectrally degraded one-syllable speech primarily affects cortical responses during semantic integration.
Journal Article
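The channel-band manipulation behind this and the previous entry is the classic noise vocoder. A minimal sketch, assuming scipy and numpy; band edges, filter order, and normalization are assumptions:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(x, fs, n_channels=8, lo=100.0, hi=8000.0):
    """Noise vocoder sketch: split speech into n_channels log-spaced
    bands, take each band's Hilbert envelope, and modulate band-limited
    noise with it."""
    edges = np.geomspace(lo, hi, n_channels + 1)
    rng = np.random.default_rng(0)
    out = np.zeros(len(x), dtype=np.float64)
    for k in range(n_channels):
        sos = butter(4, [edges[k], edges[k + 1]],
                     btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))                         # band envelope
        carrier = sosfiltfilt(sos, rng.standard_normal(len(x)))
        out += env * carrier                                # modulated noise
    return out / (np.max(np.abs(out)) + 1e-12)
```

Varying n_channels (4, 8, 16, 32 in this study; 2 or 6 in the gesture study below) controls the degree of spectral degradation.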
Continuous vocoder applied in deep neural network based voice conversion
by
Németh, Géza
,
Csapó, Tamás Gábor
,
Al-Radhi, Mohammed Salah
in
Acoustic noise
,
Artificial neural networks
,
Conversion
2019
In this paper, a novel vocoder is proposed for a Statistical Voice Conversion (SVC) framework using a deep neural network, in which multiple features from the speech of two speakers (source and target) are converted acoustically. Traditional conversion methods focus on the prosodic feature represented by the discontinuous fundamental frequency (F0) and the spectral envelope. Studies have shown that speech analysis/synthesis solutions play an important role in the overall quality of the converted voice. Recently, we proposed a new continuous vocoder, originally for statistical parametric speech synthesis, in which all parameters are continuous. This work therefore introduces a new method using a continuous F0 (contF0) in SVC to avoid the alignment errors that can occur between voiced and unvoiced segments and degrade the converted speech. Our contribution includes the following. (1) We integrate into the SVC framework the continuous vocoder, which provides an advanced model of the excitation signal, by converting its contF0, maximum voiced frequency, and spectral features. (2) We show that a feed-forward deep neural network (FF-DNN) using our vocoder yields high-quality conversion. (3) We apply a geometric approach to spectral subtraction (GA-SS) in the final stage of the proposed framework to improve the signal-to-noise ratio of the converted speech. Our experimental results, using two male speakers and one female speaker, show that the converted speech produced by the proposed SVC technique is similar to the target speaker and gives state-of-the-art performance as measured by objective evaluation and subjective listening tests.
Journal Article
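The continuous-F0 idea mentioned above (removing the voiced/unvoiced discontinuity in the F0 track) can be sketched as interpolation through unvoiced frames. A minimal numpy version; this is not the authors' exact algorithm:

```python
import numpy as np

def continuous_f0(f0):
    """Make a discontinuous F0 track continuous by interpolating through
    unvoiced (f0 == 0) frames, in the log domain, held constant at the
    track edges. Sketch of the contF0 idea only."""
    f0 = np.asarray(f0, dtype=np.float64)
    voiced = f0 > 0
    if not voiced.any():
        return f0
    idx = np.arange(len(f0))
    log_f0 = np.interp(idx, idx[voiced], np.log(f0[voiced]))
    return np.exp(log_f0)
```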
Investigations on speaker adaptation using a continuous vocoder within recurrent neural network based text-to-speech synthesis
by
Al-Radhi, Mohammed Salah
,
Mandeel, Ali Raheem
,
Csapó, Tamás Gábor
in
Adaptation
,
Artificial neural networks
,
English language
2023
This paper presents an investigation of speaker adaptation using a continuous vocoder for parametric text-to-speech (TTS) synthesis. For applications that demand low computational complexity, conventional vocoder-based statistical parametric speech synthesis can be preferable. While capable of remarkable naturalness, recent neural vocoders nonetheless fall short of the criteria for real-time synthesis. We investigate our earlier continuous vocoder, in which the excitation is characterized by two one-dimensional parameters: Maximum Voiced Frequency and continuous fundamental frequency (F0). We show that an average voice can be trained for deep neural network-based TTS using data from nine English speakers. We performed speaker adaptation experiments for each target speaker with 400 utterances (approximately 14 minutes). We found a clear enhancement in the quality and naturalness of the synthesized speech compared to our previous work when utilizing recurrent neural network topologies. According to the objective measures (Mel-Cepstral Distortion and F0 correlation), the quality of speaker adaptation using the Continuous Vocoder-based DNN-TTS is slightly better than the WORLD Vocoder-based baseline. The subjective MUSHRA-like test results also showed that our speaker adaptation technique is almost as natural as the WORLD vocoder when using Gated Recurrent Unit and Long Short-Term Memory networks. The proposed vocoder, being capable of real-time synthesis, can be used for applications that require fast synthesis.
Journal Article
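The Mel-Cepstral Distortion metric cited in this and other entries has a standard closed form. A minimal numpy sketch; frame alignment (e.g., by DTW) is assumed to have been done upstream:

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_syn):
    """MCD in dB between aligned mel-cepstral sequences of shape
    (frames, coeffs), with the 0th (energy) coefficient already dropped.
    Standard definition: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2),
    averaged over frames."""
    diff = c_ref - c_syn
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return per_frame.mean()
```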
Vocoder Simulations Explain Complex Pitch Perception Limitations Experienced by Cochlear Implant Users
2017
Pitch plays a crucial role in speech and music, but is highly degraded for people with cochlear implants, leading to severe communication challenges in noisy environments. Pitch is determined primarily by the first few spectrally resolved harmonics of a tone. In implants, access to this pitch is limited by poor spectral resolution, due to the limited number of channels and interactions between adjacent channels. Here we used noise-vocoder simulations to explore how many channels, and how little channel interaction, are required to elicit pitch. Results suggest that two to four times as many channels as current devices provide are needed, along with channel interactions reduced by an order of magnitude. These new constraints not only provide insights into the basic mechanisms of pitch coding in normal hearing but also suggest that spectrally based complex pitch is unlikely to be generated in implant users without significant changes in the method or site of stimulation.
Journal Article
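Channel interaction of the kind this simulation varies can be modeled crudely by leaking each channel's envelope into its neighbours. A toy numpy sketch; the leakage level and nearest-neighbour-only spread are assumptions, not the paper's model:

```python
import numpy as np

def smear_channel_envelopes(envs, interaction_db=-20.0):
    """Simulate channel interaction (current spread) by adding each
    channel's envelope to its neighbours at a fixed attenuation.
    envs: (channels, frames) array of band envelopes. Toy model only."""
    leak = 10.0 ** (interaction_db / 20.0)
    out = envs.astype(np.float64).copy()
    n = envs.shape[0]
    for k in range(n):
        if k > 0:
            out[k] += leak * envs[k - 1]
        if k < n - 1:
            out[k] += leak * envs[k + 1]
    return out
```

Raising interaction_db toward 0 dB smears the envelopes together, mimicking the reduced spectral resolution that limits pitch in current devices.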
Visual Context Enhanced: The Joint Contribution of Iconic Gestures and Visible Speech to Degraded Speech Comprehension
2017
Purpose: This study investigated whether and to what extent iconic co-speech gestures contribute to information from visible speech to enhance degraded speech comprehension at different levels of noise-vocoding. Previous studies of the contributions of these 2 visual articulators to speech comprehension have only been performed separately. Method: Twenty participants watched videos of an actress uttering an action verb and completed a free-recall task. The videos were presented in 3 speech conditions (2- band noise-vocoding, 6-band noise-vocoding, clear), 3 multimodal conditions (speech + lips blurred, speech + visible speech, speech + visible speech + gesture), and 2 visual-only conditions (visible speech, visible speech + gesture). Results: Accuracy levels were higher when both visual articulators were present compared with 1 or none. The enhancement effects of (a) visible speech, (b) gestural information on top of visible speech, and (c) both visible speech and iconic gestures were larger in 6-band than 2-band noise-vocoding or visual-only conditions. Gestural enhancement in 2-band noise-vocoding did not differ from gestural enhancement in visual-only conditions. Conclusions: When perceiving degraded speech in a visual context, listeners benefit more from having both visual articulators present compared with 1. This benefit was larger at 6-band than 2-band noise-vocoding, where listeners can benefit from both phonological cues from visible speech and semantic cues from iconic gestures to disambiguate speech.
Journal Article