Catalogue Search | MBRL
Explore the vast range of titles available.
1,020 results for "Vocal tract"
Effects of fundamental frequency and vocal tract resonance on speech recognition in noise by non-native listeners
2025
The present study examined the influence of changes in speakers' fundamental frequency (fo) and vocal tract resonance (VTR) on speech recognition in different types of noise by non-native listeners. The goal was to identify whether the fo–VTR relationship has a similar effect on non-native listeners as it does on native listeners. Twenty-six adults who were native Mandarin speakers learning English as a second language were presented with English Hearing-in-Noise Test (HINT) sentences in four voice conditions, with the original male speaker's fo doubled and/or VTR scaled up by a factor of 1.2: (1) low fo low VTR (LfoLVTR, the original recordings); (2) low fo high VTR (LfoHVTR); (3) high fo high VTR (HfoHVTR); and (4) high fo low VTR (HfoLVTR). The stimuli were presented in speech-shaped noise (SSN) and four-talker babble (FTB) at signal-to-noise ratios of −3, 0, and +3 dB. The results showed that the non-native listeners performed more poorly with fo–VTR mismatched voices than with fo–VTR matched voices, and the negative influence of mismatched voice features was mainly manifested in the HfoLVTR condition. Compared to SSN, FTB had a greater adverse impact on the non-native listeners' recognition accuracy. Further, the performance difference between matched and mismatched conditions showed distinct patterns across SSN and FTB.
Journal Article
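The four voice conditions in the study above follow from two binary manipulations: doubling fo and scaling the vocal tract resonances by 1.2. A minimal sketch of the condition grid; the baseline fo and formant values are illustrative assumptions, not taken from the study's recordings:

```python
# Sketch of the four voice conditions (LfoLVTR, LfoHVTR, HfoHVTR, HfoLVTR).
# Baseline values are assumed for illustration only.
BASE_FO = 120.0                          # assumed original male fo in Hz
BASE_FORMANTS = (730.0, 1090.0, 2440.0)  # assumed F1-F3 in Hz

def voice_condition(high_fo: bool, high_vtr: bool):
    """Return (fo, formants) after doubling fo and/or scaling resonances by 1.2."""
    fo = BASE_FO * (2.0 if high_fo else 1.0)
    formants = tuple(f * (1.2 if high_vtr else 1.0) for f in BASE_FORMANTS)
    return fo, formants

conditions = {
    "LfoLVTR": voice_condition(False, False),  # original recordings
    "LfoHVTR": voice_condition(False, True),
    "HfoHVTR": voice_condition(True, True),
    "HfoLVTR": voice_condition(True, False),   # the mismatch that hurt most
}
```

The HfoLVTR cell pairs a doubled fo with unscaled resonances, which is the fo–VTR mismatch the abstract identifies as most harmful.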
Multilingual speech-to-vocal tract visualization using deep learning for pronunciation training
2025
Visualizing the vocal tract during speech remains a challenging task, even with recent advancements in open-source algorithms and datasets. A key limitation is the lack of multimodal resources that integrate audio with internal articulatory structures, which poses challenges to the development of effective speech visualization methods. In this work, we propose a novel algorithm that translates speech into dynamic vocal tract movements, enabling users to visually compare their articulation against a correctly pronounced reference. This approach is particularly beneficial for language learners and individuals with congenital deafness, as it provides an intuitive visualization of speech production. To address the dataset gap, we construct a new corpus that maps audio to articulatory parameters used in VocalTractLab. Leveraging pretrained models such as Wav2Vec 2.0 and HuBERT for audio feature extraction, we frame the task as a sequence-to-sequence learning problem and employ a Bi-GRU to predict the corresponding articulatory parameters. Furthermore, we develop language-specific datasets for General American English, Korean, and Brazilian Portuguese using the same pipeline, and train dedicated models for each. We validate the effectiveness of our approach through both qualitative and quantitative evaluations and assess its practical utility in pronunciation correction tasks through a user survey. The goal of the survey was to evaluate the perceived helpfulness and usability of the visualizations, not to implement an iterative pronunciation correction process.
Journal Article
Design of a computational method to optimise acoustic output of the human vocal tract
2024
The influence of the geometric configuration of the human vocal tract (HVT) on the distribution of acoustic energy during phonation of the vowel [a:] has been analysed. Computationally efficient mathematical models of the HVT were assembled from super-elements and an isoparametric element with a higher-degree polynomial shape function. The assembled models enable quick and easy geometric reconfiguration of the HVT, and they can be used in the time-consuming optimization process whose aim is to find a geometric configuration of the HVT that generates the so-called singer's formant.
Journal Article
Domestic dogs (Canis lupus familiaris) are sensitive to the correlation between pitch and timbre in human speech
2022
The perceived pitch of human voices is highly correlated with the fundamental frequency (f0) of the laryngeal source, which is determined largely by the length and mass of the vocal folds. The vocal folds are larger in adult males than in adult females, and men’s voices consequently have a lower pitch than women’s. The length of the supralaryngeal vocal tract (vocal-tract length; VTL) affects the resonant frequencies (formants) of speech which characterize the timbre of the voice. Men’s longer vocal tracts produce lower frequency, and less dispersed, formants than women’s shorter vocal tracts. Pitch and timbre combine to influence the perception of speaker characteristics such as size and age. Together, they can be used to categorize speaker sex with almost perfect accuracy. While it is known that domestic dogs can match a voice to a person of the same sex, there has been no investigation into whether dogs are sensitive to the correlation between pitch and timbre. We recorded a female voice giving three commands (‘Sit’, ‘Lay down’, ‘Come here’), and manipulated the recordings to lower the fundamental frequency (thus lowering pitch), increase simulated VTL (hence affecting timbre), or both (synthesized adult male voice). Dogs responded to the original adult female and synthesized adult male voices equivalently. Their tendency to obey the commands was, however, reduced when either pitch or timbre was manipulated alone. These results suggest that dogs are sensitive to both the pitch and timbre of human voices, and that they learn about the natural covariation of these perceptual attributes.
Journal Article
Vowel Acoustic Space Development in Children: A Synthesis of Acoustic and Anatomic Data
2007
Contact author: Houri K. Vorperian, 481 Waisman Center, 1500 Highland Avenue, Madison, WI 53705. E-mail: vorperian{at}waisman.wisc.edu .
Purpose: This article integrates published acoustic data on the development of vowel production. Age-specific data on formant frequencies are considered in light of information on the development of the vocal tract (VT) to create an anatomic–acoustic description of the maturation of the vowel acoustic space for English.
Method: Literature searches identified 14 studies reporting data on vowel formant frequencies. Data on corner vowels are summarized graphically to show age- and sex-related changes in the area and shape of the traditional vowel quadrilateral.
Conclusions: Vowel development is expressed as follows: (a) establishment of a language-appropriate acoustic representation (e.g., F1–F2 quadrilateral or F1–F2–F3 space), (b) gradual reduction in formant frequencies and F1–F2 area with age, (c) reduction in formant-frequency variability, (d) emergence of male–female differences in formant frequency by age 4 years with more apparent differences by 8 years, (e) jumps in formant frequency at ages corresponding to growth spurts of the VT, and (f) a decline of f0 after age 1 year, with the decline being more rapid during early childhood and adolescence. Questions remain about optimal procedures for VT normalization and the exact relationship between VT growth and formant frequencies. Comments are included on nasalization and vocal fundamental frequency as they relate to the development of vowel production.
KEY WORDS: vowels, speech development, formant frequencies, nasalization, vocal fundamental frequency, vocal tract development
Journal Article
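The F1–F2 quadrilateral area tracked in studies like the one above is conventionally computed with the shoelace formula over the four corner vowels. A minimal sketch; the corner-vowel formant values below are illustrative adult figures, not data from this article:

```python
def quad_area(points):
    """Shoelace (surveyor's) formula: area of a polygon given ordered vertices."""
    n = len(points)
    s = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

# Corner vowels /i/, /ae/, /a/, /u/ as (F1, F2) pairs in Hz (assumed values).
corners = [(270, 2290), (660, 1720), (730, 1090), (300, 870)]
area_hz2 = quad_area(corners)  # vowel-space area in Hz^2
```

Shrinking of this area with age, as formant frequencies drop, is exactly the trend summarized in conclusion (b) above.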
Clarification of the Acoustic Characteristics of Velopharyngeal Insufficiency by Acoustic Simulation Using the Boundary Element Method: A Pilot Study
by Mishima, Katsuaki; Shiraishi, Mami; Umeda, Hirotsugu
in Acoustic simulation; Acoustics; articulation
2025
A model of the vocal tract that mimicked velopharyngeal insufficiency was created, and acoustic analysis was performed using the boundary element method to clarify the acoustic characteristics of velopharyngeal insufficiency. The participants were six healthy adults. Computed tomography (CT) images were taken from the frontal sinus to the glottis during phonation of the Japanese vowels /i/ and /u/, and models of the vocal tracts were created from the CT data. To recreate velopharyngeal insufficiency, nasopharyngeal coupling was introduced in vocal tract models with no nasopharyngeal coupling, and the coupling site was enlarged in models that already had nasopharyngeal coupling. The vocal tract models were extended virtually by 12 cm in a cylindrical shape to represent the region from the lower part of the glottis to the tracheal bifurcation. The Kirchhoff–Helmholtz integral equation was used for the wave equation, and the boundary element method was used for discretization. Frequency response curves from 1 to 3000 Hz were calculated by applying the boundary element method. The curves showed the appearance of a pole–zero pair around 500 Hz, increased intensity around 250 Hz, decreased intensity around 500 Hz, decreased intensities of the first and second formants (F1 and F2), and a lower frequency of F2. Of these findings, the increased intensity around 250 Hz, decreased intensity around 500 Hz, decreased intensities of F1 and F2, and lower frequency of F2 agree with the previously reported acoustic characteristics of hypernasality.
Journal Article
Formant frequencies and bandwidths of the vocal tract transfer function are affected by the mechanical impedance of the vocal tract wall
by Fleischer, Mario; Mattheus, Willy; Mürbe, Dirk
in Acoustic impedance; Acoustic properties; Acoustics
2015
The acoustical properties of the vocal tract, the air-filled cavity between the vocal folds and the mouth opening, are determined by its individual geometry, the physical properties of the air and of its boundaries. In this article, we address the necessity of complex impedance boundary conditions at the mouth opening and at the border of the acoustical domain inside the human vocal tract. Using finite element models based on MRI data for spoken and sung vowels /a/, /i/ and / /, and comparison of the transfer characteristics by analysis of acoustical data using an inverse filtering method, the global wall impedance showed a frequency-dependent behaviour that depends on the produced vowel and therefore on the individual vocal tract geometry. The values of the normalised inertial component (represented by the imaginary part of the impedance) ranged from 250 g/m² at frequencies higher than about 3 kHz up to about 2.5 × 10⁵ g/m² in the mid-frequency range around 1.5–3 kHz. In contrast, the normalised dissipation (represented by the real part of the impedance) ranged from 65 to 4.5 × 10⁵ Ns/m³. These results indicate that structures enclosing the vocal tract (e.g. oral and pharyngeal mucosa and muscle tissues), especially their mechanical properties, influence the transfer of the acoustical energy and the position and bandwidth of the formant frequencies. This implies that the timbre characteristics of vowel sounds are likely to be tuned by specific control of relaxation and strain of the surrounding structures of the vocal tract.
Journal Article
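The rigid-wall baseline against which such wall-impedance effects on formants are usually discussed is the uniform closed–open tube, whose resonances fall at the odd quarter-wave frequencies Fₙ = (2n − 1)c / 4L. A quick sketch with assumed values (a 17 cm adult tract and the sound speed in warm, humid air), not parameters from this article:

```python
def tube_formants(length_m: float = 0.17, c: float = 350.0, n: int = 3):
    """Resonances of a uniform tube closed at the glottis, open at the lips:
    F_n = (2n - 1) * c / (4 * L), the classic quarter-wave formant estimate."""
    return [(2 * k - 1) * c / (4 * length_m) for k in range(1, n + 1)]

formants = tube_formants()  # roughly 515, 1544, 2574 Hz for the assumed values
```

Yielding wall boundaries, as quantified above, shift and broaden these idealized resonances, which is why complex impedance boundary conditions matter.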
In domain training data augmentation on noise robust Punjabi Children speech recognition
2022
Building a successful automatic speech recognition (ASR) engine requires large amounts of training data. This increases training complexity and becomes infeasible for a low-resource language like Punjabi, which has no children's speech corpus. Consequently, the issue of data scarcity, together with the short vocal tract length of child speakers, degrades system performance under limited-data conditions. Punjabi is, moreover, a tonal language, and building an optimized ASR engine for such a language is especially difficult. In this paper, we explore a fused feature extraction approach to handle the large training complexity, using a mel frequency–gammatone frequency cepstral coefficient (MF-GFCC) technique with feature warping. Efforts have been made to develop a children's ASR engine using data augmentation under limited-data scenarios. For that purpose, we study in-domain data augmentation that artificially combines noisy and clean corpora to overcome data scarcity in the training set. The combined dataset is processed with the fused feature extraction approach. In addition, the tonal characteristics and child vocal tract length issues are addressed by inducing pitch features and a training normalization strategy based on vocal tract length normalization (VTLN). Combining augmented and original speech signals is found to reduce the word error rate (WER), with relative improvements (RI) of 20.59% under noisy and 19.39% under clean conditions using the hybrid MF-GFCC approach, compared with conventional mel frequency cepstral coefficient (MFCC) and gammatone frequency cepstral coefficient (GFCC) based ASR systems.
Journal Article
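The VTLN step mentioned in the abstract above is commonly implemented as a piecewise-linear frequency warp: frequencies are scaled by a speaker-specific factor α up to a breakpoint, then joined linearly so the Nyquist edge maps to itself. The exact form used in the paper is not given, so this is a generic textbook sketch:

```python
def vtln_warp(freq: float, alpha: float,
              f_max: float = 8000.0, cut: float = 0.875) -> float:
    """Piecewise-linear VTLN warp: multiply by alpha below a breakpoint,
    then interpolate linearly so that f_max maps exactly to f_max."""
    b = min(cut * f_max, cut * f_max / alpha)  # breakpoint of the linear segment
    if freq <= b:
        return alpha * freq
    # linear segment from (b, alpha * b) up to (f_max, f_max)
    return alpha * b + (f_max - alpha * b) * (freq - b) / (f_max - b)
```

Child speech is typically warped with α > 1 to compensate for the higher formants produced by a shorter vocal tract; α = 1 leaves the spectrum unchanged.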
An Acoustic Simulation Method of the Japanese Vowels /i/ and /u/ by Using the Boundary Element Method
by Mishima, Katsuaki; Shiraishi, Mami; Umeda, Hirotsugu
in Acoustic simulation; Acoustics; articulation
2023
This study aimed to establish and verify the validity of an acoustic simulation method for sustained phonation of the Japanese vowels /i/ and /u/. The study participants were six healthy adults. First, vocal tract models covering the range from the frontal sinus to the glottis were constructed from computed tomography (CT) data acquired during sustained phonation of /i/ and /u/. To imitate the trachea, the vocal tract models were then virtually extended by 12 cm with cylindrical shapes representing the region between the lower part of the glottis and the tracheal bifurcation. Next, the boundary element method and the Kirchhoff–Helmholtz integral equation were used for discretization and to represent the wave equation for sound propagation, respectively. As a result, the relative differences of the simulated vowel formant frequencies for /i/ and /u/ from the actual voices were 1.1–10.2% and 0.4–9.3% for the first formant and 3.9–7.5% and 5.0–12.5% for the second formant, respectively. In the vocal tract model with nasal coupling, a pole–zero pair was observed at around 500 Hz, and for both /i/ and /u/ a pole–zero pair was observed at around 1000 Hz regardless of the presence or absence of nasal coupling. Therefore, the boundary element method, which produces solutions by analysing boundary problems rather than the full three-dimensional domain, was considered effective for simulating the Japanese vowels /i/ and /u/ with high validity for vocal tract models encompassing a wide range, from the frontal sinuses to the trachea, constructed from CT data obtained during sustained phonation.
Journal Article
Pushes and pulls from below: Anatomical variation, articulation and sound change
2019
This paper argues that inter-individual and inter-group variation in language acquisition, perception, processing and production, rooted in our biology, may play a largely neglected role in sound change. We begin by discussing the patterning of these differences, highlighting those related to vocal tract anatomy with a foundation in genetics and development. We use our ArtiVarK database, a large multi-ethnic sample comprising 3D intraoral optical scans, as well as structural, static and real-time MRI scans of vocal tract anatomy and speech articulation, to quantify the articulatory strategies used to produce the North American English /r/ and to statistically show that anatomical factors seem to influence these articulatory strategies. Building on work showing that these alternative articulatory strategies may have indirect coarticulatory effects, we propose two models for how biases due to variation in vocal tract anatomy may affect sound change. The first involves direct overt acoustic effects of such biases that are then reinterpreted by the hearers, while the second is based on indirect coarticulatory phenomena generated by acoustically covert biases that produce overt “at-a-distance” acoustic effects. This view implies that speaker communities might be “poised” for change because they always contain pools of “standing variation” of such biased speakers, and when factors such as the frequency of the biased speakers in the community, their positions in the communicative network or the topology of the network itself change, sound change may rapidly follow as a self-reinforcing network-level phenomenon, akin to a phase transition. Thus, inter-speaker variation in structured and dynamic communicative networks may couple the initiation and actuation of sound change.
Journal Article