Catalogue Search | MBRL
Explore the vast range of titles available.
2,898 result(s) for "Lipreading"
Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices
by Ivanko, Denis; Ryumin, Dmitry; Ryumina, Elena
in Accuracy; Acoustics; audio-visual speech recognition
2023
Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable speech recognition, particularly when audio is corrupted by noise. Additional visual information can be used for both automatic lip-reading and gesture recognition. Hand gestures are a form of non-verbal communication and can be used as a very important part of modern human–computer interaction systems. Currently, audio and video modalities are easily accessible by sensors of mobile devices. However, there is no out-of-the-box solution for automatic audio-visual speech and gesture recognition. This study introduces two deep neural network-based model architectures: one for AVSR and one for gesture recognition. The main novelty regarding audio-visual speech recognition lies in fine-tuning strategies for both visual and acoustic features and in the proposed end-to-end model, which considers three modality fusion approaches: prediction-level, feature-level, and model-level. The main novelty in gesture recognition lies in a unique set of spatio-temporal features, including those that consider lip articulation information. As there are no available datasets for the combined task, we evaluated our methods on two different large-scale corpora—LRW and AUTSL—and outperformed existing methods on both audio-visual speech recognition and gesture recognition tasks. We achieved AVSR accuracy for the LRW dataset equal to 98.76% and gesture recognition rate for the AUTSL dataset equal to 98.56%. The results obtained demonstrate not only the high performance of the proposed methodology, but also the fundamental possibility of recognizing audio-visual speech and gestures by sensors of mobile devices.
Journal Article
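The three modality fusion approaches named in the abstract above (prediction-level, feature-level, and model-level) can be illustrated with a toy NumPy sketch. All weights, dimensions, and the elementwise combination in the model-level case are invented for illustration; this is not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality feature vectors for one sample (stand-ins for the
# outputs of trained audio and visual sub-networks).
audio_feat = rng.standard_normal(64)
visual_feat = rng.standard_normal(64)
n_classes = 10

# Hypothetical linear classifier heads.
W_audio = rng.standard_normal((n_classes, 64))
W_visual = rng.standard_normal((n_classes, 64))
W_joint = rng.standard_normal((n_classes, 128))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# 1) Prediction-level (late) fusion: run each modality through its own
#    classifier, then average the class probabilities.
pred_fusion = 0.5 * (softmax(W_audio @ audio_feat) + softmax(W_visual @ visual_feat))

# 2) Feature-level (early) fusion: concatenate the feature vectors and
#    classify the joint representation once.
feat_fusion = softmax(W_joint @ np.concatenate([audio_feat, visual_feat]))

# 3) Model-level fusion: exchange/combine intermediate hidden states of
#    the sub-models (here a toy elementwise sum) before a shared head.
h_audio = np.tanh(audio_feat)
h_visual = np.tanh(visual_feat)
model_fusion = softmax(W_audio @ (h_audio + h_visual))

# Each strategy yields a probability distribution over classes.
print(pred_fusion.argmax(), feat_fusion.argmax(), model_fusion.argmax())
```

The practical trade-off the sketch hints at: late fusion keeps the modalities independent (robust when one stream is corrupted), while early and model-level fusion let the classifier exploit cross-modal correlations.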
SpeakingFaces: A Large-Scale Multimodal Dataset of Voice Commands with Visual and Thermal Video Streams
2021
We present SpeakingFaces, a publicly available large-scale multimodal dataset developed to support machine learning research in contexts that utilize a combination of thermal, visual, and audio data streams; examples include human–computer interaction, biometric authentication, recognition systems, domain transfer, and speech recognition. SpeakingFaces comprises aligned high-resolution thermal and visual-spectrum image streams of fully-framed faces synchronized with audio recordings of each subject speaking approximately 100 imperative phrases. Data were collected from 142 subjects, yielding over 13,000 instances of synchronized data (∼3.8 TB). For technical validation, we demonstrate two baseline examples. The first baseline shows classification by gender, utilizing different combinations of the three data streams in both clean and noisy environments. The second example consists of thermal-to-visual facial image translation, as an instance of domain transfer.
Journal Article
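The core bookkeeping behind a dataset like the one above — pairing thermal, visual, and audio recordings per (subject, phrase) and keeping only fully synchronized instances — can be sketched generically. The filename convention here is invented for illustration; the released dataset's actual layout may differ.

```python
import re
from collections import defaultdict

# Hypothetical filename convention: <modality>_sub<ID>_phr<ID>.<ext>
files = [
    "visual_sub001_phr042.mp4",
    "thermal_sub001_phr042.mp4",
    "audio_sub001_phr042.wav",
    "visual_sub002_phr007.mp4",  # missing thermal and audio partners
]

pattern = re.compile(r"(visual|thermal|audio)_sub(\d+)_phr(\d+)\.\w+")

# Group files by (subject, phrase) key, recording which modality each covers.
index = defaultdict(dict)
for name in files:
    m = pattern.fullmatch(name)
    if m:
        modality, sub, phr = m.group(1), int(m.group(2)), int(m.group(3))
        index[(sub, phr)][modality] = name

# Keep only fully synchronized instances (all three modalities present).
complete = {k: v for k, v in index.items() if len(v) == 3}
print(sorted(complete))  # [(1, 42)]
```

Filtering to complete triples up front is what lets downstream baselines (e.g. the gender classifier over stream combinations) assume every instance carries all three streams.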
Influence of surgical and N95 face masks on speech perception and listening effort in noise
2021
Daily-life conversation relies on speech perception in quiet and noise. Because of the COVID-19 pandemic, face masks have become mandatory in many situations. Acoustic attenuation of sound pressure by the mask tissue reduces speech perception ability, especially in noisy situations. Masks also can impede the process of speech comprehension by concealing the movements of the mouth, interfering with lip reading. In this prospective observational, cross-sectional study including 17 participants with normal hearing, we measured the influence of acoustic attenuation caused by medical face masks (mouth and nose protection) according to EN 14683 and of N95 masks according to EN 1149 on the speech recognition threshold and listening effort in various types of background noise. Averaged over all noise signals, a surgical mask significantly reduced the speech perception threshold in noise by 1.6 dB (95% confidence interval [CI], 1.0, 2.1) and an N95 mask reduced it significantly by 2.7 dB (95% CI, 2.2, 3.2). Use of a surgical mask did not significantly increase the 50% listening effort signal-to-noise ratio (increase of 0.58 dB; 95% CI, 0.4, 1.5), but use of an N95 mask did so significantly, by 2.2 dB (95% CI, 1.2, 3.1). In acoustic measures, mask tissue reduced amplitudes by up to 8 dB at frequencies above 1 kHz, whereas no reduction was observed below 1 kHz. We conclude that face masks reduce speech perception and increase listening effort in different noise signals. Together with additional interference because of impeded lip reading, the compound effect of face masks could have a relevant impact on daily life communication even in those with normal hearing.
Journal Article