Search Results

2,898 results for "Lipreading"
Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices
Audio-visual speech recognition (AVSR) is one of the most promising approaches to reliable speech recognition, particularly when the audio is corrupted by noise. The additional visual information can be used for both automatic lip-reading and gesture recognition. Hand gestures are a form of non-verbal communication and an important part of modern human–computer interaction systems. Audio and video modalities are now easily captured by mobile device sensors, yet there is no out-of-the-box solution for automatic audio-visual speech and gesture recognition. This study introduces two deep neural network-based model architectures: one for AVSR and one for gesture recognition. The main novelty on the AVSR side lies in fine-tuning strategies for both visual and acoustic features and in the proposed end-to-end model, which considers three modality fusion approaches: prediction-level, feature-level, and model-level. The main novelty in gesture recognition lies in a unique set of spatio-temporal features, including those that capture lip articulation. As no dataset exists for the combined task, we evaluated our methods on two large-scale corpora, LRW and AUTSL, and outperformed existing methods on both the audio-visual speech recognition and gesture recognition tasks. We achieved 98.76% AVSR accuracy on the LRW dataset and a 98.56% gesture recognition rate on the AUTSL dataset. These results demonstrate not only the high performance of the proposed methodology but also the fundamental possibility of recognizing audio-visual speech and gestures with mobile device sensors.
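The three fusion approaches named in this abstract can be illustrated with a minimal Python/PyTorch sketch of feature-level fusion: each modality is encoded separately, and the embeddings are concatenated before classification. The encoder types, dimensions, and class count below are illustrative assumptions, not the architecture from the paper.

    import torch
    import torch.nn as nn

    class FeatureLevelFusion(nn.Module):
        """Minimal sketch of feature-level audio-visual fusion.

        Each modality is encoded separately; the embeddings are
        concatenated and classified jointly. All dimensions are
        illustrative assumptions, not taken from the paper.
        """
        def __init__(self, audio_dim=256, visual_dim=256, n_classes=500):
            super().__init__()
            self.audio_encoder = nn.GRU(40, audio_dim, batch_first=True)    # e.g. 40 mel bands per frame
            self.visual_encoder = nn.GRU(136, visual_dim, batch_first=True) # e.g. flattened lip landmarks
            self.classifier = nn.Linear(audio_dim + visual_dim, n_classes)

        def forward(self, audio, visual):
            # Use the final hidden state of each encoder as the utterance embedding.
            _, a_h = self.audio_encoder(audio)   # a_h: (1, batch, audio_dim)
            _, v_h = self.visual_encoder(visual)
            fused = torch.cat([a_h[-1], v_h[-1]], dim=-1)  # feature-level fusion
            return self.classifier(fused)

    # Example: a batch of 8 clips, 75 audio frames and 29 video frames each.
    model = FeatureLevelFusion()
    logits = model(torch.randn(8, 75, 40), torch.randn(8, 29, 136))
    print(logits.shape)  # torch.Size([8, 500])

Prediction-level fusion would instead average or vote over per-modality logits, and model-level fusion would exchange information between the encoders themselves; the concatenation point is the main design choice separating the three.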
SpeakingFaces: A Large-Scale Multimodal Dataset of Voice Commands with Visual and Thermal Video Streams
We present SpeakingFaces, a publicly available large-scale multimodal dataset developed to support machine learning research in contexts that utilize a combination of thermal, visual, and audio data streams; examples include human–computer interaction, biometric authentication, recognition systems, domain transfer, and speech recognition. SpeakingFaces comprises aligned high-resolution thermal and visual spectrum image streams of fully framed faces, synchronized with audio recordings of each subject speaking approximately 100 imperative phrases. Data were collected from 142 subjects, yielding over 13,000 instances of synchronized data (∼3.8 TB). For technical validation, we demonstrate two baselines. The first performs gender classification using different combinations of the three data streams in both clean and noisy environments. The second performs thermal-to-visual facial image translation, as an instance of domain transfer.
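For a dataset with three synchronized streams per utterance, a loader typically walks the subject/trial hierarchy and yields aligned triples. The sketch below assumes a hypothetical directory layout and file names; the actual SpeakingFaces release format may differ, so consult the dataset documentation.

    import pathlib

    def iter_aligned_samples(root):
        """Yield (thermal, visual, audio) path triples per utterance.

        Assumes a hypothetical layout:
          root/sub_<id>/trial_<n>/{thermal.avi, visual.avi, audio.wav}
        This is an illustrative assumption, not the published format.
        """
        root = pathlib.Path(root)
        for trial in sorted(root.glob("sub_*/trial_*")):
            thermal = trial / "thermal.avi"
            visual = trial / "visual.avi"
            audio = trial / "audio.wav"
            # Only yield trials where all three modalities are present.
            if thermal.exists() and visual.exists() and audio.exists():
                yield thermal, visual, audio

    for t, v, a in iter_aligned_samples("speakingfaces"):
        print(t, v, a)  # feed the three synchronized streams to a model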
Influence of surgical and N95 face masks on speech perception and listening effort in noise
Daily-life conversation relies on speech perception in quiet and in noise. Because of the COVID-19 pandemic, face masks have become mandatory in many situations. Acoustic attenuation of sound pressure by the mask tissue reduces speech perception ability, especially in noisy situations. Masks can also impede speech comprehension by concealing the movements of the mouth, interfering with lip reading. In this prospective observational, cross-sectional study of 17 participants with normal hearing, we measured the influence of the acoustic attenuation caused by medical face masks (mouth and nose protection, per EN 14683) and N95 masks (per EN 1149) on the speech recognition threshold and listening effort in various types of background noise. Averaged over all noise signals, a surgical mask significantly reduced the speech perception threshold in noise by 1.6 dB (95% confidence interval [CI], 1.0, 2.1), and an N95 mask reduced it significantly by 2.7 dB (95% CI, 2.2, 3.2). Use of a surgical mask did not significantly increase the 50% listening effort signal-to-noise ratio (increase of 0.58 dB; 95% CI, −0.4, 1.5), but use of an N95 mask did so significantly, by 2.2 dB (95% CI, 1.2, 3.1). In acoustic measurements, the mask tissue reduced amplitudes by up to 8 dB at frequencies above 1 kHz, whereas no reduction was observed below 1 kHz. We conclude that face masks reduce speech perception and increase listening effort in different noise signals. Together with the additional interference from impeded lip reading, the compound effect of face masks could have a relevant impact on daily-life communication, even for those with normal hearing.
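The spectral finding lends itself to a quick simulation: attenuating everything above 1 kHz by 8 dB approximates the reported tissue effect. The sketch below uses a flat attenuation above the cutoff, which is a simplifying assumption, not the measured mask transfer function.

    import numpy as np

    def simulate_mask_attenuation(signal, sample_rate, atten_db=8.0, cutoff_hz=1000.0):
        """Crudely simulate mask tissue attenuation on an audio signal.

        The study reports up to ~8 dB amplitude reduction above 1 kHz
        and none below; here a flat atten_db is applied above cutoff_hz,
        a simplification of the actual frequency response.
        """
        spectrum = np.fft.rfft(signal)
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
        gain = np.where(freqs >= cutoff_hz, 10 ** (-atten_db / 20.0), 1.0)
        return np.fft.irfft(spectrum * gain, n=len(signal))

    # Example: 1 s of noise at 16 kHz, "spoken through" a simulated mask.
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)
    masked = simulate_mask_attenuation(clean, 16000)

Filtering speech stimuli this way is one means of approximating mask-degraded audio for perception experiments without physically re-recording through mask tissue.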