8,836 result(s) for "speech classification"
Exploring Spectrogram-Based Audio Classification for Parkinson’s Disease: A Study on Speech Classification and Qualitative Reliability Verification
Patients with Parkinson’s disease suffer from voice impairment. In this study, we introduce models that classify speakers as healthy or as having Parkinson’s disease from their speech. We used two models: AST (audio spectrogram transformer), a transformer-based speech classification model that has recently outperformed CNN-based models in many fields, and PSLA (pretraining, sampling, labeling, and aggregation), a high-performance CNN-based model in the existing speech classification field. This study compares and analyzes the models from both quantitative and qualitative perspectives. First, quantitatively, PSLA outperformed AST by more than 4% in accuracy, and the AUC was also higher, with 94.16% for AST and 97.43% for PSLA. Furthermore, we qualitatively evaluated the ability of the models to capture the acoustic features of Parkinson’s through various CAM (class activation map)-based XAI (eXplainable AI) methods such as GradCAM and EigenCAM. Based on PSLA, we found that the model focuses well on the muffled frequency band of Parkinson’s speech, and the heatmap analysis of false positives and false negatives shows that the speech features are also visually represented when the model makes incorrect predictions. The contribution of this paper is that we not only found a suitable model for diagnosing Parkinson’s through speech using two different types of models but also validated the predictions of the model in practice.
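The abstract above compares the two models by accuracy and AUC. As a rough illustration of what those two metrics measure, the sketch below computes both from classifier scores. The function names, the 0.5 threshold, and the toy labels and scores are assumptions made for illustration; none of them come from the study.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the true label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = Parkinson's speech, 0 = healthy speech (hypothetical data).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

print(accuracy(toy_labels, toy_scores))
print(roc_auc(toy_labels, toy_scores))
```

Accuracy depends on the chosen threshold, whereas AUC depends only on how the scores rank positives against negatives, which is why the two figures are usually reported together.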
A review on speech processing using machine learning paradigm
Speech processing plays a crucial role in many signal processing applications, and the last decade has brought enormous progress based on the machine learning paradigm. Speech processing has a close relationship with computational linguistics, human–machine interaction, natural language processing, and psycholinguistics. This review article mainly discusses the feature extraction techniques and machine learning classifiers employed in speech processing and recognition activities. The performance of several machine learning techniques is validated for a speech emotion recognition application on the Berlin EmoDB database. Further, it gives the broad application areas and challenges in machine learning for speech processing.
Development of novel automated language classification model using pyramid pattern technique with speech signals
Language classification from speech is a complex problem in machine learning and pattern recognition. Various text and image-based language classification methods have been presented, but there are limited speech-based language classification methods in the literature. Also, the previously presented models classified limited numbers of languages, and few are accents. This work presents an automated handcrafted language classification model. The novel pyramid pattern is presented for feature extraction. Also, statistical features and maximum pooling are used to generate the features. We have developed our speech-language classification model using two datasets: (i) a new large speech dataset that we created, containing 14,500 speech samples in 29 languages, and (ii) the VoxForge dataset. The neighborhood component analysis method is used to select the most informative 1000 features from the generated features, and these features are classified using a quadratic support vector machine classifier (QSVM). Our developed method yielded 98.87 ± 0.30% and 97.12 ± 1.27% accuracies for our and VoxForge datasets, respectively. Also, geometric mean, average precision, and F1-score evaluation parameters are calculated, and they are presented in the results section. This paper presents an accurate language classification model developed using two large speech-language datasets. Our results indicate the success of the proposed pyramid pattern-based language classification method in classifying various speech languages accurately.
From pronounced to imagined: improving speech decoding with multi-condition EEG data
Imagined speech decoding using EEG holds promising applications for individuals with motor neuron diseases, although its performance remains limited due to small dataset sizes and the absence of sensory feedback. Here, we investigated whether incorporating EEG data from pronounced speech could enhance imagined speech classification. Our approach systematically compares four classification scenarios by modifying the training dataset: intra-subject (using only imagined speech, combining pronounced and imagined speech, and using only pronounced speech) and multi-subject (combining pronounced speech data from different participants with the imagined speech of the target participant). We implemented all scenarios using the convolutional neural network EEGNet. To this end, twenty-four healthy participants pronounced and imagined five Spanish words. In binary word-pair classifications, combining pronounced and imagined speech data in the intra-subject scenario led to accuracy improvements of 3%-5.17% in four out of 10 word pairs, compared to training with imagined speech only. Although the highest individual accuracy (95%) was achieved with imagined speech alone, the inclusion of pronounced speech data allowed more participants to surpass 70% accuracy, increasing from 10 to 15 participants. In the intra-subject multi-class scenario, combining pronounced and imagined speech did not yield statistically significant improvements over using imagined speech exclusively. Finally, we observed that features such as word length, phonological complexity, and frequency of use contributed to higher discriminability between certain word pairs. These findings suggest that incorporating pronounced speech data can improve imagined speech decoding in individualized models, offering a feasible strategy to support the early adoption of brain-computer interfaces before speech deterioration occurs in individuals with motor neuron diseases.
Innovative Speech-Based Deep Learning Approaches for Parkinson’s Disease Classification: A Systematic Review
Parkinson’s disease (PD), the second most prevalent neurodegenerative disorder worldwide, frequently presents with early-stage speech impairments. Recent advancements in Artificial Intelligence (AI), particularly deep learning (DL), have significantly enhanced PD diagnosis through the analysis of speech data. Nevertheless, the progress of research is restricted by the limited availability of publicly accessible speech-based PD datasets, primarily due to privacy concerns. The goal of this systematic review is to explore the current landscape of speech-based DL approaches for PD classification, based on 33 scientific works published between January 2020 and March 2024. We discuss their available resources, capabilities, and potential limitations, and issues related to bias, explainability, and privacy. Furthermore, this review provides an overview of publicly accessible speech-based datasets and open-source material for PD. The DL approaches identified are categorized into end-to-end (E2E) learning, transfer learning (TL), and deep acoustic feature extraction (DAFE). Among E2E approaches, Convolutional Neural Networks (CNNs) are prevalent, though Transformers are increasingly popular. E2E approaches face challenges such as limited data and computational resources, especially with Transformers. TL addresses these issues by providing more robust PD diagnosis and better generalizability across languages. DAFE aims to improve the explainability and interpretability of results by examining the specific effects of deep features on both other DL approaches and more traditional machine learning (ML) methods. However, it often underperforms compared to E2E and TL approaches.
Characteristics of Disfluency Clusters Over Time in Preschool Children Who Stutter
Purpose: Disfluency clusters in preschool children were analyzed to determine whether they occurred at rates above chance, whether they changed over time, and whether they could differentiate children who would later persist in, or recover from, stuttering. Method: Thirty-two children recruited near stuttering onset were grouped on the basis of their eventual course of stuttering and matched to 16 normally fluent children. Clusters were classified as stuttering-like disfluencies (SLD), other disfluencies (OD), or mixed (SLD and OD combined). Cluster frequency and length were calculated for all children and again after 6 months for those who stuttered. Results: Clusters occurred at rates greater than chance for both stuttering and normally fluent children. Children who stuttered had significantly more and longer clusters than did normally fluent children. Close to stuttering onset, clusters did not differentiate the course of stuttering. Cluster frequency and length decreased over time for children in the persistent and recovered groups. The proportion of disfluencies in clusters was significantly lower in the recovered group than it was in the persistent group after 6 months. Conclusions: Clusters are an integral part of disfluent speech in preschool children in general. Although they do not serve as indicators of recovery or persistency at the onset of stuttering, they may have some prognostic value several months later.
Articulatory-to-Acoustic Relations in Response to Speaking Rate and Loudness Manipulations
Purpose: In this investigation, the authors determined the strength of association between tongue kinematic and speech acoustics changes in response to speaking rate and loudness manipulations. Performance changes in the kinematic and acoustic domains were measured using two aspects of speech production presumably affecting speech clarity: phonetic specification and variability. Method: Tongue movements for the vowels /ia/ were recorded in 10 healthy adults during habitual, fast, slow, and loud speech using three-dimensional electromagnetic articulography. To determine articulatory-to-acoustic relations for phonetic specification, the authors correlated changes in lingual displacement with changes in acoustic vowel distance. To determine articulatory-to-acoustic relations for phonetic variability, the authors correlated changes in lingual movement variability with changes in formant movement variability. Results: A significant positive linear association was found for kinematic and acoustic specification but not for kinematic and acoustic variability. Several significant speaking task effects were also observed. Conclusion: Lingual displacement is a good predictor of acoustic vowel distance in healthy talkers. The weak association between kinematic and acoustic variability raises questions regarding the effects of articulatory variability on speech clarity and intelligibility, particularly in individuals with motor speech disorders.
Significance of relative phase features for shouted and normal speech classification
Shouted and normal speech classification plays an important role in many speech-related applications. The existing works are often based on magnitude-based features and ignore phase-based features, which are directly related to magnitude information. In this paper, the importance of phase-based features is explored for the detection of shouted speech. The novel contributions of this work are as follows. (1) Three phase-based features, namely, relative phase (RP), linear prediction analysis estimated speech-based RP (LPAES-RP) and linear prediction residual-based RP (LPR-RP) features, are explored for shouted and normal speech classification. (2) We propose a new RP feature, called the glottal source-based RP (GRP) feature. The main idea of the proposed GRP feature is to exploit the difference between RP and LPAES-RP features to detect shouted speech. (3) A score combination of phase- and magnitude-based features is also employed to further improve the classification performance. The proposed feature and combination are evaluated using the shouted normal electroglottograph speech (SNE-Speech) corpus. The experimental findings show that the RP, LPAES-RP, and LPR-RP features provide promising results for the detection of shouted speech. We also find that the proposed GRP feature can provide better results than those of the standard mel-frequency cepstral coefficient (MFCC) feature. Moreover, compared to using individual features, the score combination of the MFCC and RP/LPAES-RP/LPR-RP/GRP features yields an improved detection performance. Performance analysis under noisy environments shows that the score combination of the MFCC and the RP/LPAES-RP/LPR-RP features gives more robust classification. These outcomes show the importance of RP features in distinguishing shouted speech from normal speech.
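The abstract above mentions a score combination of phase- and magnitude-based features but does not state the combination rule. The sketch below shows one common form of score-level fusion, a weighted sum of per-utterance scores, purely as an illustration. The weighted-sum rule, the `weight` parameter, the zero decision threshold, and the toy scores are all assumptions, not the authors' method.

```python
def fuse_scores(magnitude_scores, phase_scores, weight=0.5):
    """Weighted sum of two per-utterance score lists (hypothetical fusion rule).

    weight is the share given to the magnitude-based (e.g. MFCC) scores;
    the remainder goes to the phase-based (e.g. RP) scores.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(magnitude_scores) != len(phase_scores):
        raise ValueError("both systems must score the same utterances")
    return [
        weight * m + (1.0 - weight) * p
        for m, p in zip(magnitude_scores, phase_scores)
    ]


def classify(fused_scores, threshold=0.0):
    """Label an utterance 'shouted' when its fused score exceeds the threshold."""
    return ["shouted" if s > threshold else "normal" for s in fused_scores]


# Made-up scores for three utterances (positive = more shout-like).
toy_magnitude = [1.0, -0.5, 0.2]
toy_phase = [0.5, -1.0, -0.6]

toy_fused = fuse_scores(toy_magnitude, toy_phase, weight=0.5)
print(classify(toy_fused))
```

In the third toy utterance the two systems disagree, and the fused score follows the more confident one. In practice the weight would typically be tuned on held-out data.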
Fidelity of Automatic Speech Processing for Adult and Child Talker Classifications
Automatic speech processing (ASP) has recently been applied to very large datasets of naturalistically collected, daylong recordings of child speech via an audio recorder worn by young children. The system developed by the LENA Research Foundation analyzes children's speech for research and clinical purposes, with a special focus on identifying and tagging family speech dynamics and the at-home acoustic environment from the auditory perspective of the child. A primary issue for researchers, clinicians, and families using the Language ENvironment Analysis (LENA) system is to what degree the segment labels are valid. This classification study evaluates the performance of the computer ASP output against 23 trained human judges who made about 53,000 judgements of classification of segments tagged by the LENA ASP. Results indicate performance consistent with modern ASP such as those using HMM methods, with acoustic characteristics of fundamental frequency and segment duration most important for both human and machine classifications. Results are likely to be important for interpreting and improving ASP output.
MARBERT-LSTM-Attention: A Hybrid Transformer Framework for Multi-Class Arabic Hate Speech Classification
Detecting hate speech in Arabic presents distinct computational challenges due to the complexity of the Arabic language and the scarcity of annotated datasets. This paper introduces MARBERT-LSTM-Attention, a hybrid transformer-based architecture that combines a pre-trained MARBERT model with BiLSTM layers and attention mechanisms, offering a powerful solution for accurate hate speech detection. We further present a novel multi-label Arabic hate speech dataset with 16,051 samples, created by merging and re-annotating the OSACT5 and MLMA corpora with three native Arabic experts, resolving ambiguities in some classes, and standardizing labels across seven sociocultural categories: disability, gender, ideology, origin, religion, social class, and NOT HS, thereby enabling clear distinctions between offensive language, general hate speech, and fine-grained target categories. Our experimental results indicate that the proposed model delivers cutting-edge performance, with weighted F1 scores of 91.1% for binary hate speech, 87.9% for multi-class classification, and 86.4% for offensive language detection. These results outperform baseline models, such as ML-based and DL-based models, as well as advanced transformer-based architectures, such as mBERT, mDistilBERT, AraBERT, CAMeLBERT, and MARBERT by a margin of 3.8%-16.4%. Statistical analyses confirmed that these improvements were statistically significant, particularly for minority classes. Through the implementation of data augmentation techniques, we effectively mitigated class imbalance, resulting in an enhancement of recall for minority classes by 12-33% while preserving high accuracy for majority classes, as evidenced by an F1-score of 91.7 for the NOT HS class.
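The abstract above reports weighted F1 scores. As a reminder of what that metric averages, the sketch below computes per-class F1 and weights each class by its number of true samples. The function name and the six toy labels are hypothetical; only the category names are taken from the abstract.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        total += f1 * count
    return total / len(y_true)


# Made-up labels using three of the category names from the abstract.
toy_true = ["religion", "religion", "gender", "NOT HS", "NOT HS", "NOT HS"]
toy_pred = ["religion", "gender", "gender", "NOT HS", "NOT HS", "religion"]

print(weighted_f1(toy_true, toy_pred))
```

Because each class is weighted by its support, a large majority class such as NOT HS can dominate the weighted score, which is why per-class recall for minority classes is reported separately.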