260 result(s) for "sound events processing"
Resource-Efficient Pet Dog Sound Events Classification Using LSTM-FCN Based on Time-Series Data
The use of IoT (Internet of Things) technology for the management of pet dogs left alone at home is increasing. This includes tasks such as automatic feeding, operation of play equipment, and location detection. Classification of the vocalizations of pet dogs using information from a sound sensor is an important method to analyze the behavior or emotions of dogs that are left alone. These sounds should be acquired by attaching the IoT sound sensor to the dog, and then classifying the sound events (e.g., barking, growling, howling, and whining). However, sound sensors tend to transmit large amounts of data and consume considerable amounts of power, which presents issues in the case of resource-constrained IoT sensor devices. In this paper, we propose a way to classify pet dog sound events and improve resource efficiency without significant degradation of accuracy. To achieve this, we only acquire the intensity data of sounds by using a relatively resource-efficient noise sensor. This presents issues as well, since it is difficult to achieve sufficient classification accuracy using only intensity data due to the loss of information from the sound events. To address this problem and avoid significant degradation of classification accuracy, we apply long short-term memory-fully convolutional network (LSTM-FCN), which is a deep learning method, to analyze time-series data, and exploit bicubic interpolation. Based on experimental results, the proposed method based on noise sensors (i.e., Shapelet and LSTM-FCN for time-series) was found to improve energy efficiency by 10 times without significant degradation of accuracy compared to typical methods based on sound sensors (i.e., mel-frequency cepstrum coefficient (MFCC), spectrogram, and mel-spectrum for feature extraction, and support vector machine (SVM) and k-nearest neighbor (K-NN) for classification).
Metrics for Polyphonic Sound Event Detection
This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.
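As an illustration of the segment-based evaluation this abstract describes, the sketch below computes a segment-based F-score and error rate for overlapping events. It is not taken from the paper's toolbox: the function name `segment_based_scores`, the set-per-segment input format, and the substitution/deletion/insertion counting are assumptions based on the commonly used definitions of these metrics.

```python
def segment_based_scores(reference, estimated):
    """Return (F1, error rate) over fixed-length segments.

    `reference` and `estimated` are lists of sets, one set per segment,
    holding the class labels active in that segment (hypothetical format).
    """
    if len(reference) != len(estimated):
        raise ValueError("reference and estimated must cover the same segments")

    tp = fp = fn = 0
    subs = dels = ins = 0
    n_ref = 0
    for ref, est in zip(reference, estimated):
        seg_fp = len(est - ref)   # classes reported but not in the reference
        seg_fn = len(ref - est)   # classes in the reference but missed
        tp += len(ref & est)
        fp += seg_fp
        fn += seg_fn
        # A miss paired with a false alarm in the same segment counts as
        # one substitution; the remainder are deletions or insertions.
        subs += min(seg_fp, seg_fn)
        dels += max(0, seg_fn - seg_fp)
        ins += max(0, seg_fp - seg_fn)
        n_ref += len(ref)

    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    er = (subs + dels + ins) / n_ref if n_ref else 0.0
    return f1, er


# Four segments; two classes overlap in the first one.
reference = [{"speech", "car"}, {"speech"}, set(), {"dog"}]
estimated = [{"speech"}, {"speech", "dog"}, set(), {"car"}]
f1, er = segment_based_scores(reference, estimated)
print(f1, er)  # 0.5 0.75
```

Counts here are pooled over all classes before the scores are computed (instance-based averaging); class-based averaging would instead compute the scores per class and then average them.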
Inherent auditory skills rather than formal music training shape the neural encoding of speech
Musical training is associated with a myriad of neuroplastic changes in the brain, including more robust and efficient neural processing of clean and degraded speech signals at brainstem and cortical levels. These assumptions stem largely from cross-sectional studies between musicians and nonmusicians which cannot address whether training itself is sufficient to induce physiological changes or whether preexisting superiority in auditory function before training predisposes individuals to pursue musical interests and appear to have similar neuroplastic benefits as musicians. Here, we recorded neuroelectric brain activity to clear and noise-degraded speech sounds in individuals without formal music training but who differed in their receptive musical perceptual abilities as assessed objectively via the Profile of Music Perception Skills. We found that listeners with naturally more adept listening skills (“musical sleepers”) had enhanced frequency-following responses to speech that were also more resilient to the detrimental effects of noise, consistent with the increased fidelity of speech encoding and speech-in-noise benefits observed previously in highly trained musicians. Further comparisons between these musical sleepers and actual trained musicians suggested that experience provides an additional boost to the neural encoding and perception of speech. Collectively, our findings suggest that the auditory neuroplasticity of music engagement likely involves a layering of both preexisting (nature) and experience-driven (nurture) factors in complex sound processing. In the absence of formal training, individuals with intrinsically proficient auditory systems can exhibit musician-like auditory function that can be further shaped in an experience-dependent manner.
Speech Processing in Autism Spectrum Disorder: An Integrative Review of Auditory Neurophysiology Findings
Purpose: Investigations into the nature of communication disorders in autistic individuals increasingly evaluate neural responses to speech stimuli. This integrative review aimed to consolidate the available data related to speech and language processing across levels of stimulus complexity (from single speech sounds to sentences) and to relate it to the current theories of autism. Method: An electronic database search identified peer-reviewed articles using event-related potentials or magnetoencephalography to investigate auditory processing from single speech sounds to sentences in autistic children and adults varying in language and cognitive abilities. Results: Atypical neural responses in autistic persons became more prominent with increasing stimulus and task complexity. Compared with their typically developing peers, autistic individuals demonstrated mostly intact sensory responses to single speech sounds, diminished spontaneous attentional orienting to spoken stimuli, specific difficulties with categorical speech sound discrimination, and reduced processing of semantic content. Atypical neural responses were more often observed in younger autistic participants and in those with concomitant language disorders. Conclusions: The observed differences in neural responses to speech stimuli suggest that communication difficulties in autistic individuals are more consistent with the reduced social interest than the auditory dysfunction explanation. Current limitations and future directions for research are also discussed.
Joint Spatio-Temporal-Frequency Representation Learning for Improved Sound Event Localization and Detection
Sound event localization and detection (SELD) is a crucial component of machine listening that aims to simultaneously identify and localize sound events in multichannel audio recordings. This task demands an integrated analysis of spatial, temporal, and frequency domains to accurately characterize sound events. The spatial domain pertains to the varying acoustic signals captured by multichannel microphones, which are essential for determining the location of sound sources. However, the majority of recent studies have focused on time-frequency correlations and spatio-temporal correlations separately, leading to inadequate performance in real-life scenarios. In this paper, we propose a novel SELD method that utilizes the newly developed Spatio-Temporal-Frequency Fusion Network (STFF-Net) to jointly learn comprehensive features across spatial, temporal, and frequency domains of sound events. The backbone of our STFF-Net is the Enhanced-3D (E3D) residual block, which combines 3D convolutions with a parameter-free attention mechanism to capture and refine the intricate correlations among these domains. Furthermore, our method incorporates the multi-ACCDOA format to effectively handle homogeneous overlaps between sound events. During the evaluation, we conduct extensive experiments on three de facto benchmark datasets, and our results demonstrate that the proposed SELD method significantly outperforms current state-of-the-art approaches.
Enhancing SELD Performance: The Role of Data Augmentation Techniques in Spatial Sound Analysis
Sound Event Localization and Detection (SELD) integrates Sound Event Detection (SED) and Direction-of-Arrival Estimation (DOAE) to recognize and localize sound events in various applications, including urban sound sensing, wildlife monitoring, and home surveillance. Recently, advancements in machine learning, particularly deep learning techniques, have demonstrated remarkable success in improving SELD performance. However, training deep learning models for SELD is challenged by the limited availability of high-quality spatial audio data, which is essential for accurate model generalization. This paper explores the effectiveness of data augmentation techniques in overcoming this limitation. We evaluate the impact of Frequency Shift (FS), Random Cutout (RC), and Channel Swapping (CS) on SELD performance using a comprehensive set of experiments. Our findings indicate that all tested augmentation combinations except FS alone significantly improve SELD performance, reducing the SELD error by approximately 8% compared to no augmentation. The differences among effective combinations are not statistically significant, suggesting that the decision to augment is more impactful than the specific combination chosen. This work highlights the critical role of data augmentation in enhancing SELD systems and suggests future research directions, including testing these techniques with different model architectures and exploring additional augmentation methods.
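The three augmentations named in this abstract can be sketched on a toy spectrogram as follows. This is a simplified illustration, not the authors' implementation: the nested-list `[frame][bin]` layout, all function names and parameters, and the two-channel swap with a mirrored azimuth label are assumptions; real SELD pipelines must transform direction labels according to the actual microphone array geometry.

```python
import random


def frequency_shift(spec, shift):
    """Shift each frame by `shift` bins along frequency, zero-padding the gap."""
    n_bins = len(spec[0])
    out = []
    for frame in spec:
        shifted = [0.0] * n_bins
        for f, value in enumerate(frame):
            if 0 <= f + shift < n_bins:
                shifted[f + shift] = value
        out.append(shifted)
    return out


def random_cutout(spec, max_frames, max_bins, rng):
    """Zero one random time-frequency rectangle; returns a modified copy."""
    n_frames, n_bins = len(spec), len(spec[0])
    t_len = rng.randint(1, min(max_frames, n_frames))
    f_len = rng.randint(1, min(max_bins, n_bins))
    t0 = rng.randint(0, n_frames - t_len)
    f0 = rng.randint(0, n_bins - f_len)
    out = [list(frame) for frame in spec]
    for t in range(t0, t0 + t_len):
        for f in range(f0, f0 + f_len):
            out[t][f] = 0.0
    return out


def swap_stereo_channels(left, right, azimuth_deg):
    """Swap a hypothetical left/right pair and mirror the azimuth label."""
    return right, left, -azimuth_deg


rng = random.Random(0)  # fixed seed so the sketch is repeatable
spec = [[1.0] * 5 for _ in range(4)]  # 4 frames x 5 frequency bins
shifted = frequency_shift(spec, 1)
masked = random_cutout(spec, max_frames=2, max_bins=3, rng=rng)
new_left, new_right, new_azimuth = swap_stereo_channels(spec, shifted, 30.0)
```

Each function returns a new spectrogram rather than modifying its input, so the augmentations can be applied alone or chained in any combination.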
Cognitive neural responses in the semantic comprehension of sound symbolic words and pseudowords
Introduction: Sound symbolism is the phenomenon of sounds having non-arbitrary meaning, and it has been demonstrated that pseudowords with sound symbolic elements have similar meaning to lexical words. It is unclear how the impression given by the sound symbolic elements is semantically processed, in contrast to lexical words with definite meanings. In event-related potential (ERP) studies, phonological mapping negativity (PMN) and N400 are often used as measures of phonological and semantic processing, respectively. Therefore, in this study, we analyze PMN and N400 to clarify the differences between existing sound symbolic words (onomatopoeia or ideophones) and pseudowords in terms of semantic and phonological processing. Methods: An existing sound symbolic word and pseudowords were presented as an auditory stimulus in combination with a picture of an event, and PMN and N400 were measured while the subjects determined whether the sound stimuli and pictures match or mismatch. Results: In both the existing word and pseudoword tasks, the amplitude of PMN and N400 increased when the picture of an event and the speech sound did not match. Additionally, compared to the existing words, the pseudowords elicited a greater amplitude for PMN and N400. In addition, PMN latency was delayed in the mismatch condition relative to the match condition for both existing sound symbolic words and pseudowords. Discussion: We concluded that established sound symbolic words and sound symbolic pseudowords undergo similar semantic processing. This finding suggests that sound symbolism pseudowords are not judged on a simple impression level (e.g., spiky/round) or activated by other words with similar spellings (phonological structures) in the lexicon, but are judged on a similar contextual basis as actual words.
Repeated Parental Singing During Kangaroo Care Improved Neural Processing of Speech Sound Changes in Preterm Infants at Term Age
Preterm birth carries a risk for adverse neurodevelopment. Cognitive dysfunctions, such as language disorders, may manifest as atypical sound discrimination already in early infancy. As infant-directed singing has been shown to enhance language acquisition in infants, we examined whether parental singing during skin-to-skin care (kangaroo care) improves speech sound discrimination in preterm infants. Forty-five preterm infants born between 26 and 33 gestational weeks (GW) and their parents participated in this cluster-randomized controlled trial (ClinicalTrials ID IRB00003181SK). In both groups, parents conducted kangaroo care during 33–40 GW. In the singing intervention group (n = 24), a certified music therapist guided parents to sing or hum during daily kangaroo care. In the control group (n = 21), parents conducted standard kangaroo care and were not instructed to use their voices. Parents in both groups reported the duration of daily intervention. Auditory event-related potentials were recorded with electroencephalogram at term age using a multi-feature paradigm consisting of phonetic and emotional speech sound changes and a one-deviant oddball paradigm with pure tones. In the multi-feature paradigm, prominent mismatch responses (MMR) were elicited to the emotional sounds and many of the phonetic deviants in the singing intervention group and in the control group to some of the emotional and phonetic deviants. A group difference was found as the MMRs were larger in the singing intervention group, mainly due to larger MMRs being elicited to the emotional sounds, especially in females. The overall duration of the singing intervention (range 15–63 days) was positively associated with the MMR amplitudes for both phonetic and emotional stimuli in both sexes, unlike the daily singing time (range 8–120 min/day). In the oddball paradigm, MMRs for the non-speech sounds were elicited in both groups, and no group differences or connections between the singing time and the response amplitudes were found. These results imply that repeated parental singing during kangaroo care improved auditory discrimination of phonetic and emotional speech sounds in preterm infants at term age. Regular singing routines can be recommended for parents to promote the development of the auditory system and auditory processing of speech sounds in preterm infants.
A deep learning framework for environmental sound classification by fusing linear and nonlinear features
Environmental Sound Classification (ESC) is a crucial research direction in audio signal processing, aiming to identify and classify specific events in ambient sounds. Traditional methods, which typically rely on time–frequency features (e.g., inverse Mel spectrogram), fail to fully capture complex temporal patterns in sound signals. To address this issue, this paper employs the Recurrence Plot (RP) to compensate for the deficiency of nonlinear features in time–frequency features. To validate the performance of the fused features and the RP in ESC, comparative experiments were conducted on four different models (e.g., CNN and GRU) and two benchmark datasets. The performance of the deep learning models is enhanced by integrating linear and nonlinear features. Furthermore, a novel SResNet architecture is proposed, which embeds an attention mechanism into the feature fusion process of ResNet-18 and incorporates the SwiGLU activation function to optimize residual blocks. The smoothing property of SwiGLU helps stabilize residual networks and accelerate convergence, enabling the capture of more intricate patterns. Experimental results demonstrate that the proposed feature fusion outperforms traditional linear feature fusion methods on both the ESC-50 and UrbanSound8K datasets, thereby validating the robustness of RP in ESC tasks. SResNet also exhibits superior performance compared to direct feature fusion. This approach of parallel feature fusion and model optimization advances environmental sound analysis, enabling more comprehensive and accurate representation of ambient sound data.
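For readers unfamiliar with the Recurrence Plot mentioned in this abstract, the sketch below builds the simplest form of one: a binary matrix marking which pairs of samples lie within a distance threshold of each other. The abstract does not give the paper's exact construction (embedding dimension, delay, or threshold), so the function name `recurrence_plot`, the scalar (non-embedded) samples, and the `eps` parameter are assumptions.

```python
def recurrence_plot(signal, eps):
    """Return an n x n 0/1 matrix with R[i][j] = 1 when |x_i - x_j| <= eps."""
    n = len(signal)
    return [
        [1 if abs(signal[i] - signal[j]) <= eps else 0 for j in range(n)]
        for i in range(n)
    ]


# Samples 0 and 2 are close to each other; sample 1 is far from both.
rp = recurrence_plot([0.0, 1.0, 0.1], eps=0.2)
for row in rp:
    print(row)
# [1, 0, 1]
# [0, 1, 0]
# [1, 0, 1]
```

The matrix is symmetric with ones on the diagonal, and it can be treated as an image, which is what makes it convenient to feed to a convolutional network alongside a spectrogram.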