165 results for "acoustic scene analysis"
An Incremental Class-Learning Approach with Acoustic Novelty Detection for Acoustic Event Recognition
Acoustic scene analysis (ASA) relies on the dynamic sensing and understanding of stationary and non-stationary sounds from various events, background noises and human actions with objects. However, the sound signals may be non-stationary in space and time, and novel events may exist that eventually degrade the performance of the analysis. In this study, a self-learning-based ASA for acoustic event recognition (AER) is presented to detect and incrementally learn novel acoustic events by tackling catastrophic forgetting. The proposed ASA framework comprises six elements: (1) raw acoustic signal pre-processing, (2) low-level and deep audio feature extraction, (3) acoustic novelty detection (AND), (4) acoustic signal augmentations, (5) incremental class-learning (ICL) (of the audio features of the novel events) and (6) AER. The self-learning on different types of audio features extracted from the acoustic signals of various events occurs without human supervision. For the extraction of deep audio representations, in addition to visual geometry group (VGG) and residual neural network (ResNet), time-delay neural network (TDNN) and TDNN-based long short-term memory (TDNN–LSTM) networks are pre-trained using a large-scale audio dataset, Google AudioSet. The performance of ICL with AND is validated on benchmark audio datasets such as ESC-10, ESC-50, UrbanSound8K (US8K), and an audio dataset collected by the authors in a real domestic environment, using Mel-spectrograms as well as deep features extracted from them with TDNNs, VGG, and ResNet.
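As a hedged illustration of the novelty-detection element (3), the sketch below flags clips whose features an autoencoder trained on known event classes reconstructs poorly; the feature dimensions, the stand-in reconstruction function, and the 95th-percentile threshold are assumptions for the example, not the paper's configuration.
```python
import numpy as np

def novelty_scores(features: np.ndarray, reconstructions: np.ndarray) -> np.ndarray:
    """Mean squared reconstruction error per clip; high error suggests a novel event."""
    return np.mean((features - reconstructions) ** 2, axis=1)

# Toy data: pooled log-Mel-style features of known events vs. shifted "novel" clips.
known = np.random.randn(100, 128)
novel = known + 3.0 * np.random.randn(100, 128)
mean_known = known.mean(axis=0)
recon = lambda x: 0.7 * x + 0.3 * mean_known      # stand-in for a trained autoencoder
threshold = np.percentile(novelty_scores(known, recon(known)), 95)
print((novelty_scores(novel, recon(novel)) > threshold).mean())  # most shifted clips are flagged
```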
Sound recurrence analysis for acoustic scene classification
In everyday life, people experience different soundscapes in which natural sounds, animal noises, and man-made sounds blend together. Although there have been several studies on the importance of recurring sound patterns in music and language, the relevance of this phenomenon in natural soundscapes is still largely unexplored. In this article, we study the repetition patterns of harmonic and transient sound events as potential cues for acoustic scene classification (ASC). In the first part of our study, our aim is to identify acoustic scene classes that exhibit characteristic sound repetition patterns concerning harmonic and transient sounds. We propose three metrics to measure the overall prevalence of sound repetitions as well as their repetition periods and temporal stability. In the second part, we evaluate three strategies to incorporate self-similarity matrices as an additional input feature to a convolutional neural network architecture for ASC. We observe the characteristic repetition of transient sounds in recordings of “park” and “street traffic” as well as harmonic sound repetitions in acoustic scene classes related to public transportation. In the ASC experiments, hybrid network architectures, which combine spectrogram features and features from sound recurrence analysis, show increased accuracy for those classes with prominent sound repetition patterns. Our findings provide additional perspective on the distinctions among acoustic scenes previously primarily ascribed in the literature to their spectral features.
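The self-similarity matrices used as an additional CNN input above can be illustrated with a small sketch: a cosine-similarity SSM over spectrogram frames. This is only indicative of the general recurrence feature, not the authors' exact harmonic/transient pipeline.
```python
import numpy as np

def self_similarity_matrix(spec: np.ndarray) -> np.ndarray:
    """spec: (n_bins, n_frames) magnitude spectrogram -> (n_frames, n_frames) SSM."""
    frames = spec.T                                    # one row per time frame
    norms = np.linalg.norm(frames, axis=1, keepdims=True) + 1e-12
    unit = frames / norms                              # L2-normalise each frame
    return unit @ unit.T                               # cosine similarity in [-1, 1]

# Example: a random "spectrogram" stands in for a harmonic or transient representation.
ssm = self_similarity_matrix(np.abs(np.random.randn(128, 400)))
print(ssm.shape)  # (400, 400); stripes parallel to the diagonal indicate repeating sounds
```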
Acoustic scene classification using inter- and intra-subarray spatial features in distributed microphone array
In this study, we investigate the effectiveness of spatial features in acoustic scene classification using distributed microphone arrays. Under the assumption that multiple subarrays, each equipped with microphones, are synchronized, we investigate two types of spatial feature: intra- and inter-generalized cross-correlation phase transforms (GCC-PHATs). These are derived from channels within the same subarray and between different subarrays, respectively. Our approach treats the log-Mel spectrogram as a spectral feature and intra- and/or inter-GCC-PHAT as a spatial feature. We propose two integration methods for spectral and spatial features: (a) middle integration, which fuses embeddings obtained by spectral and spatial features, and (b) late integration, which fuses decisions estimated using spectral and spatial features. The evaluation experiments showed that, when using only spectral features, employing all channels did not markedly improve the F1-score compared with the single-channel case. In contrast, integrating both spectral and spatial features improved the F1-score compared with using only spectral features. Additionally, we confirmed that the F1-score for late integration was slightly higher than that for middle integration.
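For readers unfamiliar with the spatial feature, a minimal GCC-PHAT sketch is given below; it assumes two already-synchronized channels and is not the authors' implementation.
```python
import numpy as np

def gcc_phat(x: np.ndarray, y: np.ndarray, n_fft: int = 1024) -> np.ndarray:
    """Return the GCC-PHAT cross-correlation of two equal-length signals."""
    X = np.fft.rfft(x, n=n_fft)
    Y = np.fft.rfft(y, n=n_fft)
    cps = X * np.conj(Y)                       # cross-power spectrum
    cps /= np.abs(cps) + 1e-12                 # PHAT weighting: keep phase only
    return np.fft.irfft(cps, n=n_fft)          # peak location ~ inter-channel delay

sig = np.random.randn(1024)
delayed = np.roll(sig, 5)                      # simulate a 5-sample propagation delay
corr = gcc_phat(delayed, sig)
print(int(np.argmax(corr)))                    # 5
```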
End-to-end training of acoustic scene classification using distributed sound-to-light conversion devices: verification through simulation experiments
We propose a framework for classifying acoustic scenes utilizing distributed sound sensor devices capable of sound-to-light conversion, which we term Blinkies. These Blinkies can convert acoustic signals into varying intensities of light via an inbuilt light-emitting diode. By using Blinkies, we can aggregate the spatial acoustic information across a wide region by recording the fluctuating light intensities of numerous Blinkies distributed throughout the region. Nonetheless, the signal communicated is subject to the bandwidth limitation imposed by the frame rate of the video camera, typically capped at 30 frames per second. Our objective is to refine the process of transforming sound into light for the purpose of acoustic scene classification within these bandwidth confines. While traversing the air, a light signal is affected by inherent physical limitations such as the attenuation of light and interference from noise. To account for these factors, we have integrated these physical constraints into differentiable physical layers. This approach enables us to jointly train a pair of deep neural networks for the conversion of sound to light and for the classification of acoustic scenes. Our simulation studies, which employed the SINS database for acoustic scene classification, demonstrated that our proposed framework outperforms the previous one that utilized Blinkies. These findings emphasize the effectiveness of Blinkies in the field of acoustic scene classification.
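As a rough illustration of the bandwidth constraint, the sketch below implements a fixed (non-trainable) sound-to-light mapping that reduces audio to a 30 frames-per-second power envelope; the proposed method instead learns this conversion jointly with the classifier, so treat this only as a simplified stand-in.
```python
import numpy as np

def sound_to_light(audio: np.ndarray, sr: int = 16000, fps: int = 30) -> np.ndarray:
    hop = sr // fps                             # audio samples per video frame
    n_frames = len(audio) // hop
    frames = audio[: n_frames * hop].reshape(n_frames, hop)
    power = np.mean(frames ** 2, axis=1)        # short-time power per video frame
    return power / (power.max() + 1e-12)        # normalised LED intensity in [0, 1]

light = sound_to_light(np.random.randn(16000))
print(light.shape)  # (30,) intensity values for one second of audio
```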
Investigations on higher-order spherical harmonic input features for deep learning-based multiple speaker detection and localization
In this paper, a detailed investigation of deep learning-based speaker detection and localization (SDL) with higher-order Ambisonics signals is conducted. Different spherical harmonic (SH) input features such as the higher-order pseudointensity vector (HO-PIV), relative harmonic coefficients (RHCs), and the spatially-localized pseudointensity vector (SL-PIV), a feature proposed for the first time as an input feature for deep learning-based SDL, are examined using first- to fourth-order SH signals. The trained neural networks, optimized with a single loss function for the combined tasks of detection and localization, are then evaluated in detail for overall SDL performance as well as their performance in the sub-tasks of detection and, particularly, localization. The results are further analyzed as a function of room reverberation, signal-to-interference ratio (SIR), and the number of and distances between multiple simultaneously active speakers, utilizing both simulated and measured data. The findings indicate an overall improvement in SDL performance up to third-order Ambisonics for all investigated features, while using fourth-order signals yields no further improvement and sometimes even delivers worse results. Notably, the HO-PIV and the SL-PIV, both extensions of the first-order pseudointensity vector (FO-PIV), have proven to be suitable input features. In particular, the newly proposed SL-PIV has been found to be the best of the investigated features on third- and fourth-order Ambisonics signals, especially in the most demanding scenarios on measured data, with multiple, closely located speakers and poor SIR.
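A brief sketch of the first-order pseudointensity vector (FO-PIV), which the HO-PIV and SL-PIV features extend, may help; it is computed per time-frequency bin from first-order Ambisonics (B-format) STFT channels W, X, Y, Z and follows the standard textbook definition rather than the paper's higher-order variants.
```python
import numpy as np

def fo_piv(W: np.ndarray, X: np.ndarray, Y: np.ndarray, Z: np.ndarray) -> np.ndarray:
    """Inputs: complex STFT coefficients of shape (freq, time).
    Returns a (freq, time, 3) array whose direction points toward the dominant source."""
    piv = np.stack([np.real(np.conj(W) * X),
                    np.real(np.conj(W) * Y),
                    np.real(np.conj(W) * Z)], axis=-1)
    return piv / (np.linalg.norm(piv, axis=-1, keepdims=True) + 1e-12)

# Toy example with random STFTs standing in for real B-format recordings.
shape = (257, 100)
rand = lambda: np.random.randn(*shape) + 1j * np.random.randn(*shape)
print(fo_piv(rand(), rand(), rand(), rand()).shape)  # (257, 100, 3)
```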
Improving multi-talker binaural DOA estimation by combining periodicity and spatial features in convolutional neural networks
Deep neural network-based direction of arrival (DOA) estimation systems often rely on spatial features as input to learn a mapping for estimating the DOA of multiple talkers. Aiming to improve the accuracy of multi-talker DOA estimation for binaural hearing aids with a known number of active talkers, we investigate the usage of periodicity features as a footprint of speech signals in combination with spatial features as input to a convolutional neural network (CNN). In particular, we propose a multi-talker DOA estimation system employing a two-stage CNN architecture that utilizes cross-power spectrum (CPS) phase as spatial features and an auditory-inspired periodicity feature called periodicity degree (PD) as spectral features. The two-stage CNN incorporates a PD feature reduction stage prior to the joint processing of PD and CPS phase features. We investigate different design choices for the CNN architecture, including varying temporal reduction strategies and spectro-temporal filtering approaches. The performance of the proposed system is evaluated in static source scenarios with 2–3 talkers in two reverberant environments under varying signal-to-noise ratios using recorded background noises. To evaluate the benefit of combining PD features with CPS phase features, we consider baseline systems that utilize either only CPS phase features or combine CPS phase and magnitude spectrogram features. Results show that combining PD and CPS phase features in the proposed system consistently improves DOA estimation accuracy across all conditions, outperforming the two baseline systems. Additionally, the PD feature reduction stage in the proposed system improves DOA estimation accuracy while significantly reducing computational complexity compared to a baseline system without this stage, demonstrating its effectiveness for multi-talker DOA estimation.
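The periodicity degree (PD) feature above comes from an auditory model; as a much-simplified, hedged stand-in, the sketch below scores per-frame periodicity as the peak of the normalised autocorrelation within a speech pitch-lag range (an assumption for illustration, not the feature used in the paper).
```python
import numpy as np

def periodicity(frame: np.ndarray, sr: int = 16000,
                f_min: float = 60.0, f_max: float = 400.0) -> float:
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac /= ac[0] + 1e-12                          # normalise so lag 0 equals 1
    lo, hi = int(sr / f_max), int(sr / f_min)    # candidate pitch lags
    return float(ac[lo:hi].max())

t = np.arange(480) / 16000                       # one 30 ms frame at 16 kHz
print(periodicity(np.sin(2 * np.pi * 150 * t)))  # high for a periodic (voiced-like) tone
print(periodicity(np.random.randn(480)))         # much lower for noise-like input
```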
DOA-informed switching independent vector extraction and beamforming for speech enhancement in underdetermined situations
This paper proposes novel methods for extracting a single Speech signal of Interest (SOI) from a multichannel observed signal in underdetermined situations, i.e., when the observed signal contains more speech signals than microphones. It focuses on extracting the SOI using prior knowledge of the SOI’s Direction of Arrival (DOA). Conventional beamformers (BFs) and Blind Source Separation (BSS) with spatial regularization struggle to suppress interference speech signals in such situations. Although Switching Minimum Power Distortionless Response BF (Sw-MPDR) can handle underdetermined situations using a switching mechanism, its estimation accuracy significantly decreases when it relies on a steering vector determined by the SOI’s DOA. Spatially-Regularized Independent Vector Extraction (SRIVE) can robustly enhance the SOI based solely on its DOA using spatial regularization, but its performance degrades in underdetermined situations. This paper extends these conventional methods to overcome their limitations. First, we introduce a time-varying Gaussian (TVG) source model to Sw-MPDR to effectively enhance the SOI based solely on the DOA. Second, we introduce the switching mechanism to SRIVE to improve its speech enhancement performance in underdetermined situations. These two proposed methods are called Switching weighted MPDR (Sw-wMPDR) and Switching SRIVE (Sw-SRIVE). We experimentally demonstrate that both surpass conventional methods in enhancing the SOI using the DOA in underdetermined situations.
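For context, a plain (non-switching) MPDR beamformer for a single frequency bin, which Sw-MPDR and the proposed Sw-wMPDR build on, can be sketched as follows; the covariance estimate, steering vector, and diagonal loading here are toy assumptions.
```python
import numpy as np

def mpdr_weights(R: np.ndarray, d: np.ndarray) -> np.ndarray:
    """R: (M, M) spatial covariance of the observed mixture; d: (M,) steering vector."""
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (np.conj(d) @ Rinv_d)        # unit gain toward the DOA, minimum power elsewhere

M, T = 4, 200
obs = (np.random.randn(M, T) + 1j * np.random.randn(M, T)) / np.sqrt(2)
R = obs @ obs.conj().T / T + 1e-3 * np.eye(M)    # diagonal loading for numerical stability
d = np.ones(M, dtype=complex)                    # toy steering vector for a broadside DOA
w = mpdr_weights(R, d)
enhanced = np.conj(w) @ obs                      # beamformer output for this frequency bin
print(enhanced.shape)  # (200,)
```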
An Unsupervised Deep Learning System for Acoustic Scene Analysis
Acoustic scene analysis has attracted a lot of attention recently. Existing methods are mostly supervised, which requires well-predefined acoustic scene categories and accurate labels. In practice, there exists a large amount of unlabeled audio data, but labeling large-scale data is not only costly but also time-consuming. Unsupervised acoustic scene analysis on the other hand does not require manual labeling but is known to have significantly lower performance and therefore has not been well explored. In this paper, a new unsupervised method based on deep auto-encoder networks and spectral clustering is proposed. It first extracts a bottleneck feature from the original acoustic feature of audio clips by an auto-encoder network, and then employs spectral clustering to further reduce the noise and unrelated information in the bottleneck feature. Finally, it conducts hierarchical clustering on the low-dimensional output of the spectral clustering. To fully utilize the spatial information of stereo audio, we further apply the binaural representation and conduct joint clustering on that. To the best of our knowledge, this is the first time that a binaural representation is being used in unsupervised learning. Experimental results show that the proposed method outperforms the state-of-the-art competing methods.
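A hedged sketch of the clustering back-end described above (a spectral embedding of autoencoder bottleneck features followed by hierarchical clustering) is shown below; random vectors stand in for the bottleneck codes, and scikit-learn components stand in for the authors' implementation.
```python
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import AgglomerativeClustering

bottleneck = np.random.randn(200, 64)            # 200 audio clips x 64-d bottleneck codes
low_dim = SpectralEmbedding(n_components=8).fit_transform(bottleneck)   # noise reduction
labels = AgglomerativeClustering(n_clusters=10).fit_predict(low_dim)    # hierarchical clustering
print(labels.shape, labels.min(), labels.max())  # (200,) 0 9
```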
Statistics of natural reverberation enable perceptual separation of sound and space
In everyday listening, sound reaches our ears directly from a source as well as indirectly via reflections known as reverberation. Reverberation profoundly distorts the sound from a source, yet humans can both identify sound sources and distinguish environments from the resulting sound, via mechanisms that remain unclear. The core computational challenge is that the acoustic signatures of the source and environment are combined in a single signal received by the ear. Here we ask whether our recognition of sound sources and spaces reflects an ability to separate their effects and whether any such separation is enabled by statistical regularities of real-world reverberation. To first determine whether such statistical regularities exist, we measured impulse responses (IRs) of 271 spaces sampled from the distribution encountered by humans during daily life. The sampled spaces were diverse, but their IRs were tightly constrained, exhibiting exponential decay at frequency-dependent rates: Mid frequencies reverberated longest whereas higher and lower frequencies decayed more rapidly, presumably due to absorptive properties of materials and air. To test whether humans leverage these regularities, we manipulated IR decay characteristics in simulated reverberant audio. Listeners could discriminate sound sources and environments from these signals, but their abilities degraded when reverberation characteristics deviated from those of real-world environments. Subjectively, atypical IRs were mistaken for sound sources. The results suggest the brain separates sound into contributions from the source and the environment, constrained by a prior on natural reverberation. This separation process may contribute to robust recognition while providing information about spaces around us.
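The frequency-dependent exponential decay described above is commonly summarised by band-wise reverberation times; the sketch below estimates a single-band RT60 via standard Schroeder backward integration (not the paper's analysis pipeline), and applying it per octave band would yield the frequency-dependent decay rates.
```python
import numpy as np

def rt60_schroeder(ir: np.ndarray, sr: int = 48000) -> float:
    edc = np.cumsum(ir[::-1] ** 2)[::-1]              # Schroeder energy decay curve
    edc_db = 10 * np.log10(edc / edc[0] + 1e-12)
    # Fit a line between -5 dB and -25 dB and extrapolate to -60 dB (T20 method).
    idx = np.where((edc_db <= -5) & (edc_db >= -25))[0]
    t = idx / sr
    slope, _ = np.polyfit(t, edc_db[idx], 1)          # decay rate in dB per second (negative)
    return -60.0 / slope

# Synthetic exponentially decaying IR with an RT60 of roughly 0.5 s.
sr = 48000
t = np.arange(sr) / sr
ir = np.random.randn(sr) * 10 ** (-3 * t / 0.5)       # 60 dB energy drop over 0.5 s
print(round(rt60_schroeder(ir, sr), 2))               # approximately 0.5
```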
Is predictability salient? A study of attentional capture by auditory patterns
In this series of behavioural and electroencephalography (EEG) experiments, we investigate the extent to which repeating patterns of sounds capture attention. Work in the visual domain has revealed attentional capture by statistically predictable stimuli, consistent with predictive coding accounts which suggest that attention is drawn to sensory regularities. Here, stimuli comprised rapid sequences of tone pips, arranged in regular (REG) or random (RAND) patterns. EEG data demonstrate that the brain rapidly recognizes predictable patterns manifested as a rapid increase in responses to REG relative to RAND sequences. This increase is reminiscent of the increase in gain on neural responses to attended stimuli often seen in the neuroimaging literature, and thus consistent with the hypothesis that predictable sequences draw attention. To study potential attentional capture by auditory regularities, we used REG and RAND sequences in two different behavioural tasks designed to reveal effects of attentional capture by regularity. Overall, the pattern of results suggests that regularity does not capture attention. This article is part of the themed issue ‘Auditory and visual scene analysis’.