Catalogue Search | MBRL
Search Results
Explore the vast range of titles available.
25 result(s) for "spectro‐temporal processing"
A surgeon‐scientist's perspective and review of cognitive‐linguistic contributions to adult cochlear implant outcomes
Objective(s): Enormous variability in speech recognition outcomes persists in adults who receive cochlear implants (CIs), creating a barrier to progress in predicting outcomes before surgery, explaining “poor” outcomes, and determining how to provide tailored rehabilitation therapy for individual CI users. The primary goal of my research program over the past nine years has been to extend our understanding of the contributions of “top‐down” cognitive‐linguistic skills to CI outcomes in adults, acknowledging that “bottom‐up” sensory processes also contribute substantially. The main objective of this invited narrative review is to provide an overview of this work. A secondary objective is to provide career “guidance points” to budding surgeon‐scientists in Otolaryngology. Methods: A narrative, chronological review covers work done by our group to explore top‐down and bottom‐up processing in adult CI outcomes. A set of ten guidance points is also provided to assist junior Otolaryngology surgeon‐scientists. Results: Work in our lab has identified substantial contributions of cognitive skills (working memory, inhibition‐concentration, speed of lexical access, nonverbal reasoning, verbal learning and memory) as well as linguistic abilities (acoustic cue‐weighting, phonological sensitivity) to speech recognition outcomes in adults with CIs. These top‐down skills interact with the quality of the bottom‐up input. Conclusion: Although progress has been made in understanding speech recognition variability in adult CI users, future work is needed to predict CI outcomes before surgery, to identify particular patients' strengths and weaknesses, and to tailor rehabilitation approaches for individual CI users. Level of Evidence: 4
Journal Article
Corrigendum: Auditory tests for characterizing hearing deficits in listeners with various hearing abilities: The BEAR test battery
by El-Haj-Ali, Mouhamad; Dau, Torsten; Bianchi, Federica
in audiology, binaural processing, hearing loss
2023
[This corrects the article DOI: 10.3389/fnins.2021.724007.].
Journal Article
Auditory Tests for Characterizing Hearing Deficits in Listeners With Various Hearing Abilities: The BEAR Test Battery
by El-Haj-Ali, Mouhamad; Dau, Torsten; Bianchi, Federica
in audiology, binaural processing, hearing loss
2021
The Better hEAring Rehabilitation (BEAR) project aims to provide a new clinical profiling tool—a test battery—for hearing loss characterization. Although the loss of sensitivity can be efficiently measured using pure-tone audiometry, the assessment of supra-threshold hearing deficits remains a challenge. In contrast to the classical “attenuation-distortion” model, the proposed BEAR approach is based on the hypothesis that the hearing abilities of a given listener can be characterized along two dimensions, reflecting independent types of perceptual deficits (distortions). A data-driven approach provided evidence for the existence of different auditory profiles with different degrees of distortions. Ten tests were included in a test battery, based on their clinical feasibility, time efficiency, and related evidence from the literature. The tests were divided into six categories: audibility, speech perception, binaural processing abilities, loudness perception, spectro-temporal modulation sensitivity, and spectro-temporal resolution. Seventy-five listeners with symmetric, mild-to-severe sensorineural hearing loss were selected from a clinical population. The analysis of the results showed interrelations among outcomes related to high-frequency processing and outcome measures related to low-frequency processing abilities. The results showed the ability of the tests to reveal differences among individuals and their potential use in clinical settings.
Journal Article
Enhanced spectro-temporal feature extraction for prosthetic control using variational mode decomposition
2025
Rehabilitation systems play a vital role in improving the lifestyle of amputees or individuals with congenitally deficient limbs. However, the pattern recognition techniques currently used in rehabilitation systems are highly sensitive to noise, and inadequate feature extraction makes similar gestures difficult to distinguish. Moreover, previously proposed techniques have rarely been implemented on amputees. To address the issue of high sensitivity to noise, this study proposes a new spectro-temporal feature set for the enhanced implementation of rehabilitation systems. The efficacy of the proposed feature extraction technique was evaluated by comparing it with four other feature sets to assess broader applicability. The Variational Mode Decomposition (VMD) of electromyography signals decomposes the signal into multiple variational mode functions (VMFs). A singular value representation is obtained from each VMF, each with a distinct spectral band, using Singular Value Decomposition (SVD). The feature vectors employed consisted of a time-domain feature set, two spectro-temporal feature sets utilizing VMD and Empirical Mode Decomposition (EMD), and two combined feature sets. The results indicated that the proposed feature extraction technique outperforms the remaining feature sets with an accuracy of 97% for Dataset I, which comprises EMG recordings from 10 healthy subjects performing four hand and wrist motions. Statistical analysis revealed overall significance among all feature vectors (p-value < 0.05). The generalizability of the proposed technique was evaluated using Ninapro DB2 for healthy subjects with 49 motions, DB3 for amputees with 52 motions, and DB8 for both healthy and amputee subjects with 9 motions. The accuracies obtained for DB2, DB3, and DB8 were 91.43%, 95.66%, and 98.16%, respectively. The analysis of the optimum number of VMFs revealed that 6 VMFs provide a reasonable tradeoff between accuracy and execution time.
Journal Article
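For readers of the entry above, here is a minimal sketch of a VMD-then-SVD feature pipeline. It assumes the third-party `vmdpy` package; taking singular values of each mode's STFT magnitude is one plausible reading of the paper's "singular value representation", and all parameter values are illustrative, not the authors' settings.

```python
# Sketch: VMD-based spectro-temporal features for EMG gesture recognition.
# Assumes the third-party `vmdpy` package (pip install vmdpy); parameters
# below are illustrative, not the paper's actual settings.
import numpy as np
from scipy.signal import stft
from vmdpy import VMD

def vmd_svd_features(emg, fs=2000, n_modes=6):
    """Decompose an EMG window into VMFs, then summarize each mode's
    spectrogram by its leading singular values."""
    # VMD(f, alpha, tau, K, DC, init, tol) -> modes, spectra, center freqs
    modes, _, _ = VMD(emg, alpha=2000, tau=0.0, K=n_modes,
                      DC=0, init=1, tol=1e-7)
    feats = []
    for mode in modes:                      # one VMF per distinct spectral band
        _, _, Z = stft(mode, fs=fs, nperseg=128)
        sv = np.linalg.svd(np.abs(Z), compute_uv=False)
        feats.append(sv[:3])                # top singular values as features
    return np.concatenate(feats)

features = vmd_svd_features(np.random.randn(4000))  # stand-in for a real EMG window
print(features.shape)                                # (18,) = 6 modes x 3 values
```

The paper's finding that 6 VMFs balance accuracy and execution time maps onto the `n_modes` parameter here.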
Temporal Resolution Needed for Auditory Communication: Measurement With Mosaic Speech
by Remijn, Gerard B.; Matsuda, Mizuki; Nakajima, Yoshitaka
in Auditory communication, Communication, Frequency
2018
Temporal resolution needed for Japanese speech communication was measured. A new experimental paradigm that can reflect the spectro-temporal resolution necessary for healthy listeners to perceive speech is introduced. As a first step, we report listeners' intelligibility scores of Japanese speech with a systematically degraded temporal resolution, so-called "mosaic speech": speech mosaicized in the coordinates of time and frequency. The results of two experiments show that mosaic speech cut into short static segments was almost perfectly intelligible with a temporal resolution of 40 ms or finer. Intelligibility dropped for a temporal resolution of 80 ms, but was still around the 50%-correct level. The data are in line with previous results showing that speech signals separated into short temporal segments of <100 ms can be remarkably robust in terms of linguistic-content perception against drastic manipulations in each segment, such as partial signal omission or temporal reversal. The human perceptual system thus can extract meaning from unexpectedly rough temporal information in speech. The process resembles that of the visual system stringing together static movie frames of ~40 ms into vivid motion.
Journal Article
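A minimal sketch of the mosaicization described in the entry above: the spectrogram is divided into a time-by-frequency grid and each cell's power is flattened to its mean. Block sizes and resynthesis with the original phase are illustrative assumptions, not the authors' exact procedure.

```python
# Sketch: "mosaicizing" speech on a time-frequency grid. One plausible
# reading of the paradigm; parameters are illustrative.
import numpy as np
from scipy.signal import stft, istft

def mosaicize(x, fs, seg_ms=40, n_bands=16):
    nperseg = 256
    f, t, Z = stft(x, fs=fs, nperseg=nperseg)
    power = np.abs(Z) ** 2
    # Block widths: seg_ms in time, f.size/n_bands bins in frequency
    hop_s = (nperseg // 2) / fs                      # default 50% overlap
    t_blk = max(1, int(round(seg_ms / 1000 / hop_s)))
    f_blk = max(1, f.size // n_bands)
    mosaic = np.empty_like(power)
    for i in range(0, f.size, f_blk):
        for j in range(0, t.size, t_blk):
            blk = power[i:i+f_blk, j:j+t_blk]
            mosaic[i:i+f_blk, j:j+t_blk] = blk.mean()  # flatten each cell
    # Resynthesize from the mosaicized magnitude and the original phase
    Z_mos = np.sqrt(mosaic) * np.exp(1j * np.angle(Z))
    _, y = istft(Z_mos, fs=fs, nperseg=nperseg)
    return y
```

Raising `seg_ms` from 40 to 80 reproduces the degradation step at which the study reports intelligibility dropping to around the 50%-correct level.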
ASTDT: an Interpretable Adaptive Spectro-Temporal Diffusion Transformer for audio deepfake detection
by Qadri, Syed Asif Ahmad; Ashraf, Arselan; Wani, Taiba Maijd
in Adaptive Spectro-Temporal Diffusion Transformer (ASTDT), Advances in Information Forensics and Security, Artificial intelligence
2025
Advances in audio synthesis techniques have led to the creation of highly realistic audio deepfakes, posing growing threats to digital integrity and public trust. These synthetic manipulations mimic natural speech with high fidelity, making detection increasingly challenging and fueling the spread of misinformation, identity fraud, and voice-based attacks. To address these concerns, this study proposes the Adaptive Spectro-Temporal Diffusion Transformer (ASTDT), a novel detection framework that tackles key challenges in generalization, interpretability, and adaptability across diverse audio generation techniques. ASTDT integrates a score-based diffusion model to augment training spectrograms with realistic deepfake variations, improving generalization to unseen text-to-speech and voice conversion attacks. An adaptive spectro-temporal feature extraction mechanism partitions audio into interpretable frequency and temporal segments, while a dual-modal attention fusion module jointly processes magnitude and phase features. These fused features are processed by a transformer encoder with diffusion-aware attention, enabling effective modeling of long-range temporal dependencies. To enhance transparency, ASTDT includes an interpretability module that combines quantitative feature attributions and spatial heatmaps to explain model predictions. Experimental results across four benchmark datasets demonstrate the effectiveness of ASTDT, with the model achieving the lowest equal error rate of 1.20% on the ASVspoof 2019 dataset.
Journal Article
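As an illustration of the dual-modal front end the entry above describes, here is a sketch that partitions magnitude and phase spectrograms into an interpretable grid of segments. The segment counts and pooling are assumptions for illustration; the diffusion augmentation and transformer stages are omitted entirely.

```python
# Sketch: magnitude + phase "views" partitioned into spectro-temporal
# segments, as input to a fusion module. Settings are illustrative.
import numpy as np
from scipy.signal import stft

def dual_modal_views(x, fs=16000, n_time_seg=8, n_freq_seg=8):
    _, _, Z = stft(x, fs=fs, nperseg=512)
    mag = np.log1p(np.abs(Z))             # magnitude view (log-compressed)
    phase = np.angle(Z)                   # phase view (radians)

    def to_patches(S):
        # Partition a view into an n_freq_seg x n_time_seg grid of segments
        fb = S.shape[0] // n_freq_seg
        tb = S.shape[1] // n_time_seg
        S = S[:fb * n_freq_seg, :tb * n_time_seg]
        return S.reshape(n_freq_seg, fb, n_time_seg, tb).mean(axis=(1, 3))

    return to_patches(mag), to_patches(phase)   # two 8x8 token grids

mag_p, ph_p = dual_modal_views(np.random.randn(32000))  # 2 s stand-in clip
print(mag_p.shape, ph_p.shape)                           # (8, 8) (8, 8)
```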
Temporal selectivity declines in the aging human auditory cortex
2020
Current models successfully describe the auditory cortical response to natural sounds with a set of spectro-temporal features. However, these models have hardly been linked to the ill-understood neurobiological changes that occur in the aging auditory cortex. Modelling the hemodynamic response to a rich natural sound mixture in N = 64 listeners of varying age, we here show that in older listeners’ auditory cortex, the key feature of temporal rate is represented with a markedly broader tuning. This loss of temporal selectivity is most prominent in primary auditory cortex and planum temporale, with no such changes in adjacent auditory or other brain areas. Amongst older listeners, we observe a direct relationship between chronological age and temporal-rate tuning, unconfounded by auditory acuity or model goodness of fit. In line with senescent neural dedifferentiation more generally, our results highlight decreased selectivity to temporal information as a hallmark of the aging auditory cortex.

It can often be difficult for an older person to understand what someone is saying, particularly in noisy environments. Exactly how and why this age-related change occurs is not clear, but it is thought that older individuals may become less able to tune in to certain features of sound. Newer tools are making it easier to study age-related changes in hearing in the brain. For example, functional magnetic resonance imaging (fMRI) can allow scientists to ‘see’ and measure how certain parts of the brain react to different features of sound. Using fMRI data, researchers can compare how younger and older people process speech. They can also track how speech processing in the brain changes with age. Now, Erb et al. show that older individuals have a harder time tuning into the rhythm of speech. In the experiments, 64 people between the ages of 18 and 78 were asked to listen to speech in a noisy setting while they underwent fMRI. The researchers then tested a computer model using the data. In the older individuals, the brain’s tuning to the timing or rhythm of speech was broader, while the younger participants were more able to finely tune into this feature of sound. The older a person was, the less able their brain was to distinguish rhythms in speech, likely making it harder to understand what had been said. This hearing change likely occurs because brain cells become less specialised over time, which can contribute to many kinds of age-related cognitive decline. This new information about why understanding speech becomes more difficult with age may help scientists develop better hearing aids that are individualised to a person’s specific needs.
Journal Article
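To make the notion of "broader tuning" in the entry above concrete, here is a toy sketch that quantifies temporal-rate tuning width by fitting a Gaussian over log-spaced modulation rates. The Gaussian parameterization and the rate grid are illustrative assumptions, a simplified stand-in for the study's voxelwise encoding models.

```python
# Sketch: temporal-rate tuning width as the fitted Gaussian width over
# log2(rate). Broader width = less selective. Illustrative only.
import numpy as np
from scipy.optimize import curve_fit

rates = np.logspace(0, 5, 20, base=2)            # 1-32 Hz temporal rates

def gaussian_tuning(log_rate, gain, best, width):
    return gain * np.exp(-0.5 * ((log_rate - best) / width) ** 2)

def tuning_width(responses):
    """Fit a tuning curve to responses at `rates`; return its width."""
    p, _ = curve_fit(gaussian_tuning, np.log2(rates), responses,
                     p0=[responses.max(), 2.0, 1.0], maxfev=5000)
    return p[2]

# Toy comparison: a sharply vs. a broadly tuned "voxel"
young = gaussian_tuning(np.log2(rates), 1.0, 2.0, 0.5)
old = gaussian_tuning(np.log2(rates), 1.0, 2.0, 1.5)
print(tuning_width(young), tuning_width(old))    # ~0.5 vs ~1.5
```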
Exploiting spectro-temporal locality in deep learning based acoustic event detection
2015
In recent years, deep learning has not only permeated the computer vision and speech recognition research fields but also fields such as acoustic event detection (AED). One of the aims of AED is to detect and classify non-speech acoustic events occurring in conversation scenes, including those produced by both humans and the objects that surround us. In AED, deep learning has enabled modeling of detail-rich features, and among these, high resolution spectrograms have shown a significant advantage over existing predefined features (e.g., Mel-filter bank) that compress and reduce detail. In this paper, we further assess the importance of feature extraction for deep learning-based acoustic event detection. AED based on spectrogram-input deep neural networks exploits the fact that sounds have "global" spectral patterns, but sounds also have "local" properties such as being more transient or smoother in the time-frequency domain. These can be exposed by adjusting the time-frequency resolution used to compute the spectrogram, or by using a model that exploits locality, leading us to explore two different feature extraction strategies in the context of deep learning: (1) using multiple resolution spectrograms simultaneously and analyzing the overall and event-wise influence to combine the results, and (2) introducing the use of convolutional neural networks (CNN), a state-of-the-art 2D feature extraction model that exploits local structures, with log power spectrogram input for AED. An experimental evaluation shows that the approaches we describe outperform our state-of-the-art deep learning baseline with a noticeable gain in the CNN case and provide insights regarding CNN-based spectrogram characterization for AED.
Journal Article
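A minimal sketch of the first strategy in the entry above: computing log power spectrograms at several time-frequency resolutions for the same signal. The specific window lengths are illustrative, not the paper's settings.

```python
# Sketch: multi-resolution log power spectrograms for AED front ends.
# Short windows expose transient ("local") structure; long windows
# sharpen spectral ("global") patterns. Window lengths are illustrative.
import numpy as np
from scipy.signal import stft

def multi_res_log_spectrograms(x, fs=16000, win_lengths=(256, 512, 1024)):
    specs = []
    for nperseg in win_lengths:
        _, _, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
        specs.append(np.log(np.abs(Z) ** 2 + 1e-10))  # log power, floored
    return specs   # one spectrogram per resolution, fed to the classifier

for S in multi_res_log_spectrograms(np.random.randn(16000)):  # 1 s stand-in
    print(S.shape)   # frequency-bin count grows as window length grows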
ATA-MSTF-Net: An Audio Texture-Aware MultiSpectro-Temporal Attention Fusion Network
by Su, Yubo; Wang, Zhaoguo; Wang, Haolin
in Acoustic properties, Algorithms, anomalous sound detection
2025
Unsupervised anomalous sound detection (ASD) models the normal sounds of machinery through classification operations, thereby identifying anomalies by quantifying deviations. Most recent approaches adopt depthwise separable modules from MobileNetV2. Extensive studies demonstrate that squeeze-and-excitation (SE) modules can enhance model fitting by dynamically weighting input features to adjust output distributions. However, we observe that conventional SE modules fail to adapt to the complex spectral textures of audio data. To address this, we propose an Audio Texture Attention (ATA) specifically designed for machine noise data, improving model robustness. Additionally, we integrate an LSTM layer and refine the temporal feature extraction architecture to strengthen the model’s sensitivity to sequential noise patterns. Experimental results on the DCASE 2020 Challenge Task 2 dataset show that our method achieves state-of-the-art performance, with AUC, pAUC, and mAUC scores of 96.15%, 90.58%, and 90.63%, respectively.
Journal Article
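For context on the entry above, here is a standard squeeze-and-excitation (SE) block, the conventional module the authors argue fails to adapt to audio spectral textures and replace with their Audio Texture Attention. This follows the widely published SE design; the reduction ratio is illustrative.

```python
# Sketch: a conventional squeeze-and-excitation (SE) block.
# Channel weights are derived from a global pool, then rescale the input.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (batch, C, freq, time)
        w = x.mean(dim=(2, 3))               # squeeze: global average pool
        w = self.fc(w)                       # excitation: per-channel weights
        return x * w[:, :, None, None]       # rescale the feature maps

x = torch.randn(2, 64, 128, 128)             # stand-in spectrogram features
print(SEBlock(64)(x).shape)                  # torch.Size([2, 64, 128, 128])
```

Because the squeeze step collapses the entire time-frequency plane to one scalar per channel, it cannot distinguish where in the spectrum a texture occurs, which is the limitation the ATA module targets.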
Spectro-Image Analysis with Vision Graph Neural Networks and Contrastive Learning for Parkinson’s Disease Detection
by Yi, Myunggi; Hewage, Chaminda; Malekroodi, Hadi Sedigh
in Accuracy, Acoustic properties, Analysis
2025
This study presents a novel framework that integrates Vision Graph Neural Networks (ViGs) with supervised contrastive learning for enhanced spectro-temporal image analysis of speech signals in Parkinson’s disease (PD) detection. The approach introduces a frequency band decomposition strategy that transforms raw audio into three complementary spectral representations, capturing distinct PD-specific characteristics across low-frequency (0–2 kHz), mid-frequency (2–6 kHz), and high-frequency (6 kHz+) bands. The framework processes mel multi-band spectro-temporal representations through a ViG architecture that models complex graph-based relationships between spectral and temporal components, trained using a supervised contrastive objective that learns discriminative representations distinguishing PD-affected from healthy speech patterns. Comprehensive experimental validation on multi-institutional datasets from Italy, Colombia, and Spain demonstrates that the proposed ViG-contrastive framework achieves superior classification performance, with the ViG-M-GELU architecture achieving 91.78% test accuracy. The integration of graph neural networks with contrastive learning enables effective learning from limited labeled data while capturing complex spectro-temporal relationships that traditional Convolutional Neural Network (CNN) approaches miss, representing a promising direction for developing more accurate and clinically viable speech-based diagnostic tools for PD.
Journal Article
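A minimal sketch of the three-band mel decomposition described in the entry above, using librosa as a reasonable tool choice (not necessarily the authors'). The band edges follow the abstract; mel and FFT settings are illustrative.

```python
# Sketch: three complementary mel spectro-temporal "views" per utterance,
# split at the band edges given in the abstract. Settings illustrative.
import numpy as np
import librosa

def band_mel_spectrograms(y, sr=16000):
    bands = [(0, 2000), (2000, 6000), (6000, sr // 2)]   # low / mid / high
    views = []
    for fmin, fmax in bands:
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64,
                                           fmin=fmin, fmax=fmax)
        views.append(librosa.power_to_db(S))   # image-like input per band
    return views   # three complementary views for the downstream encoder

views = band_mel_spectrograms(np.random.randn(32000))   # 2 s stand-in clip
print([v.shape for v in views])                          # three (64, T) views
```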