Catalogue Search | MBRL

On the Importance of Passive Acoustic Monitoring Filters

by Maguolo, Gianluca , Aguiar, Rafael , Silla, Carlos in Accuracy , Acoustic tracking , Acoustics

2021

Passive acoustic monitoring (PAM) is a noninvasive technique to supervise wildlife. Acoustic surveillance is preferable in some situations such as in the case of marine mammals, when the animals spend most of their time underwater, making it hard to obtain their images. Machine learning is very useful for PAM, for example to identify species based on audio recordings. However, some care should be taken to evaluate the capability of a system. We defined PAM filters as the creation of the experimental protocols according to the dates and locations of the recordings, aiming to avoid the use of the same individuals, noise patterns, and recording devices in both the training and test sets. It is important to remark that the filters proposed here were not intended to improve the accuracy rates. Indeed, these filters tended to make it harder to obtain better rates, but at the same time, they tended to provide more reliable results. In our experiments, a random division of a database presented accuracies much higher than accuracies obtained with protocols generated with PAM filters, which indicates that the classification system learned other components presented in the audio. Although we used the animal vocalizations, in our method, we converted the audio into spectrogram images, and after that, we described the images using the texture. These are well-known techniques for audio classification, and they have already been used for species classification. Furthermore, we performed statistical tests to demonstrate the significant difference between the accuracies generated with and without PAM filters with several well-known classifiers. The configuration of our experimental protocols and the database were made available online.

Journal Article

Introduction to audiovisual archives

by Stockinger, Peter in Audio-visual archives , Audio-visual materials , Audio-visual materials--Classification

2013, 2012

Today, audiovisual archives and libraries have become very popular especially in the field of collecting, preserving and transmitting cultural heritage. However, the data from these archives or libraries – videos, images, sound tracks, etc. – constitute as such only potential cognitive resources for a given public (or “target community”). They have to undergo more or less significant qualitative transformations in order to become user- or community-relevant intellectual goods. These qualitative transformations are performed through a series of concrete operations such as: audiovisual text segmentation, content description and indexing, pragmatic profiling, translation, etc. These and other operations constitute what we call the semiotic turn in dealing with digital (audiovisual) texts, corpora of texts or even entire (audiovisual) archives and libraries. They demonstrate practically and theoretically the well-known “from data to meta-data” or “from (simple) information to (relevant) knowledge” problem – a problem that obviously directly influences the effective use, the social impact and relevancy and therefore also the future of digital knowledge archives. It constitutes, indeed, the heart of a diversity of important R&D programs and projects all over the world.

eBook

Comparative analysis of audio classification with MFCC and STFT features using machine learning techniques

by Gourisaria, Mahendra Kumar , Agrawal, Rakshit , Sahni, Manoj in Accuracy , Algorithms , Artificial Neural Network

2024

In the era of automated and digitalized information, advanced computer applications deal with a major part of the data that comprises audio-related information. Advancements in technology have ushered in a new era where cutting-edge devices can deliver comprehensive insights into audio content, leveraging sophisticated algorithms such as Mel Frequency Cepstral Coefficients (MFCCs) and Short-Time Fourier Transform (STFT) to extract and provide pertinent information. Our study helps in not only efficient audio file management and audio file retrievals but also plays a vital role in security, the robotics industry, and investigations. Beyond its industrial applications, our model exhibits remarkable versatility in the corporate sector, particularly in tasks like siren sound detection and more. Embracing this capability holds the promise of catalyzing the development of advanced automated systems, paving the way for increased efficiency and safety across various corporate domains. The primary aim of our experiment is to focus on creating highly efficient audio classification models that can be seamlessly automated and deployed within the industrial sector, addressing critical needs for enhanced productivity and performance. Despite the dynamic nature of environmental sounds and the presence of noises, our presented audio classification model comes out to be efficient and accurate. The novelty of our research work reclines to compare two different audio datasets having similar characteristics and revolves around classifying the audio signals into several categories using various machine learning techniques and extracting MFCCs and STFTs features from the audio signals. We have also tested the results after and before the noise removal for analyzing the effect of the noise on the results including the precision, recall, specificity, and F1-score. Our experiment shows that the ANN model outperforms the other six audio models with the accuracy of 91.41% and 91.27% on respective datasets.

Journal Article

Optimizing poultry audio signal classification with deep learning and burn layer fusion

by Hassan, Esraa , Abd El-Hafeez, Tarek , Elbedwehy, Samar in Algorithms , Animal health , Animal husbandry

2024

This study introduces a novel deep learning-based approach for classifying poultry audio signals, incorporating a custom Burn Layer to enhance model robustness. The methodology integrates digital audio signal processing, convolutional neural networks (CNNs), and the innovative Burn Layer, which injects controlled random noise during training to reinforce the model's resilience to input signal variations. The proposed architecture is streamlined, with convolutional blocks, densely connected layers, dropout, and an additional Burn Layer to fortify robustness. The model demonstrates efficiency by reducing trainable parameters to 191,235, compared to traditional architectures with over 1.7 million parameters. The proposed model utilizes a Burn Layer with burn intensity as a parameter and an Adamax optimizer to optimize and address the overfitting problem. Thorough evaluation using six standard classification metrics showcases the model's superior performance, achieving exceptional sensitivity (96.77%), specificity (100.00%), precision (100.00%), negative predictive value (NPV) (95.00%), accuracy (98.55%), F1 score (98.36%), and Matthew’s correlation coefficient (MCC) (95.88%). This research contributes valuable insights into the fields of audio signal processing, animal health monitoring, and robust deep-learning classification systems. The proposed model presents a systematic approach for developing and evaluating a deep learning-based poultry audio classification system. It processes raw audio data and labels to generate digital representations, utilizes a Burn Layer for training variability, and constructs a CNN model with convolutional blocks, pooling, and dense layers. The model is optimized using the Adamax algorithm and trained with data augmentation and early-stopping techniques. Rigorous assessment on a test dataset using standard metrics demonstrates the model's robustness and efficiency, with the potential to significantly advance animal health monitoring and disease detection through audio signal analysis.

Journal Article

Sound Classification and Processing of Urban Environments: A Systematic Literature Review

by Oliveira, Hugo S. , Nogueira, Ana Filipa Rodrigues , Machado, José J. M. in Algorithms , attention mechanisms , audio classification

2022

Audio recognition can be used in smart cities for security, surveillance, manufacturing, autonomous vehicles, and noise mitigation, just to name a few. However, urban sounds are everyday audio events that occur daily, presenting unstructured characteristics containing different genres of noise and sounds unrelated to the sound event under study, making it a challenging problem. Therefore, the main objective of this literature review is to summarize the most recent works on this subject to understand the current approaches and identify their limitations. Based on the reviewed articles, it can be realized that Deep Learning (DL) architectures, attention mechanisms, data augmentation techniques, and pretraining are the most crucial factors to consider while creating an efficient sound classification model. The best-found results were obtained by Mushtaq and Su, in 2020, using a DenseNet-161 with pretrained weights from ImageNet, and NA-1 and NA-2 as augmentation techniques, which were of 97.98%, 98.52%, and 99.22% for UrbanSound8K, ESC-50, and ESC-10 datasets, respectively. Nonetheless, the use of these models in real-world scenarios has not been properly addressed, so their effectiveness is still questionable in such situations.

Journal Article

Deep Neural Networks-based Classification Methodologies of Speech, Audio and Music, and its Integration for Audio Metadata Tagging

by Park, Hosung , Kim, Ji-Hwan , Chung, Yoonseo in Artificial neural networks , Audio data , Automatic speech recognition

2023

Videos contain visual and auditory information. Visual information in a video can include images of people, objects, and the landscape, whereas auditory information includes voices, sound effects, background music, and the soundscape. The audio content can provide detailed information on the story by conducting a voice and atmosphere analysis of the sound effects and soundscape. Metadata tags represent the results of a media analysis as text. The tags can classify video content on social networking services, like YouTube. This paper presents the methodologies of speech, audio, and music processing. Also, we propose integrating these audio tagging methods and applying them in an audio metadata generation system for video storytelling. The proposed system automatically creates metadata tags based on speech, sound effects, and background music information from the audio input. The proposed system comprises five subsystems: (1) automatic speech recognition, which generates text from the linguistic sounds in the audio, (2) audio event classification for the type of sound effect, (3) audio scene classification for the type of place from the soundscape, (4) music detection for the background music, and (5) keyword extraction from the automatic speech recognition results. First, the audio signal is converted into a suitable form, which is subsequently combined from each subsystem to create metadata for the audio content. We evaluated the proposed system using video logs (vlogs) on YouTube. The proposed system exhibits a similar accuracy to handcrafted metadata for the audio content, and for a total of 104 YouTube vlogs, achieves an accuracy of 65.83%.

Journal Article

Hydraulic Seal Wear Classification by Fine-Tuning a Transformer-Based Audio Model Using Acoustic Emission

by Svendsen, Lisa Maria , Shanbhag, Vignesh V. , Schlanbusch, Rune in Accuracy , acoustic emission , Acoustic emission testing

2026

Accurate classification of seal wear is essential for condition-based and predictive maintenance of hydraulic cylinders, where seal degradation can cause fluid leakage and impair normal system operation. This study investigates the adaptation of a Transformer-based audio model for classifying seal wear conditions using acoustic emission (AE) signals. Specifically, we adapt the Audio Spectrogram Transformer (AST), a convolution-free, purely attention-based model that operates directly on audio spectrograms. The Transformer architecture enables the modeling of long-range dependencies, while the model learns discriminative representations directly from AE data without relying on manually engineered features. A selective fine-tuning strategy was implemented by adding layer-freezing functionality to the AST training pipeline, enabling different freezing configurations during fine-tuning. This allowed earlier pretrained representations to be preserved while adapting the later layers to the target AE signals, thereby reducing the risk of overfitting in the small-data setting. In addition, validation-driven early stopping was implemented to further improve generalization during fine-tuning. The model was initialized with ImageNet and AudioSet pretrained weights to exploit general-purpose representations learned from large-scale datasets. The AE data were acquired under varying pressure conditions on a hydraulic test rig designed to simulate hydraulic cylinder leakage. The datasets were partitioned into fine-tuning, validation, and evaluation subsets and labeled into three wear states: unworn, semi-worn, and worn. In addition, data augmentation techniques were applied to the fine-tuning data to increase diversity and mitigate class imbalance. The adapted model achieved 97.92% classification accuracy across all wear conditions and pressure settings, demonstrating its ability to learn discriminative wear-related patterns directly from AE data. Furthermore, the framework’s versatility was further assessed on a bearing strip dataset acquired from the same hydraulic test rig. Using the same fine-tuning configuration, the model achieved 95.65% accuracy and 100% recall for the worn state. These findings highlight the potential of transformer-based architectures for data-efficient, end-to-end AE-based diagnostics across hydraulic system components.

Journal Article

Spectrogram based multi-task audio classification

by Zeng, Yuni , Mao, Hua , Zhang, Yi in Artificial neural networks , Classification , Convolution

2019

Audio classification is regarded as a great challenge in pattern recognition. Although audio classification tasks are always treated as independent tasks, tasks are essentially related to each other such as speakers’ accent and speakers’ identification. In this paper, we propose a Deep Neural Network (DNN)-based multi-task model that exploits such relationships and deals with multiple audio classification tasks simultaneously. We term our model as the gated Residual Networks (GResNets) model since it integrates Deep Residual Networks (ResNets) with a gate mechanism, which extract better representations between tasks compared with Convolutional Neural Networks (CNNs). Specifically, two multiplied convolutional layers are used to replace two feed-forward convolution layers in the ResNets. We tested our model on multiple audio classification tasks and found that our multi-task model achieves higher accuracy than task-specific models which train the models separately.

Journal Article

An Ensemble of Convolutional Neural Networks for Audio Classification

by Maguolo, Gianluca , Brahnam, Sheryl , Nanni, Loris in audio classification , Birds , Classification

2021

Research in sound classification and recognition is rapidly advancing in the field of pattern recognition. One important area in this field is environmental sound recognition, whether it concerns the identification of endangered species in different habitats or the type of interfering noise in urban environments. Since environmental audio datasets are often limited in size, a robust model able to perform well across different datasets is of strong research interest. In this paper, ensembles of classifiers are combined that exploit six data augmentation techniques and four signal representations for retraining five pre-trained convolutional neural networks (CNNs); these ensembles are tested on three freely available environmental audio benchmark datasets: (i) bird calls, (ii) cat sounds, and (iii) the Environmental Sound Classification (ESC-50) database for identifying sources of noise in environments. To the best of our knowledge, this is the most extensive study investigating ensembles of CNNs for audio classification. The best-performing ensembles are compared and shown to either outperform or perform comparatively to the best methods reported in the literature on these datasets, including on the challenging ESC-50 dataset. We obtained a 97% accuracy on the bird dataset, 90.51% on the cat dataset, and 88.65% on ESC-50 using different approaches. In addition, the same ensemble model trained on the three datasets managed to reach the same results on the bird and cat datasets while losing only 0.1% on ESC-50. Thus, we have managed to create an off-the-shelf ensemble that can be trained on different datasets and reach performances competitive with the state of the art.

Journal Article

A Machine Learning-Assisted Automation System for Optimizing Session Preparation Time in Digital Audio Workstations

by Ioniță, Horia Sebastian , Negru, Marian , Paleologu, Constantin in audio classification , audio engineering , Audio equipment

2025

Modern audio production workflows often require significant manual effort during the initial session preparation phase, including track labeling, format standardization, and gain staging. This paper presents a rule-based and Machine Learning-assisted automation system designed to minimize the time required for these tasks in Digital Audio Workstations (DAWs). The system automatically detects and labels audio tracks, identifies and eliminates redundant fake stereo channels, merges double-tracked instruments into stereo pairs, standardizes sample rate and bit rate across all tracks, and applies initial gain staging using target loudness values derived from a Genetic Algorithm (GA)-based system, which optimizes gain levels for individual track types based on engineer preferences and instrument characteristics. By replacing manual setup processes with automated decision-making methods informed by Machine Learning (ML) and rule-based heuristics, the system reduces session preparation time by up to 70% in typical multitrack audio projects. The proposed approach highlights how practical automation, combined with lightweight Neural Network (NN) models, can optimize workflow efficiency in real-world music production environments.

Journal Article
