Catalogue Search | MBRL

Synthetic Generation of Passive Infrared Motion Sensor Data Using a Game Engine

by Olsson, Carl Magnus , Karlsson, Fredrik , Persson, Magnus in Automation , Cameras , Computer simulation

2021

Quantifying the number of occupants in an indoor space is useful for a wide variety of applications. Attempts have been made at solving the task using passive infrared (PIR) motion sensor data together with supervised learning methods. Collecting a large labeled dataset containing both PIR motion sensor data and ground truth people count is however time-consuming, often requiring one hour of observation for each hour of data gathered. In this paper, a method is proposed for generating such data synthetically. A simulator is developed in the Unity game engine capable of producing synthetic PIR motion sensor data by detecting simulated occupants. The accuracy of the simulator is tested by replicating a real-world meeting room inside the simulator and conducting an experiment where a set of choreographed movements are performed in the simulated environment as well as the real room. In 34 out of 50 tested situations, the output from the simulated PIR sensors is comparable to the output from the real-world PIR sensors. The developed simulator is also used to study how a PIR sensor’s output changes depending on where in a room a motion is carried out. Through this, the relationship between sensor output and spatial position of a motion is discovered to be highly non-linear, which highlights some of the difficulties associated with mapping PIR data to occupancy count.

Journal Article

Survey on Synthetic Data Generation, Evaluation Methods and GANs

by Vaz, Bruno , Figueira, Alvaro in Algorithms , Citations , Datasets

2022

Synthetic data consists of artificially generated data. When data are scarce, or of poor quality, synthetic data can be used, for example, to improve the performance of machine learning models. Generative adversarial networks (GANs) are state-of-the-art deep generative models that can generate novel synthetic samples that follow the underlying data distribution of the original dataset. Reviews on synthetic data generation and on GANs have already been written. However, none in the relevant literature, to the best of our knowledge, has explicitly combined these two topics. This survey aims to fill this gap and to give new researchers in the field a solid starting point, with a general overview of the key contributions and useful references. We have conducted a review of the state-of-the-art by querying four major databases: Web of Science (WoS), Scopus, IEEE Xplore, and ACM Digital Library. This allowed us to gain insights into the most relevant authors, the most relevant scientific journals in the area, the most cited papers, the most significant research areas, the most important institutions, and the most relevant GAN architectures. GANs were thoroughly reviewed, as were their most common training problems and their most important breakthroughs, with a focus on GAN architectures for tabular data. Further, the main algorithms for generating synthetic data, their applications, and our thoughts on these methods are also presented. Finally, we reviewed the main techniques for evaluating the quality of synthetic data (especially tabular data) and provided a schematic overview of the information presented in this paper.

Journal Article

A Review of Synthetic Image Data and Its Use in Computer Vision

by Man, Keith , Chahl, Javaan in Algorithms , Annotations , Artificial neural networks

2022

Development of computer vision algorithms using convolutional neural networks and deep learning has necessitated ever greater amounts of annotated and labelled data to produce high performance models. Large, public data sets have been instrumental in pushing forward computer vision by providing the data necessary for training. However, many computer vision applications cannot rely on general image data provided in the available public datasets to train models, instead requiring labelled image data that is not readily available in the public domain on a large scale. At the same time, acquiring such data from the real world can be difficult and costly, and labelling it in large quantities is labour-intensive. Because of this, synthetic image data has been pushed to the forefront as a potentially faster and cheaper alternative to collecting and annotating real data. This review provides a general overview of types of synthetic image data, as categorised by synthesised output, common methods of synthesising different types of image data, existing applications and logical extensions, performance of synthetic image data in different applications and the associated difficulties in assessing data performance, and areas for further research.

Journal Article

Generation and evaluation of synthetic patient data

by Stevens, Jennifer , Goncalves, Andre , Soper, Braden in BASIC BIOLOGICAL SCIENCES , Cancer patient data , Cancer research

2020

Background: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the lack of availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic, synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high dimensional and often categorical. These characteristics pose multiple modeling challenges. Methods: In this paper, we evaluate three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed. Results: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance Epidemiology and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, which includes over 360,000 individual cases. Conclusions: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.

Journal Article

Pixel-Wise Crowd Understanding via Synthetic Data

by Wang, Qi , Gao, Junyu , Yuan, Yuan in Algorithms , Computer & video games , Computer vision

2021

Crowd analysis via computer vision techniques is an important topic in the field of video surveillance, with widespread applications including crowd monitoring, public safety, and space design. Pixel-wise crowd understanding is the most fundamental task in crowd analysis because it gives finer results for video sequences or still images than other analysis tasks. Unfortunately, pixel-level understanding needs a large amount of labeled training data. Annotating such data is expensive, so current crowd datasets are small. As a result, most algorithms suffer from over-fitting to varying degrees. In this paper, taking crowd counting and segmentation as examples of pixel-wise crowd understanding, we attempt to remedy these problems from two aspects, namely data and methodology. Firstly, we develop a free data collector and labeler to generate synthetic and labeled crowd scenes in a computer game, Grand Theft Auto V. Then we use it to construct a large-scale, diverse synthetic crowd dataset, which is named “GCC Dataset”. Secondly, we propose two simple methods to improve the performance of crowd understanding by exploiting the synthetic data. To be specific, (1) supervised crowd understanding: pre-train a crowd analysis model on the synthetic data, then fine-tune it using the real data and labels, which makes the model perform better in the real world; (2) crowd understanding via domain adaptation: translate the synthetic data to photo-realistic images, then train the model on translated data and labels. As a result, the trained model works well in real crowd scenes. Extensive experiments verify that the supervision algorithm outperforms the state-of-the-art performance on four real datasets: UCF_CC_50, UCF-QNRF, and Shanghai Tech Part A/B Dataset. The above results show the effectiveness and value of the synthetic GCC dataset for pixel-wise crowd understanding. The tools for collecting and labeling data, the proposed synthetic dataset, and the source code for counting models are available at https://gjy3035.github.io/GCC-CL/.

Journal Article

Bt-GAN: Generating Fair Synthetic Healthdata via Bias-transforming Generative Adversarial Networks

by Heintz, Fredrik , Ramachandranpillai, Resmi , Sikder, Md Fahim in Bias , Electronic health records , Fair data generation

2024

Synthetic data generation offers a promising solution to enhance the usefulness of Electronic Healthcare Records (EHR) by generating realistic de-identified data. However, the existing literature primarily focuses on the quality of synthetic health data, neglecting the crucial aspect of fairness in downstream predictions. Consequently, models trained on synthetic EHR have faced criticism for producing biased outcomes in target tasks. These biases can arise from either spurious correlations between features or the failure of models to accurately represent sub-groups. To address these concerns, we present Bias-transforming Generative Adversarial Networks (Bt-GAN), a GAN-based synthetic data generator specifically designed for the healthcare domain. In order to tackle spurious correlations (i), we propose an information-constrained Data Generation Process (DGP) that enables the generator to learn a fair deterministic transformation based on a well-defined notion of algorithmic fairness. To overcome the challenge of capturing exact sub-group representations (ii), we incentivize the generator to preserve sub-group densities through score-based weighted sampling. This approach compels the generator to learn from underrepresented regions of the data manifold. To evaluate the effectiveness of our proposed method, we conduct extensive experiments using the Medical Information Mart for Intensive Care (MIMIC-III) database. Our results demonstrate that Bt-GAN achieves state-of-the-art accuracy while significantly improving fairness and minimizing bias amplification. Furthermore, we perform an in-depth explainability analysis to provide additional evidence supporting the validity of our study. In conclusion, our research introduces a novel and professional approach to addressing the limitations of synthetic data generation in the healthcare domain. By incorporating fairness considerations and leveraging advanced techniques such as GANs, we pave the way for more reliable and unbiased predictions in healthcare applications.

Journal Article

Generative Pre-Trained Transformer (GPT) in Research: A Systematic Review on Data Augmentation

by Sufi, Fahim in Classification , Computational linguistics , Data analysis

2024

GPT (Generative Pre-trained Transformer) represents advanced language models that have significantly reshaped the academic writing landscape. These sophisticated language models offer invaluable support throughout all phases of research work, facilitating idea generation, enhancing drafting processes, and overcoming challenges like writer’s block. Their capabilities extend beyond conventional applications, contributing to critical analysis, data augmentation, and research design, thereby elevating the efficiency and quality of scholarly endeavors. Strategically narrowing its focus, this review explores alternative dimensions of GPT and LLM applications, specifically data augmentation and the generation of synthetic data for research. Employing a meticulous examination of 412 scholarly works, it distills a selection of 77 contributions addressing three critical research questions: (1) GPT on Generating Research data, (2) GPT on Data Analysis, and (3) GPT on Research Design. The systematic literature review adeptly highlights the central focus on data augmentation, encapsulating 48 pertinent scholarly contributions, and extends to the proactive role of GPT in critical analysis of research data and shaping research design. Pioneering a comprehensive classification framework for “GPT’s use on Research Data”, the study classifies existing literature into six categories and 14 sub-categories, providing profound insights into the multifaceted applications of GPT in research data. This study meticulously compares 54 pieces of literature, evaluating research domains, methodologies, and advantages and disadvantages, providing scholars with profound insights crucial for the seamless integration of GPT across diverse phases of their scholarly pursuits.

Journal Article

Functional assessment of bidirectional cortical and peripheral neural control on heartbeat dynamics: A brain-heart study on thermal stress

by Barbieri, Riccardo , Candia-Rivera, Diego , Catrambone, Vincenzo in Brain , Brain-heart interplay , Cognitive ability

2022

• We propose a new framework to assess neural dynamics involved in heartbeat control.
• The modeling is based on coupled synthetic data generators of EEG and RR series.
• Cardiac sympathovagal activity is modelled through Laguerre expansions of RR series.
• Time-varying directional brain-heart interplay is quantified under thermal stress.

The study of functional Brain-Heart Interplay (BHI) from non-invasive recordings has gained much interest in recent years. Previous endeavors aimed at understanding how the two dynamical systems exchange information, providing novel holistic biomarkers and important insights on essential cognitive aspects and neural system functioning. However, the interplay between cardiac sympathovagal and cortical oscillations still has much room for further investigation. In this study, we introduce a new computational framework for a functional BHI assessment, namely the Sympatho-Vagal Synthetic Data Generation Model, combining cortical (electroencephalography, EEG) and peripheral (cardiac sympathovagal) neural dynamics. The causal, bidirectional neural control on heartbeat dynamics was quantified on data gathered from 26 human volunteers undergoing a cold-pressor test. Results show that thermal stress induces heart-to-brain functional interplay sustained by EEG oscillations in the delta and gamma bands, primarily originating from sympathetic activity, whereas brain-to-heart interplay originates over central brain regions through sympathovagal control. The proposed methodology provides a viable computational tool for the functional assessment of the causal interplay between cortical and cardiac neural control.

Journal Article

DatRel: a noise-tolerant data relocation approach for effective synthetic data generation in imbalanced classifiers

by Sağlam, Fatih in Algorithms , Artificial Intelligence , Computer Science

2025

Most machine learning algorithms tend to bias towards the majority class when a dataset exhibits a skewed distribution in the class variable. This is called the class imbalance problem and is frequently encountered in real-life applications. One of the most prevalent methods for addressing class imbalance is data resampling, which generates or removes samples to balance the dataset. A well-known issue with oversampling is noise generation. Noise removal or hybrid resampling is used to deal with noise. However, these methods cause imbalance to re-emerge. In this study, a data relocation approach named DatRel is proposed to address the noise generation problem of oversampling without causing imbalance. The proposed approach utilizes pure and proper class cover catch digraphs (P-CCCD) to determine dominant points and cover areas for minority class. Then, new samples from oversampling are drawn to the dominant points until they are covered. This process ensures that newly generated samples never overlap with a negative sample. Imbalance is not affected since no sample is removed by undersampling. The proposed DatRel approach is applied to commonly used oversampling methods, namely SMOTE, ADASYN, and BLSMOTE. Moreover, the performance of the DatRel approach is compared to noise filtering methods such as Tomeklink, ENN, NEATER, and NearMiss after SMOTE. Several baseline classification algorithms are employed, and comparisons are made using various metrics. Results using 49 imbalanced datasets show that DatRel improves classifier performance in oversampling methods and demonstrates its value in comparison to other noise removal techniques according to AUC, BACC, F1, GMEAN, and MCC.

Journal Article
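
The oversampling step that the abstract above builds on can be illustrated with a minimal sketch. This is not DatRel and not the paper's code; it is a simplified SMOTE-style interpolation, assuming numeric feature tuples and Euclidean distance. The function name `smote_like_oversample` and the toy points are hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment between base and neighbour.
        t = rng.random()
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical toy minority class with two features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=5)
```

Because each synthetic point lies on a segment between two minority samples, it can fall in a region occupied by majority samples, which is the noise generation issue the abstract refers to.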

Hybrid-Model-Based Digital Twin of the Drivetrain of a Wind Turbine and Its Application for Failure Synthetic Data Generation

by Pujana, Ainhoa , Perea, Eugenio , Maqueda, Erik in Algorithms , Artificial intelligence , Datasets

2023

Computer modelling and digitalization are integral to the wind energy sector since they provide tools with which to improve the design and performance of wind turbines, and thus reduce both capital and operational costs. The massive sensor rollout and increase in big data processing capacity over the last decade have made data collection and analysis more efficient, allowing for the development and use of digital twins. This paper presents a methodology for developing a hybrid-model-based digital twin (DT) of a power conversion system of wind turbines. This DT allows knowledge to be acquired from real operation data while preserving physical design relationships, can generate synthetic data from events that never happened, and helps in the detection and classification of different failure conditions. Starting from an initial physics-based model of a wind turbine drivetrain, which is trained with real data, the proposed methodology has two major innovative outcomes. The first innovation aspect is the application of generative stochastic models coupled with a hybrid-model-based digital twin (DT) for the creation of synthetic failure data based on real anomalies observed in SCADA data. The second innovation aspect is the classification of failures based on machine learning techniques, which allows anomaly conditions to be identified in the operation of the wind turbine. The technique and methodology were contrasted and validated with operation data of a real wind farm owned by Engie, including labelled failure conditions. Although the selected use case technology is based on a double-fed induction generator (DFIG) and its corresponding partial-scale power converter, the methodology could be applied to other wind conversion technologies.

Journal Article
