Catalogue Search | MBRL

Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing

by Wallace, Jonathan; Mulvenna, Maurice; Epelde, Gorka in Datasets, Health care policy, Information sharing

2020

The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce. This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data. A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression trees, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed. A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data.
When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility. The results of this study are promising with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making.
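The evaluation protocol described in this abstract (train on real or synthetic data, always test on real data, then compare accuracies) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy data, the function names, and the use of a 1-nearest-neighbor classifier in place of the five models the study actually used.

```python
from typing import List, Sequence, Tuple

# A row is (feature vector, class label). Hypothetical toy format.
Row = Tuple[Sequence[float], int]


def nearest_neighbor_predict(train: List[Row], x: Sequence[float]) -> int:
    """Predict the label of x as the label of its closest training row (1-NN)."""
    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(train, key=lambda row: squared_distance(row[0], x))[1]


def accuracy(train: List[Row], test: List[Row]) -> float:
    """Fraction of test rows classified correctly by a 1-NN model built on train."""
    correct = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return correct / len(test)


def accuracy_deviation(real_train: List[Row],
                       synthetic_train: List[Row],
                       real_test: List[Row]) -> float:
    """Accuracy of the real-trained model minus accuracy of the synthetic-trained
    model, with both models tested on the same real data."""
    return accuracy(real_train, real_test) - accuracy(synthetic_train, real_test)


# Invented toy data, not taken from any of the 19 datasets in the study.
real_train = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((1.0, 1.0), 1), ((0.9, 0.8), 1)]
synthetic_train = [((0.2, 0.1), 0), ((0.6, 0.6), 0), ((1.1, 0.9), 1)]
real_test = [((0.05, 0.1), 0), ((0.95, 0.9), 1), ((0.7, 0.7), 1), ((0.3, 0.2), 0)]

deviation = accuracy_deviation(real_train, synthetic_train, real_test)
print(f"accuracy deviation: {deviation:.3f}")
```

A positive deviation means the synthetic-trained model is less accurate on real data than the real-trained one, which is the direction reported for 92% of models in the abstract.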

Journal Article

Comparative assessment of synthetic time series generation approaches in healthcare: leveraging patient metadata for accurate data synthesis

by Larrea, Xabat; Epelde, Gorka; Beristain, Andoni in Access to information, Advanced machine learning and health-related multi-omics data, Coupling

2024

Background Synthetic data is an emerging approach for addressing legal and regulatory concerns in biomedical research that deals with personal and clinical data, whether as a single tool or through its combination with other privacy enhancing technologies. Generating uncompromised synthetic data could significantly benefit external researchers performing secondary analyses by providing unlimited access to information while fulfilling pertinent regulations. However, the original data to be synthesized (e.g., data acquired in Living Labs) may consist of subjects’ metadata (static) and a longitudinal component (set of time-dependent measurements), making it challenging to produce coherent synthetic counterparts. Methods Three synthetic time series generation approaches were defined and compared in this work: only generating the metadata and coupling it with the real time series from the original data (A1), generating both metadata and time series separately to join them afterwards (A2), and jointly generating both metadata and time series (A3). The comparative assessment of the three approaches was carried out using two different synthetic data generation models: the Wasserstein GAN with Gradient Penalty (WGAN-GP) and the DöppelGANger (DGAN). The experiments were performed with three different healthcare-related longitudinal datasets: Treadmill Maximal Effort Test (TMET) measurements from the University of Malaga (1), a hypotension subset derived from the MIMIC-III v1.4 database (2), and a lifelogging dataset named PMData (3). Results Three pivotal dimensions were assessed on the generated synthetic data: resemblance to the original data (1), utility (2), and privacy level (3). The optimal approach fluctuates based on the assessed dimension and metric. Conclusion The initial characteristics of the datasets to be synthesized play a crucial role in determining the best approach. 
Coupling synthetic metadata with real time series (A1), as well as jointly generating synthetic time series and metadata (A3), are both competitive methods, while separately generating time series and metadata (A2) appears to perform more poorly overall.

Journal Article

mHealth Apps Using Behavior Change Techniques to Self-report Data: Systematic Review

by Aguiar, Maria; Epelde, Gorka; Chaves, Deisy in Behavior, Behavior Therapy - methods, Chronic illnesses

2022

The popularization of mobile health (mHealth) apps for public health or medical care purposes has transformed human life substantially, improving lifestyle behaviors and chronic condition management. This review aimed to identify behavior change techniques (BCTs) commonly used in mHealth, assess their effectiveness based on the evidence reported in interventions and reviews to highlight the most appropriate techniques to design an optimal strategy to improve adherence to data reporting, and provide recommendations for future interventions and research. We performed a systematic review of studies published between 2010 and 2021 in relevant scientific databases to identify and analyze mHealth interventions using BCTs that evaluated their effectiveness in terms of user adherence. Search terms included a mix of general (eg, data, information, and adherence), computer science (eg, mHealth and BCTs), and medicine (eg, personalized medicine) terms. This systematic review included 24 studies and revealed that the most frequently used BCTs in the studies were feedback and monitoring (n=20), goals and planning (n=14), associations (n=14), shaping knowledge (n=12), and personalization (n=7). However, we found mixed effectiveness of the techniques in mHealth outcomes, having more effective than ineffective outcomes in the evaluation of apps implementing techniques from the feedback and monitoring, goals and planning, associations, and personalization categories, but we could not infer causality with the results and suggest that there is still a need to improve the use of these and many common BCTs for better outcomes. Personalization, associations, and goals and planning techniques were the most used BCTs in effective trials regarding adherence to mHealth apps. 
However, they are not necessarily the most effective since there are studies that use these techniques and do not report significant results in the proposed objectives; there is a notable overlap of BCTs within implemented app components, suggesting a need to better understand best practices for applying (a combination of) such techniques and to obtain details on the specific BCTs used in mHealth interventions. Future research should focus on studies with longer follow-up periods to determine the effectiveness of mHealth interventions on behavior change to overcome the limited evidence in the current literature, which has mostly small-sized and single-arm experiments with a short follow-up period.

Journal Article

An approach to boost adherence to self-data reporting in mHealth applications for users without specific health conditions

by Epelde, Gorka; Ayala, Unai; Tueros, Itziar in Adult, Behavior, Behavior change techniques

2025

Background The popularization of mobile health (mHealth) apps for public health or medical care purposes has transformed human life substantially, improving lifestyle behaviors and chronic condition management. The objective of this study is to evaluate the effect of gamification features in an mHealth app that includes the most common categories of behavior change techniques for the self-report of lifestyle data. The data reported by the user can be manual (i.e., diet, activity, and weight) and automatic (Fitbit wearable devices). As a secondary objective, this work aims to explore the differences in adherence when considering a longer study duration and make a comparative analysis of the gamification effect. Methods In this study, the effectiveness of various behavior change technique strategies is evaluated through the analysis of two user groups. With a first group of users, we perform a comparative analysis in terms of adherence and system usability scale of two versions of the app, both including the most common categories of behavior change techniques but the second version having added gamification features. Then, with a second group of participants and the best mHealth app version, a longer study is carried out and user adherence, the system usability scale and user feedback are analyzed. Results In the first stage study, results showed that the app version with gamification features achieved higher adherence, as the percentage of days active was higher for most of the users and the system usability scale score is 80.67, which is categorized as rank A. The app also exceeded the expectations of the users by about 70% for the app version with gamification functionalities. In the second stage of the study, an adherence of 76.25% is reported after 8 weeks and 58% at the end of the pilot for the mHealth app. Similarly, for the wearable device, an adherence of 74.32% is achieved after 8 weeks and 81.08% is obtained at the end of the pilot.
We hypothesize that these specific wearable devices have contributed to a decreased system usability scale score, reaching 62.89, which is ranked as C. Conclusion This study provides evidence of the effectiveness of the gamification category of behavior change techniques in increasing the overall user adherence, expectations, and perceived usability. In addition, the study provides quantitative results on the effect of the most common categories of behavior change techniques for the self-report of lifestyle data. Finally, the longer study duration revealed several limitations when capturing lifestyle data, especially when including wearable devices such as Fitbit.

Journal Article

Synthetic Tabular Data Generation Under Horizontal Federated Learning Environments in Acute Myeloid Leukemia: Case-Based Simulation Study

by Aginako, Naiara; Epelde, Gorka; Catalina, Mikel in Algorithms, Artificial intelligence, Bone marrow

2025

Data scarcity and dispersion pose significant obstacles in biomedical research, particularly when addressing rare diseases. In such scenarios, synthetic data generation (SDG) has emerged as a promising path to mitigate the first issue. Concurrently, federated learning is a machine learning paradigm where multiple nodes collaborate to create a centralized model with knowledge that is distilled from the data in different nodes, but without the need for sharing it. This research explores the combination of SDG and federated learning technologies in the context of acute myeloid leukemia, a rare hematological disorder, evaluating their combined impact and the quality of the generated artificial datasets. This study aims to evaluate the privacy- and fidelity-related impact of horizontally federating SDG models in different data distribution scenarios and with different numbers of nodes, comparing them with centralized baseline SDG models. Two state-of-the-art generative models, conditional tabular generative adversarial network and FedTabDiff, were trained considering four different scenarios: (1) a nonfederated baseline with all the data available, (2) a federated scenario where the data were evenly distributed among different nodes, (3) a federated scenario where the data were unevenly and randomly distributed (imbalanced data), and (4) a federated scenario with nonindependent and identically distributed data distributions. For each of the federated scenarios, a fixed set of node quantities (3, 5, 7, 10) was considered to assess its impact, and the generated data were evaluated, attending to a fidelity-privacy trade-off. The computed fidelity metrics exhibited statistically significant deteriorations (P<.001) up to 21% in the conditional tabular generative adversarial network and up to 62% in the FedTabDiff model due to the federation process. 
When comparing federated experiments trained with diverse numbers of nodes, no strong tendencies were observed, even if specific comparisons resulted in significant differences. Privacy metrics were mainly maintained while obtaining maximum improvements of 55% and maximum deteriorations of 26% between both models, although they were not statistically significant. Within the scope of the use case scenario in this paper, the act of horizontally federating SDG algorithms results in a loss of data fidelity compared to the nonfederated baseline while maintaining privacy levels. However, this deterioration does not significantly increase as the number of nodes used to train the models grows, even though significant differences were found in specific comparisons. The different data partition distribution configurations had no significant effect on the metrics, as similar tendencies were found for all scenarios.
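The three federated data distribution scenarios named in this abstract (even, randomly imbalanced, and non-IID) can be pictured with a small partitioning sketch. This is not the authors' procedure: the function names, the record format, and the choice of label skew as the form of non-IID partitioning are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

# A record is (record id, class label). Hypothetical toy format.
Record = Tuple[int, str]


def split_even(records: List[Record], n_nodes: int) -> List[List[Record]]:
    """Deal records round-robin so node sizes differ by at most one."""
    return [list(records[i::n_nodes]) for i in range(n_nodes)]


def split_imbalanced(records: List[Record], n_nodes: int, seed: int = 0) -> List[List[Record]]:
    """Shuffle, then cut at random positions so node sizes are uneven but non-empty."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cuts = sorted(rng.sample(range(1, len(shuffled)), n_nodes - 1))
    bounds = [0] + cuts + [len(shuffled)]
    return [shuffled[a:b] for a, b in zip(bounds, bounds[1:])]


def split_non_iid(records: List[Record], n_nodes: int) -> List[List[Record]]:
    """Sort by label before chunking, so each node sees a skewed label mix.
    Label skew is only one possible kind of non-IID split (assumption)."""
    ordered = sorted(records, key=lambda record: record[1])
    size = -(-len(ordered) // n_nodes)  # ceiling division
    return [ordered[i * size:(i + 1) * size] for i in range(n_nodes)]


# Invented toy records: six of class "A" and six of class "B".
records = [(i, "A" if i < 6 else "B") for i in range(12)]

even_nodes = split_even(records, 3)
imbalanced_nodes = split_imbalanced(records, 3, seed=0)
non_iid_nodes = split_non_iid(records, 3)

for name, nodes in [("even", even_nodes),
                    ("imbalanced", imbalanced_nodes),
                    ("non-IID", non_iid_nodes)]:
    print(name, [len(node) for node in nodes])
```

In each scenario every record lands on exactly one node, which is what makes the partitioning horizontal: nodes hold different rows of the same table rather than different columns.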

Journal Article

OBINTER: A Holistic Approach to Catalyse the Self-Management of Chronic Obesity

by Arranz, Sara; Epelde, Gorka; Álvarez, Roberto in adherence, Body composition, Catalysis

2020

Obesity is a preventable chronic condition that, in 2016, affected more than 1.9 billion people globally. Several factors have been identified that have a positive impact on long-term weight loss programs such as personalized recommendations, adherence strategies, weight and diet follow-up or physical activity tracking. Recently, various applications have been developed which help patients to self-manage their condition. These apps implement either one or some of these identified factors; however, there is not a single application that combines all of them following a holistic approach. In this context, we developed the OBINTER platform, which assists patients during the weight loss process by targeting user engagement during the longer term. The solution includes a mobile application which allows users to fill out dietetic questionnaires, receive dietetic and nutraceutical plans, track the evolution of their weight and adherence to the diet, as well as track their physical activity via a wearable device. Furthermore, an adherence strategy has been developed as a tool to foster the app usage during the whole weight loss process. In this paper, we present how the OBINTER approach gathers all of these features as well as the positive results of a usability testing study performed to assess the performance and usability of the OBINTER platform.

Journal Article

Comprehensive evaluation framework for synthetic tabular data in health: fidelity, utility and privacy analysis of generative models with and without privacy guarantees

by Loinaz, Lorea; Aginako, Naiara; Hernandez, Mikel in Data analysis, Digital Health, generative models

2025

The generation of synthetic tabular data has emerged as a key privacy-enhancing technology to address challenges in data sharing, particularly in healthcare, where sensitive attributes can compromise patient privacy. Despite significant progress, balancing fidelity, utility, and privacy in complex medical datasets remains a substantial challenge. This paper introduces a comprehensive and holistic evaluation framework for synthetic tabular data, consolidating metrics and privacy risk measures across three key categories (fidelity, utility and privacy) and incorporating a fidelity-utility tradeoff metric. The framework was applied to three open-source medical datasets to evaluate synthetic tabular data generated by five generative models, both with and without differential privacy. Results showed that simpler models generally achieved better fidelity and utility, while more complex models provided lower privacy risks. The addition of differential privacy enhanced privacy preservation but often reduced fidelity and utility, highlighting the complexity of balancing fidelity, utility and privacy in synthetic data generation for medical datasets. Despite its contributions, this study acknowledges limitations, such as the lack of evaluation metrics or privacy risk measures for required model training time and resource usage, reliance on default model parameters, and the assessment of models that incorporate differential privacy with only a single privacy budget. Future work should explore parameter optimization, alternative privacy mechanisms, broader applications of the framework to diverse datasets and domains, and collaborations with clinicians for clinical utility evaluation. This study provides a foundation for improving synthetic tabular data evaluation and advancing privacy-preserving data sharing in healthcare.

Journal Article

Incorporation of Synthetic Data Generation Techniques within a Controlled Data Processing Workflow in the Health and Wellbeing Domain

by Larrea, Xabat; Epelde, Gorka; Beristain, Andoni in Access control, Algorithms, Artificial intelligence

2022

To date, the use of synthetic data generation techniques in the health and wellbeing domain has been mainly limited to research activities. Although several open source and commercial packages have been released, they have been oriented to generating synthetic data as a standalone data preparation process and not integrated into a broader analysis or experiment testing workflow. In this context, the VITALISE project is working to harmonize Living Lab research and data capture protocols and to provide controlled processing access to captured data to industrial and scientific communities. In this paper, we present the initial design and implementation of our synthetic data generation approach in the context of VITALISE Living Lab controlled data processing workflow, together with identified challenges and future developments. By uploading data captured from Living Labs, generating synthetic data from them, developing analysis locally with synthetic data, and then executing them remotely with real data, the utility of the proposed workflow has been validated. Results have shown that the presented workflow helps accelerate research on artificial intelligence, ensuring compliance with data protection laws. The presented approach has demonstrated how the adoption of state-of-the-art synthetic data generation techniques can be applied for real-world applications.

Journal Article

An ensemble-based feature selection framework to select risk factors of childhood obesity for policy decision making

by Shi, Xi; Epelde, Gorka; De Moor, Bart in Blood pressure, Body weight, Breast feeding

2021

Background The increasing prevalence of childhood obesity makes it essential to study the risk factors with a sample representative of the population covering more health topics for better preventive policies and interventions. This study aims to develop an ensemble feature selection framework for large-scale data to identify risk factors of childhood obesity with good interpretability and clinical relevance. Methods We analyzed the data collected from 426,813 children under 18 during 2000–2019. A BMI above the 90th percentile for the children of the same age and gender was defined as overweight. An ensemble feature selection framework, Bagging-based Feature Selection framework integrating MapReduce (BFSMR), was proposed to identify risk factors. The framework comprises 5 models (filter with mutual information/SVM-RFE/Lasso/Ridge/Random Forest) from filter, wrapper, and embedded feature selection methods. Each feature selection model identified 10 variables based on variable importance. Considering accuracy, F-score, and model characteristics, the models were classified into 3 levels with different weights: Lasso/Ridge, Filter/SVM-RFE, and Random Forest. The voting strategy was applied to aggregate the selected features, with both feature weights and model weights taken into consideration. We compared our voting strategy with another two for selecting top-ranked features in terms of 6 dimensions of interpretability. Results Our method performed the best in selecting features with good interpretability and clinical relevance. The top 10 features selected by BFSMR are age, sex, birth year, breastfeeding type, smoking habit and diet-related knowledge of both children and mothers, exercise, and mother’s systolic blood pressure.
Conclusion Our framework provides a solution for identifying a diverse and interpretable feature set without model bias from large-scale data, which can help identify risk factors of childhood obesity and potentially some other diseases for future interventions or policies.
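The voting step described in this abstract, which combines feature weights with model weights, can be sketched roughly as follows. The abstract does not give the exact weighting scheme, so the rank-based feature weight, the numeric model weights, and all names below are hypothetical stand-ins rather than the BFSMR definition.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_vote(selections: Dict[str, List[str]],
                  model_weights: Dict[str, float],
                  top_k: int) -> List[str]:
    """Aggregate per-model ranked feature lists into one top_k list.

    Each feature earns model_weight * feature_weight from every model that
    selected it. The feature weight used here is rank-based (first place
    gets 1.0, last place gets 1/n), which is an assumption for illustration.
    """
    scores: Dict[str, float] = defaultdict(float)
    for model, ranked_features in selections.items():
        n = len(ranked_features)
        for rank, feature in enumerate(ranked_features):
            feature_weight = (n - rank) / n
            scores[feature] += model_weights[model] * feature_weight
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))[:top_k]


# Invented toy selections: three models, three ranked features each.
selections = {
    "lasso": ["age", "sex", "exercise"],
    "svm_rfe": ["sex", "age", "diet"],
    "random_forest": ["diet", "exercise", "age"],
}
# Hypothetical model weights; the abstract only says the three levels differ.
model_weights = {"lasso": 3.0, "svm_rfe": 2.0, "random_forest": 1.0}

top_features = weighted_vote(selections, model_weights, top_k=2)
print(top_features)
```

Weighting by both rank and model level lets a feature that several models rank highly outscore one that a single model ranks first, which is one way to reduce dependence on any single model's bias.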

Journal Article

User Centered Virtual Coaching for Older Adults at Home Using SMART Goal Plans and I-Change Model

by Petsani, Despoina; Epelde, Gorka; Carroll, Joanne in Adults, Behavior, Caregivers

2021

Preventive care and telemedicine are expected to play an important role in reducing the impact of an increasingly aging global population while increasing the number of healthy years. Virtual coaching is a promising research area to support this process. This paper presents a user-centered virtual coach for older adults at home to promote active and healthy aging and independent living. It supports behavior change processes for improving on cognitive, physical, social interaction and nutrition areas using specific, measurable, achievable, relevant, and time-limited (SMART) goal plans, following the I-Change behavioral change model. Older adults select and personalize which goal plans to join from a catalog designed by domain experts. Intervention delivery adapts to user preferences and minimizes intrusiveness in the user’s daily living using a combination of a deterministic algorithm and incremental machine learning model. The home becomes an augmented reality environment, using a combination of projectors, cameras, microphones and support sensors, where common objects are used for projection and sensed. Older adults interact with this virtual coach in their home in a natural way using speech and body gestures on projected user interfaces with common objects at home. This paper presents the concept from the older adult and the caregiver perspectives. Then, it focuses on the older adult view, describing the tools and processes available to foster a positive behavior change process, including a discussion about the limitations of the current implementation.

Journal Article
