Catalogue Search | MBRL

Agreement and utility of coded primary and secondary care data for long-term follow-up of clinical trial outcomes

by Jones, Nicholas , Seeley, Anna E , Williams, Marney in Agreements , Antihypertensives , Blood pressure

2025

Background: Whilst interest in efficient trial design has grown with the use of electronic health records (EHRs) to collect trial outcomes, practical challenges remain. Commonly raised concerns revolve around data availability, data quality and issues with data validation. This study aimed to assess the agreement between data collected on clinical trial participants from different sources to provide empirical evidence on the utility of EHRs for follow-up in randomised controlled trials (RCTs).
Methods: This retrospective, participant-level data utility comparison study was undertaken using data collected as part of a UK primary care-based, randomised controlled trial (OPTiMISE). The primary outcome measure was the recording of all-cause hospitalisation or mortality within 3 years post-randomisation and was assessed across (1) coded primary care data; (2) coded-plus-free-text primary care data; and (3) coded secondary care and mortality data. Agreement levels across data sources were assessed using Fleiss’ Kappa (K). Kappa statistics were interpreted using an established framework, categorising agreement strength as follows: <0 (poor), 0.00–0.20 (slight), 0.21–0.40 (fair), 0.41–0.60 (moderate), 0.61–0.80 (substantial), and 0.81–1.00 (almost perfect) agreement. The impact of using different data sources to determine trial outcomes was assessed by replicating the trial’s original analyses.
Results: Almost perfect agreement was observed for the mortality outcome across the three data sources (K = 0.94, 95% CI 0.91–0.98). Fair agreement (weak consistency) was observed for hospitalisation outcomes, including all-cause hospitalisation or mortality (K = 0.35, 95% CI 0.28–0.42), emergency hospitalisation (K = 0.39, 95% CI 0.33–0.46), and hospitalisation or mortality due to cardiovascular disease (K = 0.32, 95% CI 0.19–0.45). The overall trial results remained consistent across data sources for the primary outcome, albeit with varying precision.
Conclusion: Significant discrepancies according to data sources were observed in the recording of secondary care outcomes. Investigators should be cautious when choosing which data source(s) to use to measure outcomes in trials. Future work on linking participant-level data across healthcare settings should consider the variations in diagnostic coding practices. Standardised definitions for outcome measures when using healthcare systems data, and the use of data from different sources for cross-checking and verification, should be encouraged.
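To make the agreement statistic in this abstract concrete, the following is a minimal sketch of Fleiss' kappa for a binary outcome recorded by several data sources, together with the interpretation bands quoted above. The function names `fleiss_kappa` and `agreement_band` and the toy records are illustrative assumptions, not the study's code or data.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of rows, one row per participant.

    Each row holds one label per data source (for example 1 = event
    recorded, 0 = not recorded). Every row must have the same length.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("each participant needs the same number (>= 2) of sources")

    category_totals = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Share of agreeing source pairs for this participant.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_subjects
    total_labels = n_subjects * n_raters
    expected = sum((c / total_labels) ** 2 for c in category_totals.values())
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category occurs")
    return (observed - expected) / (1.0 - expected)


def agreement_band(kappa):
    """Map kappa to the strength categories listed in the abstract."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical toy records: three sources per participant.
toy = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 0, 1]]
toy_kappa = fleiss_kappa(toy)
```

The sketch omits confidence intervals, which the study reports alongside each estimate.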

Journal Article

EFIM: a fast and memory efficient algorithm for high-utility itemset mining

by Tseng, Vincent S. , Zida, Souleymane , Lin, Jerry Chun-Wei in Algorithms , Computer memory , Computer Science

2017

In recent years, high-utility itemset mining has emerged as an important data mining task. However, it remains computationally expensive both in terms of runtime and memory consumption. It is thus an important challenge to design more efficient algorithms for this task. In this paper, we address this issue by proposing a novel algorithm named EFIM (EFficient high-utility Itemset Mining), which introduces several new ideas to more efficiently discover high-utility itemsets. EFIM relies on two new upper bounds named revised sub-tree utility and local utility to more effectively prune the search space. It also introduces a novel array-based utility counting technique named Fast Utility Counting to calculate these upper bounds in linear time and space. Moreover, to reduce the cost of database scans, EFIM proposes efficient database projection and transaction merging techniques named High-utility Database Projection and High-utility Transaction Merging (HTM), also performed in linear time. An extensive experimental study on various datasets shows that EFIM is in general two to three orders of magnitude faster than the state-of-the-art algorithms d2HUP, HUI-Miner, HUP-Miner, FHM and UP-Growth+ on dense datasets and performs quite well on sparse datasets. Moreover, a key advantage of EFIM is its low memory consumption.
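For readers unfamiliar with the task this abstract addresses, here is a brute-force sketch of what "utility" means in high-utility itemset mining. It is not EFIM and uses none of its upper bounds, projection or merging techniques. The names `itemset_utility` and `high_utility_itemsets` and the toy database are assumptions made for illustration.

```python
from itertools import combinations


def itemset_utility(itemset, transactions, unit_profit):
    """Sum of quantity * unit profit over transactions containing the whole itemset."""
    total = 0
    for quantities in transactions:
        if all(item in quantities for item in itemset):
            total += sum(quantities[item] * unit_profit[item] for item in itemset)
    return total


def high_utility_itemsets(transactions, unit_profit, min_utility):
    """Enumerate every itemset by brute force (exponential; for illustration only)."""
    items = sorted({item for quantities in transactions for item in quantities})
    found = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            utility = itemset_utility(itemset, transactions, unit_profit)
            if utility >= min_utility:
                found[itemset] = utility
    return found


# Hypothetical toy database: each transaction maps item -> purchased quantity.
transactions = [{"a": 1, "b": 2}, {"a": 2, "c": 1}, {"b": 1, "c": 3}]
unit_profit = {"a": 5, "b": 2, "c": 1}
result = high_utility_itemsets(transactions, unit_profit, min_utility=10)
```

In this toy data the single item `b` falls below the threshold while other itemsets pass it, and an itemset's utility can be higher or lower than that of its subsets. That lack of monotonicity is why algorithms in this area prune with upper bounds rather than with utility itself.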

Journal Article

Structure-Utility of Descriptive Information of Agricultural Scientific Data from the Perspective of Users

by FAN Zhixuan, WANG Jian, SA Xu, ZHANG Guilan in scientific data , data description , metadata , information utility , eye-tracking

2022

[Purpose/Significance] This paper aims to study the structure-utility relationship of descriptive information of scientific data to provide a new perspective for the theoretical study of scientific data description and a reference for the best description of agricultural scientific data in the digital environment. [Method/Process] Based on information processing theory, the lens model, the probabilistic mental model theory and the adaptive decision-making behavior framework, the relationship model between descriptive information structure and informing utility was constructed. A situational experiment was designed according to the model. In this study, 47 postgraduates from 14 institutes were invited for quasi-experimental observation by using qualitative and quantitative methods such as eye-tracking, semi-structured interview and questionnaire. First, this study used a semi-structured interview to obtain a user's cognitive interpretation of fixation points and collected the descriptive items of agricultural scientific data and their use frequency by encoding the interview text. Second, this study combined descriptive item usage path coding and user judgment confidence to obtain the combination of descriptive items with high utility. Finally, the study used multiple regression analysis to identify the descriptive items with high utility and their predictive ability, and analyzed the impact of data literacy and data utilization type on the utility of descriptive items. [Results/Conclusions] The study identified 42 descriptive items of 11 categories of agricultural scientific data and their usage characteristics. Among them, the top 5 frequently used descriptive items were subject, data, overall description, source and data production information, which played an important role in user relevance judgment. Then this study identified the combination of descriptive items with high utility and found that users' use patterns of descriptive items were diverse. 
Users often needed less information to reach a high level of confidence when judging data "irrelevant" than when judging it "relevant". This study also found that the descriptive items with high utility include source, data, use and evaluation, and data production information. User data literacy and data utilization purpose were found to influence the utility of descriptive information, and the effects of these two factors were preliminarily analyzed. Based on this research, the paper puts forward some suggestions for improving agricultural scientific data metadata and scientific data sharing. In the future, this study will be repeated in groups with different academic backgrounds and data literacy levels, so as to improve the generalizability of the conclusions and construct a more effective structure of scientific data descriptive information.

Journal Article

Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation

by Ibrahim, Mahmoud , Dankar, Fida K. in Accuracy , Big Data , data privacy

2021

Synthetic data provides a privacy protecting mechanism for the broad usage and sharing of healthcare data for secondary purposes. It is considered a safe approach for the sharing of sensitive data as it generates an artificial dataset that contains no identifiable information. Synthetic data is increasing in popularity with multiple synthetic data generators developed in the past decade, yet its utility is still a subject of research. This paper is concerned with evaluating the effect of various synthetic data generation and usage settings on the utility of the generated synthetic data and its derived models. Specifically, we investigate (i) the effect of data pre-processing on the utility of the synthetic data generated, (ii) whether tuning should be applied to the synthetic datasets when generating supervised machine learning models, and (iii) whether sharing preliminary machine learning results can improve the synthetic data models. Lastly, (iv) we investigate whether one utility measure (Propensity score) can predict the accuracy of the machine learning models generated from the synthetic data when employed in real life. We use two popular measures of synthetic data utility, propensity score and classification accuracy, to compare the different settings. We adopt a recent mechanism for the calculation of propensity, which looks carefully into the choice of model for the propensity score calculation. Accordingly, this paper takes a new direction with investigating the effect of various data generation and usage settings on the quality of the generated data and its ensuing models. The goal is to inform on the best strategies to follow when generating and using synthetic data.

Journal Article

On the Analysis of Utility and Risk for Masked Data in Big Data

by Torra, Vicenc in big data , Complex problems , Correlation coefficient

2018

Data privacy studies methods to ensure that disclosure of sensitive information does not take place. Masking methods are applied to databases prior to their release so that intruders cannot access sensitive information. Masking methods modify the data, reducing its quality. Information loss measures have been defined to evaluate to what extent the data is still useful for a particular analysis. In the case of big data, masking data and evaluating its utility is a complex problem. In this paper we focus on information loss measurement and explore whether we can estimate or give bounds on information loss for large data sets using only random subsets of the whole data set.

Conference Proceeding

Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing

by Wallace, Jonathan , Mulvenna, Maurice , Epelde, Gorka in Datasets , Health care policy , Information sharing

2020

The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce. This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data. A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression trees, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed. A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data.
When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility. The results of this study are promising with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making.

Journal Article

Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study

by Fang, Xi , El-Hussuna, Alaa , El Emam, Khaled in Cluster analysis , Datasets , Decision making

2022

A regular task for developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in general or for comparing SDG methods. This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. The workload of interest is the use of synthetic data for logistic regression prediction models, which is a very frequent workload in health research. We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a Generative Adversarial Network, and sequential tree synthesis). These metrics were computed by averaging across 20 synthetic data sets from the same generative model. The metrics were then tested on their ability to rank the SDG methods based on prediction performance. Prediction performance was defined as the difference between each of the area under the receiver operating characteristic curve and area under the precision-recall curve values on synthetic data logistic regression prediction models versus real data models. The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of real and synthetic joint distributions. This study has validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set. The Hellinger distance metric can be used to evaluate and compare alternate SDG methods.
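As a rough illustration of the distance named in this abstract, the sketch below computes the Hellinger distance between the empirical distributions of a single categorical variable in a real and a synthetic sample. This is a deliberate simplification: the study uses a multivariate version built on a Gaussian copula, which is not reproduced here. The function name `hellinger_distance` and the sample values are assumptions.

```python
import math
from collections import Counter


def hellinger_distance(real_values, synthetic_values):
    """Hellinger distance between two empirical categorical distributions.

    Returns a value between 0 (identical distributions) and 1 (no overlap).
    """
    if not real_values or not synthetic_values:
        raise ValueError("both samples must be non-empty")
    real_counts = Counter(real_values)
    synthetic_counts = Counter(synthetic_values)
    n_real = len(real_values)
    n_synthetic = len(synthetic_values)

    # Bhattacharyya coefficient: sum of sqrt(p * q) over all categories.
    categories = set(real_counts) | set(synthetic_counts)
    overlap = sum(
        math.sqrt((real_counts[c] / n_real) * (synthetic_counts[c] / n_synthetic))
        for c in categories
    )
    # Clamp to guard against tiny negative values from floating-point rounding.
    return math.sqrt(max(0.0, 1.0 - overlap))


# Hypothetical samples of one categorical variable.
real_sample = ["yes", "no", "no", "yes"]
close_sample = ["no", "yes", "yes", "no"]
disjoint_sample = ["unknown", "unknown"]
```

A smaller distance indicates that the synthetic variable's distribution is closer to the real one. Comparing one variable at a time ignores the joint structure that the multivariate metric is designed to capture.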

Journal Article

Privacy-Preserving Synthetic Data Generation Method for IoT-Sensor Network IDS Using CTGAN

by Son, Yunsik , Kim, Young-Tak , Alabdulwahab, Saleh in Analysis , Artificial intelligence , Comparative analysis

2024

The increased usage of IoT networks brings about new privacy risks, especially when intrusion detection systems (IDSs) rely on large datasets for machine learning (ML) tasks and depend on third parties for storing and training the ML-based IDS. This study proposes a privacy-preserving synthetic data generation method using a conditional tabular generative adversarial network (CTGAN) aimed at maintaining the utility of IoT sensor network data for IDS while safeguarding privacy. We integrate differential privacy (DP) with CTGAN by employing controlled noise injection to mitigate privacy risks. The technique involves dynamic distribution adjustment and quantile matching to balance the utility–privacy tradeoff. The results indicate a significant improvement in data utility compared to the standard DP method, achieving a KS test score of 0.80 while minimizing privacy risks such as singling out, linkability, and inference attacks. This approach ensures that synthetic datasets can support intrusion detection without exposing sensitive information.

Journal Article

Enhancing data utility in differential privacy via microaggregation-based k-anonymity

by Domingo-Ferrer, Josep , Sánchez, David , Soria-Comas, Jordi in Accuracy , Data analysis , Datasets

2014

It is not uncommon in the data anonymization literature to oppose the “old” k-anonymity model to the “new” differential privacy model, which offers more robust privacy guarantees. Yet, it is often disregarded that the utility of the anonymized results provided by differential privacy is quite limited, due to the amount of noise that needs to be added to the output, or because utility can only be guaranteed for a restricted type of queries. This is in contrast with k-anonymity mechanisms, which make no assumptions on the uses of anonymized data while focusing on preserving data utility from a general perspective. In this paper, we show that a synergy between differential privacy and k-anonymity can be found: k-anonymity can help improve the utility of differentially private responses to arbitrary queries. We devote special attention to the utility improvement of differentially private published data sets. Specifically, we show that the amount of noise required to fulfill ε-differential privacy can be reduced if noise is added to a k-anonymous version of the data set, where k-anonymity is reached through a specially designed microaggregation of all attributes. As a result of noise reduction, the general analytical utility of the anonymized output is increased. The theoretical benefits of our proposal are illustrated in a practical setting with an empirical evaluation on three data sets.
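The microaggregation idea mentioned in this abstract can be sketched for a single numeric attribute: sort the records, form groups of at least k neighbouring values, and release each group's mean in place of the original values. This is a generic fixed-size microaggregation, not the specially designed procedure from the paper, and it adds no noise, so it provides no differential privacy guarantee on its own. The function name `microaggregate` and the sample values are assumptions.

```python
from collections import Counter


def microaggregate(values, k):
    """Replace each value by the mean of its group of >= k neighbouring values.

    Groups are formed over the sorted values; the last group absorbs any
    remainder, so it may hold up to 2k - 1 records. Output keeps input order.
    """
    if k < 1 or len(values) < k:
        raise ValueError("need k >= 1 and at least k values")
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_groups = len(values) // k
    released = [0.0] * len(values)
    for g in range(n_groups):
        start = g * k
        end = (g + 1) * k if g < n_groups - 1 else len(values)
        group = order[start:end]
        centroid = sum(values[i] for i in group) / len(group)
        for i in group:
            released[i] = centroid
    return released


# Hypothetical attribute values, for example ages or incomes.
original = [10, 1, 12, 2, 13, 3, 11]
released = microaggregate(original, k=3)
# Every released value is shared by at least k records.
smallest_group = min(Counter(released).values())
```

Because each released value is shared by at least k records, no record can be distinguished from fewer than k - 1 others on this attribute, which is the k-anonymity property the abstract builds on.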

Journal Article

k-Degree anonymity and edge selection: improving data utility in large networks

by Herrera-Joancomartí, Jordi , Torra, Vicenç , Casas-Roma, Jordi in Algorithms , Analysis , Anonymity

2017

The problem of anonymization in large networks and the utility of released data are considered in this paper. Although there are some anonymization methods for networks, most of them cannot be applied in large networks because of their complexity. In this paper, we devise a simple and efficient algorithm for k-degree anonymity in large networks. Our algorithm constructs a k-degree anonymous network with the minimum number of edge modifications. We compare our algorithm with other well-known k-degree anonymous algorithms and demonstrate that information loss in real networks is lowered. Moreover, we consider the edge relevance in order to improve the data utility on anonymized networks. By considering the neighbourhood centrality score of each edge, we preserve the most important edges of the network, reducing the information loss and increasing the data utility. An evaluation of clustering processes is performed on our algorithm, proving that edge neighbourhood centrality increases data utility. Lastly, we apply our algorithm to different large real datasets and demonstrate its efficiency and practical utility.
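The property targeted by this abstract can be checked in a few lines: a network is k-degree anonymous when every degree value that occurs in it is shared by at least k nodes. The sketch below only tests that property. It does not perform the edge modifications or the centrality-based edge selection the paper describes. The function names and the toy graphs are assumptions, and nodes without edges are not represented in this edge-list input.

```python
from collections import Counter


def node_degrees(edges):
    """Degree of every node appearing in an undirected edge list."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return degrees


def is_k_degree_anonymous(edges, k):
    """True if each degree value present is shared by at least k nodes."""
    degree_frequency = Counter(node_degrees(edges).values())
    return all(count >= k for count in degree_frequency.values())


# Hypothetical toy graphs.
cycle = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]  # every node has degree 2
path = [("a", "b"), ("b", "c")]  # degrees 1, 2, 1
```

In the path, the middle node is the only one with degree 2, so an attacker who knows a target's degree could single it out. An anonymization algorithm would add or remove edges until no such unique degree remains.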

Journal Article
