Catalogue Search | MBRL

Evaluating bias due to data linkage error in electronic healthcare records

by Harron, Katie; Muller-Pebody, Berit; Wade, Angie in Analysis, Bias, Data analysis

2014

Background: Linkage of electronic healthcare records is becoming increasingly important for research purposes. However, linkage error due to mis-recorded or missing identifiers can lead to biased results. We evaluated the impact of linkage error on estimated infection rates using two different methods for classifying links: highest-weight (HW) classification using probabilistic match weights and prior-informed imputation (PII) using match probabilities. Methods: A gold-standard dataset was created through deterministic linkage of unique identifiers in admission data from two hospitals and infection data recorded at the hospital laboratories (original data). Unique identifiers were then removed and data were re-linked by date of birth, sex and Soundex using two classification methods: i) HW classification, accepting the candidate record with the highest weight exceeding a threshold, and ii) PII, imputing values from a match probability distribution. To evaluate methods for linking data with different error rates, non-random error and different match rates, we generated simulation data. Each set of simulated files was linked using both classification methods. Infection rates in the linked data were compared with those in the gold-standard data. Results: In the original gold-standard data, 1496/20924 admissions linked to an infection. In the linked original data, PII provided the least biased results: 1481 and 1457 infections (upper/lower thresholds) compared with 1316 and 1287 (HW upper/lower thresholds). In the simulated data, substantial bias (up to 112%) was introduced when linkage error varied by hospital. Bias was also greater when the match rate was low or the identifier error rate was high and, in these cases, PII performed better than HW classification at reducing bias due to false-matches. Conclusions: This study highlights the importance of evaluating the potential impact of linkage error on results. PII can help incorporate linkage uncertainty into analysis and reduce bias due to linkage error, without requiring identifiers.

Journal Article

Estimating parameters for probabilistic linkage of privacy-preserved datasets

by Boyd, James H.; Brown, Adrian P.; Ferrante, Anna M. in Agreements, Algorithms, Analysis

2017

Background: Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters. Methods: Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates from zero to 20%. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages. Linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm allowing linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. Linkage quality using the F-measure at the estimated threshold values was also compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data. Results: Linkage of the synthetic datasets using the estimated probabilities produced an F-measure that was comparable to the F-measure using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher than the F-measure using calculated probabilities. Further, the threshold estimation yielded results for F-measure that were only slightly below the highest possible for those probabilities. Conclusions: The method appears highly accurate across a spectrum of datasets with varying degrees of error. As there are few alternatives for parameter estimation, the approach is a major step towards providing a complete operational approach for probabilistic linkage of privacy-preserved datasets.

Journal Article

Evaluation of record linkage of two large administrative databases in a middle income country: stillbirths and notifications of dengue during pregnancy in Brazil

by Harron, Katie; Fiaccone, Rosemeire L.; Teixeira, Maria Glória in Algorithms, Analysis, Brazil

2017

Background: Due to the increasing availability of individual-level information across different electronic datasets, record linkage has become an efficient and important research tool. High quality linkage is essential for producing robust results. The objective of this study was to describe the process of preparing and linking national Brazilian datasets, and to compare the accuracy of different linkage methods for assessing the risk of stillbirth due to dengue in pregnancy. Methods: We linked mothers and stillbirths in two routinely collected datasets from Brazil for 2009–2010: for dengue in pregnancy, notifications of infectious diseases (SINAN); for stillbirths, mortality (SIM). Since there was no unique identifier, we used probabilistic linkage based on maternal name, age and municipality. We compared two probabilistic approaches, each with two thresholds: 1) a bespoke linkage algorithm; 2) a standard linkage software package widely used in Brazil (ReclinkIII), and used manual review to identify further links. Sensitivity and positive predictive value (PPV) were estimated using a subset of gold-standard data created through manual review. We examined the characteristics of false-matches and missed-matches to identify any sources of bias. Results: From records of 678,999 dengue cases and 62,373 stillbirths, the gold-standard linkage identified 191 cases. The bespoke linkage algorithm with a conservative threshold produced 131 links, with sensitivity = 64.4% (68 missed-matches) and PPV = 92.5% (8 false-matches). Manual review of uncertain links identified an additional 37 links, increasing sensitivity to 83.7%. The bespoke algorithm with a relaxed threshold identified 132 true matches (sensitivity = 69.1%), but introduced 61 false-matches (PPV = 68.4%). ReclinkIII produced lower sensitivity and PPV than the bespoke linkage algorithm. Linkage error was not associated with any recorded study variables. Conclusion: Despite a lack of unique identifiers for linking mothers and stillbirths, we demonstrate a high standard of linkage of large routine databases from a middle income country. Probabilistic linkage and manual review were essential for accurately identifying cases for a case-control study, but this approach may not be feasible for larger databases or for linkage of more common outcomes.
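As an illustrative aside that is not part of the catalogue record, the sensitivity and PPV quoted for the relaxed threshold can be reproduced from the counts given in the abstract (132 true matches, 61 false-matches, 191 gold-standard links). The function name `linkage_accuracy` is hypothetical, and the sketch assumes the usual definitions: sensitivity is true matches over gold-standard links, and PPV is true matches over all accepted links.

```python
def linkage_accuracy(true_matches: int, false_matches: int, gold_links: int) -> tuple[float, float]:
    """Return (sensitivity, PPV) for a set of accepted links.

    sensitivity = true matches / all links in the gold standard
    PPV         = true matches / all links accepted by the algorithm
    """
    sensitivity = true_matches / gold_links
    ppv = true_matches / (true_matches + false_matches)
    return sensitivity, ppv


# Counts taken from the abstract (bespoke algorithm, relaxed threshold).
sens, ppv = linkage_accuracy(true_matches=132, false_matches=61, gold_links=191)
print(f"sensitivity = {sens:.1%}, PPV = {ppv:.1%}")
# prints "sensitivity = 69.1%, PPV = 68.4%"
```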

Journal Article

Application of Privacy-Preserving Techniques in Operational Record Linkage Centres

by Boyd, James H.; Randall, Sean M.; Ferrante, Anna M. in Data Provider, Disclosure Risk, Linkage Quality

2015

Record linkage is the process of bringing together data relating to the same individual within and between different datasets. These integrated datasets provide diverse and rich resources for researchers without the cost associated with additional data collection. By their nature, record linkage systems deal with large volumes of data and require complex organizational and technical infrastructure. Bringing together information from different sources often requires many different organizations to collaborate and share data, which presents challenges around data privacy and confidentiality. Various processes and protocols have been developed to protect the privacy of individuals during the record linkage process. These include data governance procedures covering people, processes and information technology, role separation and restricted data flows. Combinations of these are used to mitigate risks to privacy by limiting access to certain information. In addition, privacy-preserving record linkage techniques can be utilized to further reduce the risk to privacy, by removing all personal identifying information from linkage protocols. This chapter reviews current practices, processes and developments for maintaining security and privacy as applied in existing record linkage centres. Models for role separation and data flows are outlined and evaluated, and requirements for an effective privacy-preserving record linkage protocol are described.

Book Chapter

Understanding factors influencing linkage to HIV care in a rural setting, Mbeya, Tanzania: qualitative findings of a mixed methods study

by Lerebo, Wondwossen; Mushi, Adiel K.; Sanga, Erica S. in Acquired immune deficiency syndrome, Adult, AIDS

2019

Background: In remote rural Tanzania, the rate of linkage into HIV care was estimated at 28% in 2014. This study explored facilitators and barriers to linkage to HIV care at individual/patient, health care provider, health system, and contextual levels to inform eventual design of interventions to improve linkage to HIV care. Methods: We conducted a descriptive qualitative study nested in a cohort study of 1012 newly diagnosed HIV-positive individuals in Mbeya region between August 2014 and July 2015. We conducted 8 focus group discussions and 10 in-depth interviews with recently diagnosed HIV-positive individuals and 20 individual interviews with healthcare providers. Transcripts were analyzed inductively using thematic content analysis. The emergent themes were then deductively fitted into the four-level ecological model. Results: We identified multiple factors influencing linkage to care. HIV status disclosure, support from family/relatives and having symptoms of disease were reported to facilitate linkage at the individual level. Fear of stigma, lack of disclosure, denial, being asymptomatic, belief in witchcraft and spiritual beliefs were barriers identified at the individual level. At the provider level, support and a good patient-staff relationship facilitated linkage, while negative attitudes and abusive language were reported barriers to successful linkage. Clear referral procedures and well-organized clinic procedures were system-level facilitators, whereas poorly organized clinic procedures and visit schedules, overcrowding, long waiting times and lack of resources were reported barriers. Distance and transport costs to HIV care centers were important contextual factors influencing linkage to care. Conclusion: Linkage to HIV care is an important step towards proper management of HIV. We found that access and linkage to care are influenced positively and negatively at all levels; however, the individual-level and health system-level factors were most prominent in this setting. Interventions must address issues around stigma, denial and inadequate awareness of the value of early linkage to care, and improve the capacity of HIV treatment/care clinics to implement quality care, particularly in light of adopting the ‘Test and Treat’ model of HIV treatment and care recommended by the World Health Organization.

Journal Article

Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage

by Tromp, Miranda; Ravelli, Anita C.; Bonsel, Gouke J. in Agreements, Bias, Biological and medical sciences

2011

The objective was to gain insight into the performance of deterministic record linkage (DRL) vs. probabilistic record linkage (PRL) strategies under different conditions by varying the frequency of registration errors and the amount of discriminating power. The study was a simulation in which data characteristics were varied to create a range of realistic linkage scenarios. For each scenario, we compared the number of misclassifications (number of false nonlinks and false links) made by the different linking strategies: deterministic full, deterministic N−1, and probabilistic. The full deterministic strategy produced the lowest number of false positive links but at the expense of missing considerable numbers of matches, depending on the error rate of the linking variables. The probabilistic strategy outperformed the deterministic strategy (full or N−1) across all scenarios. A deterministic strategy can match the performance of a probabilistic approach provided that the decision about which disagreements should be tolerated is made correctly. This requires a priori knowledge about the quality of all linking variables, whereas this information is inherently generated by a probabilistic strategy. PRL is more flexible and provides data about the quality of the linkage process that in turn can minimize the degree of linking errors, given the data provided.
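The three strategies named in this abstract can be sketched as follows. This is a minimal illustration and not the authors' simulation code: the linking variables and the m- and u-probabilities in `PARAMS` are invented for the example, and the probabilistic score uses the standard Fellegi-Sunter log-ratio weights as an assumed stand-in for the paper's probabilistic strategy.

```python
import math

# Hypothetical parameters per linking variable (illustrative values only):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
PARAMS = {
    "surname": (0.95, 0.01),
    "birth_date": (0.98, 0.005),
    "sex": (0.99, 0.5),
    "postcode": (0.90, 0.02),
}


def agreement_pattern(a: dict, b: dict) -> dict:
    """Map each linking variable to True if the two records agree on it."""
    return {field: a[field] == b[field] for field in PARAMS}


def deterministic_full(pattern: dict) -> bool:
    """Link only if every variable agrees."""
    return all(pattern.values())


def deterministic_n_minus_1(pattern: dict) -> bool:
    """Link if at most one variable disagrees, whichever one it is."""
    return sum(not agrees for agrees in pattern.values()) <= 1


def match_weight(pattern: dict) -> float:
    """Sum of log2 agreement/disagreement weights (Fellegi-Sunter style)."""
    total = 0.0
    for field, agrees in pattern.items():
        m, u = PARAMS[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total


base = {"surname": "jansen", "birth_date": "1980-02-01", "sex": "F", "postcode": "1011"}
sex_differs = {**base, "sex": "M"}
birth_date_differs = {**base, "birth_date": "1980-02-10"}

for other in (sex_differs, birth_date_differs):
    pattern = agreement_pattern(base, other)
    print(deterministic_full(pattern), deterministic_n_minus_1(pattern), round(match_weight(pattern), 2))
```

Under these assumed parameters, both pairs fail the full deterministic rule and both pass N−1, but their probabilistic weights differ because the weight depends on which variable disagrees. That is the information a deterministic rule would need to be given a priori.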

Journal Article

The Danish Medical Birth Register

by Bliddal, Mette; Broe, Anne; Olsen, Jørn in Adult, Birth, Body mass index

2018

The Danish Medical Birth Register was established in 1973. It is a key component of the Danish health information system. The register enables monitoring of the health of pregnant women and their offspring, it provides data for quality assessment of the perinatal care in Denmark, and it is used extensively for research. The register underwent major changes in construction and content in 1997, and new variables have been added during the last 20 years. The aim was to provide an updated description of the register focusing on structure, content, and coverage since 1997. The register includes data on all births in Denmark and comprises primarily data from the Danish National Patient Registry, supplemented with forms on home deliveries and stillbirths. It contains information on maternal age provided by the Civil Registration System. Information on pre-pregnancy body mass index and smoking in the first trimester is collected in early pregnancy (first antenatal visit). The individual-level data can be linked to other Danish health registers such as the National Patient Registry and the Danish National Prescription Registry. The register informs several other registers/databases such as the Danish Twin Registry and the Danish Fetal Medicine Database. Aggregated data can be publicly accessed on the Danish Health Data Authority web page (www.esundhed.dk/sundhedsregistre/MFR). Researchers can obtain access to individual-level pseudoanonymised data via servers at Statistics Denmark and the Danish Health Data Authority.

Journal Article

High-density LD-based structural variations analysis in ten Native and Mestizo Mexican populations

by Villa-Angulo, Carlos; Mateos-Valenzuela, Adriana Griselda; Villa-Angulo, Rafael in ABCA1 protein, Adipose tissue, Analysis

2025

The main objective of this study was to perform a genome-wide characterization of Structural Variations (SV) based on the deviation of the expected short-range Linkage Disequilibrium (LD) between Single Nucleotide Polymorphisms (SNPs) in 10 Native and Mestizo Mexican populations. We used a panel of 785,663 SNP genotypes, sampled from 383 individuals, of which 71 belonged to ethnic populations and 312 belonged to mestizo populations. The total number of variations found among all populations was 4,375, involving an average of 19,438 SNPs per population, which corresponds to 3.14% of the total average of SNPs per population. The mean SV size varied from 2,845–8,646 kb across populations (with a mean SV size of 6,161 kb over all populations) and an average of 50.14 SNPs per SV. By grouping all variations across all populations in the sample we defined 506 regions, in 54 (11%) of which all 10 populations coincided. The total number of genes covered by these variations was 8,443. Among these genes, we identified some specifically related to Mexican health, such as FTO and ABCA1, associated with obesity, adipose tissue function, and the distribution of fat in the Mexican population, and ELMO1, associated with susceptibility to diabetic nephropathy and type II diabetes, among others. In summary, our results add new evidence in support of the hypothesis that SVs based on the deviation of the expected short-range LD between SNPs capture the structure and the demographic history of populations, and represent potential targets for association of SVs with population-specific diseases.

Journal Article

A blinded evaluation of privacy preserving record linkage with Bloom filters

by Boyd, James; Randall, Sean; Brown, Adrian in Analysis, Data mining, Datasets

2022

Background: Privacy preserving record linkage (PPRL) methods using Bloom filters have shown promise for use in operational linkage settings. However, real-world evaluations are required to confirm their suitability in practice. Methods: An extract of records from the Western Australian (WA) Hospital Morbidity Data Collection 2011–2015 and WA Death Registrations 2011–2015 was encoded to Bloom filters, and then linked using privacy-preserving methods. Results were compared to a traditional, un-encoded linkage of the same datasets using the same blocking criteria to enable direct investigation of the comparison step. The encoded linkage was carried out in a blinded setting, where there was no access to un-encoded data or a ‘truth set’. Results: The PPRL method using Bloom filters provided similar linkage quality to the traditional un-encoded linkage, with 99.3% of ‘groupings’ identical between privacy preserving and clear-text linkage. Conclusion: The Bloom filter method appears suitable for use in situations where clear-text identifiers cannot be provided for linkage.
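For readers unfamiliar with the technique, the general idea of Bloom filter encoding for record linkage can be sketched as below. This is a simplified illustration under stated assumptions, not the encoding used in the study: the filter length, number of hash functions, shared key, bigram padding and the use of a Dice coefficient for comparison are all choices made for the example, and the function names are hypothetical.

```python
import hashlib
import hmac


def bigrams(value: str) -> set[str]:
    """Split a padded, lower-cased string into overlapping two-character grams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value: str, size: int = 512, num_hashes: int = 4,
                 secret: bytes = b"shared-key") -> set[int]:
    """Return the set of bit positions switched on for this value.

    Each bigram is hashed num_hashes times with a keyed hash (HMAC-SHA256),
    so only parties holding the secret can build comparable filters.
    """
    positions = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            digest = hmac.new(secret, f"{k}|{gram}".encode("utf-8"), hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % size)
    return positions


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient of two filters: 2 * shared bits / total bits set."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


same = dice(bloom_encode("katherine"), bloom_encode("katherine"))
close = dice(bloom_encode("katherine"), bloom_encode("catherine"))
far = dice(bloom_encode("katherine"), bloom_encode("robert"))
print(same, round(close, 2), round(far, 2))
```

Because similar strings share most of their bigrams, their filters share most of their bits, so approximate comparison remains possible without access to the clear-text values.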

Journal Article

Validation of data quality in the Swedish National Register for Breast Cancer

by Sandelin, Kerstin; Asterkvist, Annette; Löfgren, Lars in Adult, Agreements, Benchmarking

2019

Background: The National Breast Cancer Register (NBCR) of Sweden was launched in 2008 and is used for quality assurance, benchmarking, and research. Its three reporting forms encompass Notification, Adjuvant therapy and Follow-up. Target levels are set by national and international guidelines. This national validation assessed data quality of the register. Methods: Data recorded through the Notification form were evaluated for completeness, timeliness, comparability and validity. Completeness was assessed by cross-linkage to the Swedish Cancer Register (SCR). Comparability was analyzed by comparing registration routines in NBCR with national and international guidelines. Timeliness was defined as the difference between the earliest date of diagnosis and the reporting date to NBCR. Validity was assessed by re-abstraction of medical chart data for 800 randomly selected patients diagnosed in 2013. Results: The completeness of the NBCR was high with a coverage across regions and years (2010–2014) of 99.9%. Of all incident cases reported to the NBCR in 2013 (N = 8654), 98.5% were included within 12 months and differences between health regions were essentially negligible. Coding procedures followed guidelines and were uniformly adhered to. The proportion of missing values was < 5% for most variables and reported information generally had high exact agreement (> 90%). Conclusions: Completeness of data, comparability and agreement in the NBCR was high. For clinical quality purposes and benchmarking, improved timeliness is warranted. Assessment of validity has resulted in a thorough review of all variables included in the Notification form with clarifications and revision of selected variables.

Journal Article
