Catalogue Search | MBRL

The phenotypic legacy of admixture between modern humans and Neandertals

by Li, Rongling; Verma, Shefali S.; Jarvik, Gail P. in Alleles, Animals, Deoxyribonucleic acid

2016

Many modern human genomes retain DNA inherited from interbreeding with archaic hominins, such as Neandertals, yet the influence of this admixture on human traits is largely unknown. We analyzed the contribution of common Neandertal variants to over 1000 electronic health record (EHR)–derived phenotypes in ~28,000 adults of European ancestry. We discovered and replicated associations of Neandertal alleles with neurological, psychiatric, immunological, and dermatological phenotypes. Neandertal alleles together explained a significant fraction of the variation in risk for depression and skin lesions resulting from sun exposure (actinic keratosis), and individual Neandertal alleles were significantly associated with specific human phenotypes, including hypercoagulation and tobacco use. Our results establish that archaic admixture influences disease risk in modern humans, provide hypotheses about the effects of hundreds of Neandertal haplotypes, and demonstrate the utility of EHR data in evolutionary analyses.

Journal Article

GWAS and enrichment analyses of non-alcoholic fatty liver disease identify new trait-associated genes and pathways across eMERGE Network

by Larson, Eric B.; Jarvik, Gail P.; Xanthakos, Stavra A. in Acyltransferases, Adult, Aged

2019

Background: Non-alcoholic fatty liver disease (NAFLD) is a common chronic liver illness with a genetically heterogeneous background that can be accompanied by considerable morbidity and attendant health care costs. The pathogenesis and progression of NAFLD are complex, with many unanswered questions. We conducted genome-wide association studies (GWASs) using both adult and pediatric participants from the Electronic Medical Records and Genomics (eMERGE) Network to identify novel genetic contributors to this condition.
Methods: First, a natural language processing (NLP) algorithm was developed, tested, and deployed at each site to identify 1106 NAFLD cases and 8571 controls and histological data from liver tissue in 235 available participants. These include 1242 pediatric participants (396 cases, 846 controls). The algorithm included billing codes, text queries, laboratory values, and medication records. Next, GWASs were performed on NAFLD cases and controls and case-only analyses using histologic scores and liver function tests adjusting for age, sex, site, ancestry, PC, and body mass index (BMI).
Results: Consistent with previous results, a robust association was detected for the PNPLA3 gene cluster in participants with European ancestry. At the PNPLA3-SAMM50 region, three SNPs, rs738409, rs738408, and rs3747207, showed strongest association (best SNP rs738409 p = 1.70 × 10⁻²⁰). This effect was consistent in both pediatric (p = 9.92 × 10⁻⁶) and adult (p = 9.73 × 10⁻¹⁵) cohorts. Additionally, this variant was also associated with disease severity and NAFLD Activity Score (NAS) (p = 3.94 × 10⁻⁸, beta = 0.85). PheWAS analysis links this locus to a spectrum of liver diseases beyond NAFLD with a novel negative correlation with gout (p = 1.09 × 10⁻⁴). We also identified novel loci for NAFLD disease severity, including one novel locus for NAS score near IL17RA (rs5748926, p = 3.80 × 10⁻⁸), and another near ZFP90-CDH1 for fibrosis (rs698718, p = 2.74 × 10⁻¹¹). Post-GWAS and gene-based analyses identified more than 300 genes that were used for functional and pathway enrichment analyses.
Conclusions: In summary, this study demonstrates clear confirmation of a previously described NAFLD risk locus and several novel associations. Further collaborative studies including an ethnically diverse population with well-characterized liver histologic features of NAFLD are needed to further validate the novel findings.

Journal Article

Frequency of genomic secondary findings among 21,915 eMERGE network participants

by Rehm, Heidi L.; Linder, Jodell E.; Rasmussen-Torvik, Laura J. in Biomedical and Life Sciences, Biomedicine, Consent

2020

Purpose: Discovering an incidental finding (IF) or secondary finding (SF) is a potential result of genomic testing, but few data exist describing types and frequencies of SFs likely to appear in broader clinical populations.
Methods: The Electronic Medical Records and Genomics Network Phase III (eMERGE III) developed a CLIA-compliant sequencing panel of 109 genes and 1551 variants of clinical relevance or research interest and deployed this panel at ten clinical sites. We evaluated medically actionable SFs across 67 genes and 14 single-nucleotide variants (SNVs) in a diverse cohort of 21,915 participants drawn from a variety of settings (e.g., primary care, biobanks, specialty clinics).
Results: Correcting for testing indication, we found a 3.02% overall frequency of SFs; 2.54% from 59 genes the American College of Medical Genetics and Genomics recommends for SF return, and 0.48% in other genes, primarily HFE and PALB2. SFs associated with cancer susceptibility were most frequent (1.38%), followed by cardiovascular diseases (0.87%), and lipid disorders (0.50%). After removing HFE, the frequency of SFs and proportion of pathogenic versus likely pathogenic SFs did not differ in those self-identifying as White versus others.
Conclusion: Here we present frequencies and types of medically actionable secondary findings to support informed decision making by patients, participants, and practitioners engaged in genomic medicine.

Journal Article

Multi-ancestry genome- and phenome-wide association studies of diverticular disease in electronic health records with natural language processing enriched phenotyping algorithm

by Peissig, Peggy L.; Rasmussen-Torvik, Laura J.; Borthwick, Kenneth M. in Abdomen, Algorithms, Analysis

2023

Diverticular disease (DD) is one of the most prevalent conditions encountered by gastroenterologists, affecting ~50% of Americans before the age of 60. Our aim was to identify genetic risk variants and clinical phenotypes associated with DD, leveraging multiple electronic health record (EHR) data sources of 91,166 multi-ancestry participants with a Natural Language Processing (NLP) technique. We developed an NLP-enriched phenotyping algorithm that incorporated colonoscopy or abdominal imaging reports to identify patients with diverticulosis and diverticulitis from multicenter EHRs. We performed genome-wide association studies (GWAS) of DD in European, African and multi-ancestry participants, followed by phenome-wide association studies (PheWAS) of the risk variants to identify their potential comorbid/pleiotropic effects in clinical phenotypes. Our developed algorithm showed a significant improvement in patient classification performance for DD analysis (algorithm PPVs ≥ 0.94), with up to a 3.5-fold increase in the number of identified patients compared with the traditional method. Ancestry-stratified analyses of diverticulosis and diverticulitis of the identified subjects replicated the well-established associations of ARHGAP15 loci with DD, showing overall intensified GWAS signals in diverticulitis patients compared to diverticulosis patients. Our PheWAS analyses identified significant associations between the DD GWAS variants and circulatory system, genitourinary, and neoplastic EHR phenotypes. As the first multi-ancestry GWAS-PheWAS study, we showcased that heterogeneous EHR data can be mapped through an integrative analytical pipeline and reveal significant genotype-phenotype associations with clinical interpretation. A systematic framework to process unstructured EHR data with NLP could advance a deep and scalable phenotyping for better patient identification and facilitate etiological investigation of a disease with multilayered data.

Journal Article

Negation’s Not Solved: Generalizability Versus Optimizability in Clinical Natural Language Processing

by Clark, Cheryl; Miller, Timothy; Coarr, Matt in Adaptation, Algorithms, Annotations

2014

A review of published work in clinical natural language processing (NLP) may suggest that the negation detection task has been "solved." This work proposes that an optimizable solution does not equal a generalizable solution. We introduce a new machine learning-based Polarity Module for detecting negation in clinical text, and extensively compare its performance across domains. Using four manually annotated corpora of clinical text, we show that negation detection performance suffers when there is no in-domain development (for manual methods) or training data (for machine learning-based methods). Various factors (e.g., annotation guidelines, named entity characteristics, the amount of data, and lexical and syntactic context) play a role in making generalizability difficult, but none completely explains the phenomenon. Furthermore, generalizability remains challenging because it is unclear whether to use a single source for accurate data, combine all sources into a single model, or apply domain adaptation methods. The most reliable means to improve negation detection is to manually annotate in-domain training data (or, perhaps, manually modify rules); this is a strategy for optimizing performance, rather than generalizing it. These results suggest a direction for future work in domain-adaptive and task-adaptive methods for clinical NLP.

Journal Article

Identification of Four Novel Loci in Asthma in European American and African American Populations

by Harley, John B.; Peissig, Peggy L.; Mentch, Frank in Adolescent, Adult, African Americans

2017

Rationale: Despite significant advances in knowledge of the genetic architecture of asthma, specific contributors to the variability in the burden between populations remain to be uncovered.
Objectives: To identify additional genetic susceptibility factors of asthma in European American and African American populations.
Methods: A phenotyping algorithm mining electronic medical records was developed and validated to recruit cases with asthma and control subjects from the Electronic Medical Records and Genomics network. Genome-wide association analyses were performed in pediatric and adult asthma cases and control subjects with European American and African American ancestry followed by metaanalysis. Nominally significant results were reanalyzed conditioning on allergy status.
Measurements and Main Results: The validation of the algorithm yielded an average of 95.8% positive predictive values for both cases and control subjects. The algorithm accrued 21,644 subjects (65.83% European American and 34.17% African American). We identified four novel population-specific associations with asthma after metaanalyses: loci 6p21.31, 9p21.2, and 10q21.3 in the European American population, and the PTGES gene in African Americans. TEK at 9p21.2, which encodes TIE2, has been shown to be involved in remodeling the airway wall in asthma, and the association remained significant after conditioning by allergy. PTGES, which encodes the prostaglandin E synthase, has also been linked to asthma, where deficient prostaglandin E2 synthesis has been associated with airway remodeling.
Conclusions: This study adds to understanding of the genetic architecture of asthma in European Americans and African Americans and reinforces the need to study populations of diverse ethnic backgrounds to identify shared and unique genetic predictors of asthma.

Journal Article

Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data

by Roden, Dan M; Masys, Daniel R; Zink, Raquel in 631/208/205, 692/308/575, 692/699

2013

When applied in large scale to electronic medical record data, the PheWAS approach replicates GWAS associations and reveals potentially new pleiotropic associations. Candidate gene and genome-wide association studies (GWAS) have identified genetic variants that modulate risk for human disease; many of these associations require further study to replicate the results. Here we report the first large-scale application of the phenome-wide association study (PheWAS) paradigm within electronic medical records (EMRs), an unbiased approach to replication and discovery that interrogates relationships between targeted genotypes and multiple phenotypes. We scanned for associations between 3,144 single-nucleotide polymorphisms (previously implicated by GWAS as mediators of human traits) and 1,358 EMR-derived phenotypes in 13,835 individuals of European ancestry. This PheWAS replicated 66% (51/77) of sufficiently powered prior GWAS associations and revealed 63 potentially pleiotropic associations with P < 4.6 × 10⁻⁶ (false discovery rate < 0.1); the strongest of these novel associations were replicated in an independent cohort (n = 7,406). These findings validate PheWAS as a tool to allow unbiased interrogation across multiple phenotypes in EMR-based cohorts and to enhance analysis of the genomic basis of human disease.

Journal Article

Development of a machine learning model to predict mild cognitive impairment using natural language processing in the absence of screening

by Penfold, Robert B.; Thompson, Ella; Pabiniak, Chester in Alzheimer Disease - diagnosis, Alzheimer's disease, Cognition & reasoning

2022

Background: Patients and their loved ones often report symptoms or complaints of cognitive decline that clinicians note in free clinical text, but no structured screening or diagnostic data are recorded. These symptoms/complaints may be signals that predict who will go on to be diagnosed with mild cognitive impairment (MCI) and ultimately develop Alzheimer’s Disease or related dementias. Our objective was to develop a natural language processing system and prediction model for identification of MCI from clinical text in the absence of screening or other structured diagnostic information.
Methods: There were two populations of patients: 1794 participants in the Adult Changes in Thought (ACT) study and 2391 patients in the general population of Kaiser Permanente Washington. All individuals had standardized cognitive assessment scores. We excluded patients with a diagnosis of Alzheimer’s Disease, Dementia or use of donepezil. We manually annotated 10,391 clinic notes to train the NLP model. Standard Python code was used to extract phrases from notes and map each phrase to a cognitive functioning concept. Concepts derived from the NLP system were used to predict future MCI. The prediction model was trained on the ACT cohort and 60% of the general population cohort with 40% withheld for validation. We used a least absolute shrinkage and selection operator (LASSO) logistic regression approach to fit a prediction model with MCI as the prediction target. Using the predicted case status from the LASSO model and known MCI from standardized scores, we constructed receiver operating characteristic curves to measure model performance.
Results: Chart abstraction identified 42 MCI concepts. Prediction model performance in the validation data set was modest with an area under the curve of 0.67. Setting the cutoff for correct classification at 0.60, the classifier yielded sensitivity of 1.7%, specificity of 99.7%, PPV of 70% and NPV of 70.5% in the validation cohort.
Discussion and conclusion: Although the sensitivity of the machine learning model was poor, negative predictive value was high, an important characteristic of models used for population-based screening. While an AUC of 0.67 is generally considered moderate performance, it is also comparable to several tests that are widely used in clinical practice.
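As a reading aid for the validation metrics in this abstract, here is a minimal sketch of how sensitivity, specificity, PPV and NPV follow from confusion-matrix counts. The function name `screening_metrics` and the counts are hypothetical: the abstract does not report counts, and these were chosen only to give a pattern similar to the one described (very low sensitivity, very high specificity).

```python
def screening_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that are flagged
        "specificity": tn / (tn + fp),  # share of non-cases that are not flagged
        "ppv": tp / (tp + fp),          # share of flagged patients who are true cases
        "npv": tn / (tn + fn),          # share of unflagged patients who are non-cases
    }


# Hypothetical counts for illustration only; they are NOT taken from the study.
m = screening_metrics(tp=7, fp=3, tn=997, fn=417)
for name, value in m.items():
    print(f"{name}: {value:.1%}")
```

With a high classification cutoff, few patients are flagged, so false positives are rare (high specificity) while most true cases are missed (low sensitivity). NPV, unlike sensitivity and specificity, also depends on how common the condition is in the cohort.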

Journal Article

Rare coding variants in the phospholipase D3 gene confer risk for Alzheimer’s disease

by Ryten, Mina; Skorupa, Tara; Sassi, Celeste in 13/106, 13/44, 14/1

2014

Whole-exome sequencing reveals that a rare variant of phospholipase D3 (PLD3(V232M)) segregates with Alzheimer’s disease status in two independent families and doubles risk for the disease in case–control series, and that several other PLD3 variants increase risk for Alzheimer’s disease in African Americans and people of European descent.
New genetic risk variant for Alzheimer's disease: The identification of mutations causing Alzheimer's disease in amyloid-β precursor protein, presenilin 1 and presenilin 2 led to a better understanding of the pathobiology of the condition. Further mutations are expected to be implicated, but the identification of such variants has been challenging. These authors used exome sequencing to identify low-frequency coding variants with large effects on late-onset Alzheimer's disease. They report several coding variants in the gene PLD3, coding for phospholipase D3, that increase disease risk at least twofold. PLD3 may have a role in the processing of amyloid-β and may have potential as a novel therapeutic target.
Genome-wide association studies (GWAS) have identified several risk variants for late-onset Alzheimer's disease (LOAD). These common variants have replicable but small effects on LOAD risk and generally do not have obvious functional effects. Low-frequency coding variants, not detected by GWAS, are predicted to include functional variants with larger effects on risk. To identify low-frequency coding variants with large effects on LOAD risk, we carried out whole-exome sequencing (WES) in 14 large LOAD families and follow-up analyses of the candidate variants in several large LOAD case–control data sets. A rare variant in PLD3 (phospholipase D3; Val232Met) segregated with disease status in two independent families and doubled risk for Alzheimer’s disease in seven independent case–control series with a total of more than 11,000 cases and controls of European descent. Gene-based burden analyses in 4,387 cases and controls of European descent and 302 African American cases and controls, with complete sequence data for PLD3, reveal that several variants in this gene increase risk for Alzheimer’s disease in both populations. PLD3 is highly expressed in brain regions that are vulnerable to Alzheimer’s disease pathology, including hippocampus and cortex, and is expressed at significantly lower levels in neurons from Alzheimer’s disease brains compared to control brains. Overexpression of PLD3 leads to a significant decrease in intracellular amyloid-β precursor protein (APP) and extracellular Aβ42 and Aβ40 (the 42- and 40-residue isoforms of the amyloid-β peptide), and knockdown of PLD3 leads to a significant increase in extracellular Aβ42 and Aβ40. Together, our genetic and functional data indicate that carriers of PLD3 coding variants have a twofold increased risk for LOAD and that PLD3 influences APP processing. This study provides an example of how densely affected families may help to identify rare variants with large effects on risk for disease or other complex traits.

Journal Article

Development and validation of a machine learning model to identify individuals at high risk for psychotic disorders using medical record data

by Ramaprasan, Arvind; Penfold, Robert B.; Durojaiye, Cimone in Calibration, Data integrity, Diagnosis

2026

Background: Reducing the duration of untreated psychosis among individuals with early psychosis is associated with improved clinical outcomes and decreased long-term impairment. However, timely identification of individuals at high risk for psychotic disorders in routine clinical practice is challenging, and many individuals are only identified several years following psychotic-symptom onset. This study aimed to leverage comprehensive electronic medical records to develop and validate a machine learning model to identify individuals at high risk of conversion to a psychotic-spectrum disorder (PSD).
Methods: This was a cross-sectional, retrospective analysis of electronic health record (EHR) data consisting of clinician free-text documentation and structured data (i.e., age, sex, race/ethnicity, psychiatric diagnoses, encounter modality, and department) among 406,268 Kaiser Permanente Northern California members aged 15–29 years with ≥1 primary-care encounter between 2017 and 2019 (~1,694,531 encounters). Patients with a new-onset PSD were distinguished from those without a diagnosis if they had ≥1 PSD diagnosis within 12 months following the index primary care encounter. The prediction models were developed using cross-validation with the gradient boosting and elastic net algorithms on features extracted from notes, and validated in a random test set.
Results: A gradient-boosting model including text features yielded the highest area under the curve (AUC 0.827 [95% CI: 0.799 to 0.853]), outperforming an elastic-net model (AUC 0.791 [95% CI: 0.760 to 0.821]) and a gradient-boosting model that incorporated only discrete variables (AUC 0.610 [95% CI: 0.595 to 0.626]). Model performance was similar across subgroups by sex, age, and race/ethnicity. However, all models exhibited suboptimal calibration, with predicted probabilities systematically underestimating observed PSD risk. Increasing the ratio of PSD cases to non-cases improved discrimination, but worsened calibration. Further, predicted probabilities of developing a PSD compressed with imbalance, causing abrupt metric drops at higher thresholds.
Conclusions: This study suggests that individuals at elevated risk of developing a PSD may be identified from a general clinical population using a machine-learning model trained on routine clinical documentation and structured EHR data. However, the low incidence of PSDs led to suboptimal calibration. Future studies may restrict prediction to populations with higher PSD incidence, such as mental health clinics, to improve model calibration.
Clinical trial number: Not applicable. Trial registration: Not applicable.

Journal Article
