Catalogue Search | MBRL

Do little interactions get lost in dark random forests?

by König, Inke R. , Ziegler, Andreas , Wright, Marvin N. in Algorithms , Bioinformatics , Biomedical and Life Sciences

2016

Background: Random forests have often been claimed to uncover interaction effects. However, if and how interaction effects can be differentiated from marginal effects remains unclear. In extensive simulation studies, we investigate whether random forest variable importance measures capture or detect gene-gene interactions. We define capturing an interaction as the ability to identify a variable that acts through an interaction with another one, and detecting an interaction as the ability to identify an interaction effect as such. Results: Of the single importance measures, the Gini importance captured interaction effects in most of the simulated scenarios; however, they were masked by marginal effects in other variables. With the permutation importance, the proportion of captured interactions was lower in all cases. Pairwise importance measures performed about equally, with a slight advantage for the joint variable importance method. However, the overall fraction of detected interactions was low. In almost all scenarios, the detection fraction in a model with only marginal effects was larger than in a model with an interaction effect only. Conclusions: Random forests are generally capable of capturing gene-gene interactions, but current variable importance measures are unable to detect them as interactions. In most cases, interactions are masked by marginal effects and cannot be differentiated from them. Consequently, caution is warranted when claiming that random forests uncover interactions.
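
The distinction between a marginal effect and an interaction effect can be illustrated with a small sketch. The toy model below is an assumption made for illustration only (an XOR-like outcome on two binary variants with equally frequent genotype combinations); it is not one of the simulation scenarios of the study. It shows a variable that has no marginal effect at all, yet a strong effect within each stratum of the other variable.

```python
from itertools import product

# Hypothetical toy model: the outcome is 1 only when exactly one of two
# binary variants is present (a pure interaction, XOR-like).
def outcome(a: int, b: int) -> int:
    return a ^ b

# All four genotype combinations, assumed to be equally frequent.
cells = list(product([0, 1], repeat=2))

def mean_outcome(condition) -> float:
    """Mean outcome over the genotype cells that satisfy `condition`."""
    selected = [outcome(a, b) for a, b in cells if condition(a, b)]
    return sum(selected) / len(selected)

# Marginal effect of A: difference in mean outcome between A=1 and A=0,
# averaged over B.
marginal_a = mean_outcome(lambda a, b: a == 1) - mean_outcome(lambda a, b: a == 0)

# Effect of A within each stratum of B.
effect_a_given_b0 = (mean_outcome(lambda a, b: a == 1 and b == 0)
                     - mean_outcome(lambda a, b: a == 0 and b == 0))
effect_a_given_b1 = (mean_outcome(lambda a, b: a == 1 and b == 1)
                     - mean_outcome(lambda a, b: a == 0 and b == 1))

print(marginal_a, effect_a_given_b0, effect_a_given_b1)
```

In this toy setting the stratum-specific effects have opposite signs and cancel in the margin, which is why a measure that only looks at one variable at a time cannot tell by itself whether a signal is marginal or arises from an interaction.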

Journal Article

Risk estimation and risk prediction using machine-learning methods

by König, Inke R. , Ziegler, Andreas , Kruppa, Jochen in Algorithms , Arthritis, Rheumatoid - genetics , Artificial Intelligence

2012

After an association between genetic variants and a phenotype has been established, further study goals comprise the classification of patients according to disease risk or the estimation of disease probability. To accomplish this, different statistical methods are required, and specifically machine-learning approaches may offer advantages over classical techniques. In this paper, we describe methods for the construction and evaluation of classification and probability estimation rules. We review the use of machine-learning approaches in this context and explain some of the machine-learning algorithms in detail. Finally, we illustrate the methodology through application to a genome-wide association analysis on rheumatoid arthritis.

Journal Article

Empowering individual trait prediction using interactions for precision medicine

by König, Inke R. , Gola, Damian in Algorithms , Arthritis , Autoimmune diseases

2021

Background: One component of precision medicine is to construct prediction models with predictive ability as high as possible, e.g. to enable individual risk prediction. In genetic epidemiology, complex diseases like coronary artery disease, rheumatoid arthritis, and type 2 diabetes have a polygenic basis, and a common assumption is that biological and genetic features affect the outcome under consideration via interactions. In the case of omics data, the use of standard approaches such as generalized linear models may be suboptimal, and machine learning methods are appealing for making individual predictions. However, most of these algorithms focus on main or marginal effects of the single features in a dataset. On the other hand, the detection of interacting features is an active area of research in genetic epidemiology. One big class of algorithms to detect interacting features is based on multifactor dimensionality reduction (MDR). Here, we further develop the model-based MDR (MB-MDR), a powerful extension of the original MDR algorithm, to enable interaction-empowered individual prediction. Results: Using a comprehensive simulation study, we show that our new algorithm (median AUC: 0.66) can use information hidden in interactions and outperforms two other state-of-the-art algorithms, namely the Random Forest (median AUC: 0.54) and Elastic Net (median AUC: 0.50), if interactions are present in a scenario of two pairs of two features having small effects. The performance of these algorithms is comparable if no interactions are present. Further, we show that our new algorithm is applicable to real data by comparing the performance of the three algorithms on a dataset of rheumatoid arthritis cases and healthy controls. As our new algorithm is applicable not only to biological/genetic data but to all datasets with discrete features, it may have practical implications in other research fields where interactions between features have to be considered as well, and we made our method available as an R package ( https://github.com/imbs-hl/MBMDRClassifieR ). Conclusions: The explicit use of interactions between features can improve the prediction performance and thus should be included in further attempts to move precision medicine forward.
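
For readers unfamiliar with the AUC values quoted above, the following minimal sketch shows one common way to compute an AUC: as the fraction of case/control pairs in which the case receives the higher predicted score, with ties counted as one half. The scores are hypothetical and unrelated to the study's data, and the function name `auc` is chosen here for illustration.

```python
from itertools import product

def auc(case_scores, control_scores):
    """Probability that a random case scores higher than a random control.

    Ties contribute one half, so an uninformative score yields 0.5.
    """
    wins = 0.0
    for case, control in product(case_scores, control_scores):
        if case > control:
            wins += 1.0
        elif case == control:
            wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical predicted risks for three cases and three controls.
cases = [0.9, 0.6, 0.4]
controls = [0.7, 0.3, 0.2]
toy_auc = auc(cases, controls)

# A predictor that assigns everyone the same score carries no information.
uninformative_auc = auc([0.5, 0.5], [0.5, 0.5])

print(round(toy_auc, 3), uninformative_auc)
```

Under this reading, an AUC of 0.50 corresponds to a prediction no better than chance, which puts the reported medians for the three algorithms into perspective.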

Journal Article

Microbiota-based analysis reveals specific bacterial traits and a novel strategy for the diagnosis of infectious infertility

by König, Inke R. , Hoellen, Friederike , Lettau, Reinhard in Antibodies , Bacteria - genetics , Bacteria - isolation & purification

2018

Tubal factor infertility (TFI) accounts for more than 30% of the cases of female infertility and mostly results from an inflammatory process triggered by an infection. Clinical appearances differ widely, and very often infections are not recognized or remain completely asymptomatic over time. Here, we characterized the microbial pattern in females diagnosed with infectious infertility (ININF) in comparison to females with non-infectious infertility (nININF), female sex workers (FSW) and healthy controls (fertile). Females diagnosed with infectious infertility differed significantly in the seroprevalence of IgG antibodies against the C. trachomatis proteins MOMP, OMP2, CPAF and HSP60 when compared to fertile females. Microbiota analysis using 16S amplicon sequencing of cervical swabs revealed significant differences between ININF and fertile controls in the relative read count of Gardnerella (10.08% vs. 5.43%). Alpha diversity varies among groups, which are characterized by community state types including Lactobacillus-dominated communities in fertile females, an increase in diversity in all the other groups and Gardnerella-dominated communities occurring more often in ININF. While no single parameter allowed predicting infection as the cause of infertility, combining C. trachomatis IgG/IgA status with 16S rRNA gene analysis of the ten most frequent taxa correctly classified a total of 93.8% of the females. Further studies are needed to unravel the impact of the cervical microbiota in the pathogenesis of infectious infertility and its potential for identifying females at risk earlier in life.

Journal Article

Lifestyle factors and clinical severity of Parkinson’s disease

by König, Inke R. , Gabbert, Carolin , Lüth, Theresa in 692/1807 , 692/499 , 692/699/375/1718

2023

Genetic factors, environmental factors, and gene–environment interactions have been found to modify PD risk, age at onset (AAO), and disease progression. The objective of this study was to explore the association of coffee drinking, aspirin intake, and smoking, with motor and non-motor symptoms in a cohort of 35,959 American patients with PD from the Fox Insight Study using generalized linear models. Coffee drinkers had fewer problems swallowing but dosage and duration of coffee intake were not associated with motor or non-motor symptoms. Aspirin intake correlated with more tremor (p = 0.0026), problems getting up (p = 0.0185), light-headedness (p = 0.0043), and problems remembering (p = 1 × 10⁻⁵). Smoking was directly associated with symptoms: smokers had more problems with drooling (p = 0.0106), swallowing (p = 0.0002), and freezing (p < 1 × 10⁻⁵). Additionally, smokers had more possibly mood-related symptoms: unexplained pains (p < 1 × 10⁻⁵), problems remembering (p = 0.0001), and feeling sad (p < 1 × 10⁻⁵). Confirmatory and longitudinal studies are warranted to investigate the clinical correlation over time.

Journal Article

The combined effect of lifestyle factors and polygenic scores on age at onset in Parkinson’s disease

by König, Inke R. , Gabbert, Carolin , Caliebe, Amke in 631/208/727 , 692/499 , 692/699/375/1718

2024

The objective of this study was to investigate the association between a Parkinson’s disease (PD)-specific polygenic score (PGS) and protective lifestyle factors on age at onset (AAO) in PD. We included data from 4367 patients with idiopathic PD, 159 patients with GBA1-PD, and 3090 healthy controls of European ancestry from AMP-PD, PPMI, and Fox Insight cohorts. The association between PGS and lifestyle factors on AAO was assessed with linear and Cox proportional hazards models. The PGS showed a negative association with AAO (β = −1.07, p = 6 × 10⁻⁷) in patients with idiopathic PD. The use of one, two, or three of the protective lifestyle factors showed a reduction in the hazard ratio by 21% (p = 0.0001), 44% (p < 2 × 10⁻¹⁶), and 55% (p < 2 × 10⁻¹⁶), compared to no use. An additive effect of aspirin (β = 7.62, p = 9 × 10⁻⁷) and PGS (β = −1.58, p = 0.0149) was found for AAO without an interaction (p = 0.9993) in the linear regressions, and similar effects were seen for tobacco. In contrast, no association between aspirin intake and AAO was found in GBA1-PD (p > 0.05). In our cohort, coffee, tobacco, aspirin, and PGS are independent predictors of PD AAO. Additionally, lifestyle factors seem to have a greater influence on AAO than common genetic risk variants, with aspirin presenting the largest effect.

Journal Article

Metamizole and the risk of drug-induced agranulocytosis and neutropenia in statutory health insurance data

by Klose, Sebastian , Pflock, René , Schwaninger, Markus in Analgesics , Chemotherapy , Chronic pain

2020

The non-opioid analgesic metamizole (dipyrone) is used for the treatment of acute and chronic pain and fever. Agranulocytosis is known as a serious adverse drug reaction of metamizole with potentially fatal outcome. However, its frequency is a matter of debate. The aim of our study was to determine the risk of metamizole-associated agranulocytosis and neutropenia using statutory health insurance data. We analyzed data from a large German health insurance fund in the period from 2010 to 2013. Metamizole-exposed subjects were identified and compared to a propensity score-matched control cohort. A total of 630,285 metamizole-treated subjects and 390,830 matched control subjects were included. In the metamizole cohort, ICD codes for agranulocytosis and neutropenia appeared more often than in non-users. The relative risk for drug-induced agranulocytosis and neutropenia (D70.1) was 3.03 (95% confidence interval, 2.49 to 3.69). The risk for developing drug-induced agranulocytosis and neutropenia after metamizole prescription was 1:1602 (95% confidence interval, 1:1926 to 1:1371). Our results confirm the risk estimation of previous studies. However, the outcome of our study may be confounded by an association of metamizole treatment and chemotherapy. Therefore, conclusions for treatment have to be drawn with care.

Journal Article

Identification of representative trees in random forests based on a new tree-based distance measure

by König, Inke R. , Westenberger, Ana , Laabs, Björn-Hergen in Algorithms , Chemistry and Earth Sciences , Classification

2024

In life sciences, random forests are often used to train predictive models. However, gaining any explanatory insight into the mechanics leading to a specific outcome is rather complex, which impedes the implementation of random forests into clinical practice. By simplifying a complex ensemble of decision trees to a single most representative tree, it is assumed to be possible to observe common tree structures, the importance of specific features and variable interactions. Thus, representative trees could also help to understand interactions between genetic variants. Intuitively, representative trees are those with the minimal distance to all other trees, which requires a proper definition of the distance between two trees. Thus, we developed a new tree-based distance measure, which incorporates more of the underlying tree structure than other metrics. We compared our new method with the existing metrics in an extensive simulation study and applied it to predict the age at onset based on a set of genetic risk factors in a clinical data set. In our simulation study we were able to show the advantages of our weighted splitting variable approach. Our real data application revealed that representative trees are not only able to replicate the results from a recent genome-wide association study, but also can give additional explanations of the genetic mechanisms. Finally, we implemented all compared distance measures in R and made them publicly available in the R package timbR ( https://github.com/imbs-hl/timbR ).
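
The selection step described above, choosing the tree with minimal distance to all other trees, can be sketched as follows. This is a minimal illustration under stated assumptions: the pairwise distance matrix is made up, the function name `most_representative` is hypothetical, and the sketch neither implements the tree-based distance measure of the paper nor uses the timbR package.

```python
from typing import Sequence

def most_representative(distances: Sequence[Sequence[float]]) -> int:
    """Return the index of the tree with the smallest summed distance
    to all other trees (the medoid of the forest under this distance)."""
    totals = [sum(row) for row in distances]
    return min(range(len(totals)), key=totals.__getitem__)

# Hypothetical symmetric distance matrix for a forest of four trees;
# entry [i][j] is an assumed distance between tree i and tree j.
toy_distances = [
    [0.0, 0.2, 0.9, 0.4],
    [0.2, 0.0, 0.7, 0.3],
    [0.9, 0.7, 0.0, 0.8],
    [0.4, 0.3, 0.8, 0.0],
]

representative = most_representative(toy_distances)
print(representative)
```

Which tree is selected depends entirely on the distance measure that fills the matrix, which is why the definition of that distance is the central question of the article.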

Journal Article

A survey of genome-wide association studies, polygenic scores and UK Biobank highlights resources for autoimmune disease genetics

by König, Inke R. , Wohlers, Inken , Saurabh, Rochi in Ankylosing spondylitis , Autoantigens , autoimmune disease

2022

Autoimmune diseases share a general mechanism of auto-antigens harming tissues. Still, they are phenotypically diverse, with genetic as well as environmental factors contributing to their etiology to varying degrees. Associated genomic loci and variants have been identified in numerous genome-wide association studies (GWAS), whose results are increasingly used for polygenic scores (PGS) that are used to predict disease risk. At the same time, a technological shift from genotyping arrays to next generation sequencing (NGS) is ongoing. NGS allows the identification of virtually all genetic variants, including rare ones, which in combination with methodological developments promises to improve the prediction of disease risk and elucidate molecular mechanisms underlying disease. Here we review current, publicly available autoimmune disease GWAS and PGS data based on information from the GWAS and PGS catalog, respectively. We summarize autoimmune diseases investigated, respective studies conducted and their results. Further, we review genetic data and autoimmune disease patients in the UK Biobank (UKB), the largest resource for genetic and phenotypic data available for academic research. We find that only comparably prevalent autoimmune diseases are covered by the UKB and at the same time assessed by both GWAS and PGS catalogs. These are systemic (systemic lupus erythematosus) as well as organ-specific, affecting the gastrointestinal tract (inflammatory bowel disease as well as specifically Crohn’s disease and ulcerative colitis), joints (juvenile idiopathic arthritis, psoriatic arthritis, rheumatoid arthritis, ankylosing spondylitis), glands (Sjögren syndrome), the nervous system (multiple sclerosis), and the skin (vitiligo).

Journal Article

Splitting on categorical predictors in random forests

by König, Inke R. , Wright, Marvin N. in Artificial intelligence , Categorical predictors , Classification

2019

One reason for the widespread success of random forests (RFs) is their ability to analyze most datasets without preprocessing. For example, in contrast to many other statistical methods and machine learning approaches, no recoding such as dummy coding is required to handle ordinal and nominal predictors. The standard approach for nominal predictors is to consider all 2^(k−1) − 1 2-partitions of the k predictor categories. However, this exponential relationship produces a large number of potential splits to be evaluated, increasing computational complexity and restricting the possible number of categories in most implementations. For binary classification and regression, it was shown that ordering the predictor categories in each split leads to exactly the same splits as the standard approach. This reduces computational complexity because only k − 1 splits have to be considered for a nominal predictor with k categories. For multiclass classification and survival prediction, no ordering method producing equivalent splits exists. We therefore propose to use a heuristic which orders the categories according to the first principal component of the weighted covariance matrix in multiclass classification and by log-rank scores in survival prediction. This ordering of categories can be done either in every split or a priori, that is, just once before growing the forest. With this approach, the nominal predictor can be treated as ordinal in the entire RF procedure, speeding up the computation and avoiding category limits. We compare the proposed methods with the standard approach, dummy coding and simply ignoring the nominal nature of the predictors in several simulation settings and on real data in terms of prediction performance and computational efficiency. We show that ordering the categories a priori is at least as good as the standard approach of considering all 2-partitions in all datasets considered, while being computationally faster. We recommend using this approach as the default in RFs.
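
The difference between the 2^(k−1) − 1 exhaustive 2-partitions and the k − 1 ordered splits can be made concrete with a small regression sketch. It assumes that categories are ordered by their mean response, which is the classical choice for regression but is not spelled out in the abstract; the data and all function names are hypothetical, and this is not the authors' implementation.

```python
from itertools import combinations

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_cost(data, left_categories):
    """Total within-node SSE when `left_categories` go to the left child."""
    left = [y for cat, y in data if cat in left_categories]
    right = [y for cat, y in data if cat not in left_categories]
    return sse(left) + sse(right)

def all_two_partitions(categories):
    """Exhaustive approach: every split of the categories into two
    non-empty groups, with the first category fixed on the left so
    that mirror images are not counted twice."""
    first, rest = categories[0], categories[1:]
    for size in range(len(rest) + 1):
        for combo in combinations(rest, size):
            left = frozenset((first,) + combo)
            if len(left) < len(categories):
                yield left

def ordered_splits(data, categories):
    """Ordering approach: sort categories by mean response (assumed
    ordering criterion) and cut between neighbours only."""
    means = {c: sum(y for cat, y in data if cat == c)
                / sum(1 for cat, _ in data if cat == c)
             for c in categories}
    ordered = sorted(categories, key=means.get)
    for cut in range(1, len(ordered)):
        yield frozenset(ordered[:cut])

# Hypothetical data: (category, response) pairs with k = 4 categories.
data = [("a", 1.0), ("a", 1.2), ("b", 5.0), ("b", 5.5),
        ("c", 1.1), ("c", 0.9), ("d", 4.8), ("d", 5.2)]
categories = ["a", "b", "c", "d"]

exhaustive = list(all_two_partitions(categories))
ordered = list(ordered_splits(data, categories))
best_exhaustive = min(split_cost(data, s) for s in exhaustive)
best_ordered = min(split_cost(data, s) for s in ordered)

print(len(exhaustive), len(ordered))
print(abs(best_exhaustive - best_ordered) < 1e-12)
```

On this toy data the ordered search evaluates three candidate splits instead of seven and still reaches the same minimal cost; the gap between the two counts grows exponentially with the number of categories.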

Journal Article
