Catalogue Search | MBRL

Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies

by Dey, Rounak , Fritsche, Lars G. , Wolford, Brooke N. in 45/43 , 631/208/205/2138 , 639/705/531

2018

In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly; they produce large type I error rates when used to analyze unbalanced case-control phenotypes. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation to calibrate the distribution of score test statistics. This method, SAIGE (Scalable and Accurate Implementation of GEneralized mixed model), provides accurate P values even when case-control ratios are extremely unbalanced. SAIGE uses state-of-the-art optimization strategies to reduce computational costs; hence, it is applicable to GWAS for thousands of phenotypes by large biobanks. Through the analysis of UK Biobank data of 408,961 samples from white British participants with European ancestry for >1,400 binary phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness. SAIGE (Scalable and Accurate Implementation of GEneralized mixed model) is a generalized mixed model association test that can efficiently analyze large data sets while controlling for unbalanced case-control ratios and sample relatedness, as shown by applying SAIGE to the UK Biobank data for >1,400 binary phenotypes.

Journal Article

Construction of habitat-specific training sets to achieve species-level assignment in 16S rRNA gene datasets

by Chen, Tsute , Kokaras, Alexis , Huang, Yanmei in 16S rRNA gene , Algorithms , Bacteria - genetics

2020

Background: The low cost of 16S rRNA gene sequencing facilitates population-scale molecular epidemiological studies. Existing computational algorithms can resolve 16S rRNA gene sequences into high-resolution amplicon sequence variants (ASVs), which represent consistent labels comparable across studies. Assigning these ASVs to species-level taxonomy strengthens the ecological and/or clinical relevance of 16S rRNA gene-based microbiota studies and further facilitates data comparison across studies. Results: To achieve this, we developed a broadly applicable method for constructing high-resolution training sets based on the phylogenic relationships among microbes found in a habitat of interest. When used with the naïve Bayesian Ribosomal Database Project (RDP) Classifier, this training set achieved species/supraspecies-level taxonomic assignment of 16S rRNA gene-derived ASVs. The key steps for generating such a training set are (1) constructing an accurate and comprehensive phylogenetic-based, habitat-specific database; (2) compiling multiple 16S rRNA gene sequences to represent the natural sequence variability of each taxon in the database; (3) trimming the training set to match the sequenced regions, if necessary; and (4) placing species sharing closely related sequences into a training-set-specific supraspecies taxonomic level to preserve subgenus-level resolution. As proof of principle, we developed a V1–V3 region training set for the bacterial microbiota of the human aerodigestive tract using the full-length 16S rRNA gene reference sequences compiled in our expanded Human Oral Microbiome Database (eHOMD). We also overcame technical limitations to successfully use Illumina sequences for the 16S rRNA gene V1–V3 region, the most informative segment for classifying bacteria native to the human aerodigestive tract. Finally, we generated a full-length eHOMD 16S rRNA gene training set, which we used in conjunction with an independent PacBio single molecule, real-time (SMRT)-sequenced sinonasal dataset to validate the representation of species in our training set. This also established the effectiveness of a full-length training set for assigning taxonomy of long-read 16S rRNA gene datasets. Conclusion: Here, we present a systematic approach for constructing a phylogeny-based, high-resolution, habitat-specific training set that permits species/supraspecies-level taxonomic assignment to short- and long-read 16S rRNA gene-derived ASVs. This advancement enhances the ecological and/or clinical relevance of 16S rRNA gene-based microbiota studies.

Journal Article

APOBEC Mutagenesis Is Concordant between Tumor and Viral Genomes in HPV-Positive Head and Neck Squamous Cell Carcinoma

by Bass, Sara , Langenbucher, Adam , Ferris, Robert L. in Adult , Aged , alleles

2021

APOBEC is a mutagenic source in human papillomavirus (HPV)-mediated malignancies, including HPV+ oropharyngeal squamous cell carcinoma (HPV+ OPSCC), and in HPV genomes. It is unknown why APOBEC mutations predominate in HPV+ OPSCC, or if the APOBEC-induced mutations observed in both human cancers and HPV genomes are directly linked. We performed sequencing of host somatic exomes, transcriptomes, and HPV16 genomes from 79 HPV+ OPSCC samples, quantifying APOBEC mutational burden and activity in both host and virus. APOBEC was the dominant mutational signature in somatic exomes. In viral genomes, there was a mean of five (range 0–29) mutations per genome. The mean of APOBEC mutations in viral genomes was one (range 0–5). Viral APOBEC mutations, compared to non-APOBEC mutations, were more likely to be low-variant allele fraction mutations, suggesting that APOBEC mutagenesis actively occurs in viral genomes during infection. HPV16 APOBEC-induced mutation patterns in OPSCC were similar to those previously observed in cervical samples. Paired host and viral analyses revealed that APOBEC-enriched tumor samples had higher viral APOBEC mutation rates (p = 0.028) and APOBEC-associated RNA editing (p = 0.008), supporting the concept that APOBEC mutagenesis in host and viral genomes is directly linked and occurs during infection. Using paired sequencing of host somatic exomes, transcriptomes, and viral genomes, we demonstrated, for the first time, definitive evidence of concordance between tumor and viral APOBEC mutagenesis. This finding provides a missing link connecting APOBEC mutagenesis in host and virus and supports a common mechanism driving APOBEC dysregulation.

Journal Article

Single-cell transcriptomic profiling for inferring tumor origin and mechanisms of therapeutic resistance

by Sade-Feldman, Moshe , Faden, Daniel L. , Lin, Maoxuan in 631/67/1536/1665 , 631/67/69 , 692/308/575

2022

Head and Neck Squamous Cell Carcinoma (HNSCC) is an aggressive epithelial cancer with poor overall response rates to checkpoint inhibitor therapy (CPI) despite CPI being the recommended treatment for recurrent or metastatic HNSCC. Mechanisms of resistance to CPI in HNSCC are poorly understood. To identify drivers of response and resistance to CPI in a unique patient who was believed to have developed three separate HNSCCs, we performed single-cell RNA-seq (scRNA-seq) profiling of two responding lesions and one progressive lesion that developed during CPI. Our results not only suggest interferon-induced APOBEC3-mediated acquired resistance as a mechanism of CPI resistance in the progressing lesion but further, that the lesion in question was actually a metastasis as opposed to a new primary tumor, highlighting the immense power of scRNA-seq as a clinical tool for inferring tumor origin and mechanisms of therapeutic resistance.

Journal Article

Effects of short indels on protein structure and function in human genomes

by Guo, Jun-tao , Farrel, Alvin , Shi, Xinghua in 631/114/2785 , 631/535 , Biological Variation, Population

2017

Insertions and deletions (indels) represent the second most common type of genetic variation in human genomes. Indels can be deleterious and contribute to disease susceptibility, as recent genome sequencing projects have revealed a large number of indels in various cancer types. In this study, we investigated the possible effects of small coding indels on protein structure and function, and the baseline characteristics of indels in 2504 individuals of 26 populations from the 1000 Genomes Project. We found that each population has a distinct pattern in genes with small indels. Frameshift (FS) indels are enriched in olfactory receptor activity while non-frameshift (NFS) indels are enriched in transcription-related proteins. Structural analysis of NFS indels revealed that they predominantly adopt coil or disordered conformations, especially in proteins with transcription-related NFS indels. These results suggest that the annotated coding indels from the 1000 Genomes Project, while contributing to genetic variations and phenotypic diversity, generally do not affect the core protein structures and have no deleterious effect on essential biological processes. In addition, we found that a number of reference genome annotations might need to be updated due to the high prevalence of annotated homozygous indels in the general population.

Journal Article

Genome-wide analysis yields new loci associating with aortic valve stenosis

by Brummett, Chad M. , Thorleifsson, Gudmar , Folkersen, Lasse in 38/77 , 45/23 , 45/43

2018

Aortic valve stenosis (AS) is the most common valvular heart disease, and valve replacement is the only definitive treatment. Here we report a large genome-wide association (GWA) study of 2,457 Icelandic AS cases and 349,342 controls with a follow-up in up to 4,850 cases and 451,731 controls of European ancestry. We identify two new AS loci, on chromosome 1p21 near PALMD (rs7543130; odds ratio (OR) = 1.20, P = 1.2 × 10⁻²²) and on chromosome 2q22 in TEX41 (rs1830321; OR = 1.15, P = 1.8 × 10⁻¹³). The variant rs7543130 also associates with bicuspid aortic valve (BAV) (OR = 1.28, P = 6.6 × 10⁻¹⁰) and aortic root diameter (P = 1.30 × 10⁻⁸), and rs1830321 associates with BAV (OR = 1.12, P = 5.3 × 10⁻³) and coronary artery disease (CAD) (OR = 1.05, P = 9.3 × 10⁻⁵). The results implicate both cardiac developmental abnormalities and atherosclerosis-like processes in the pathogenesis of AS. We show that several pathways are shared by CAD and AS. Causal analysis suggests that the shared risk factors of Lp(a) and non-high-density lipoprotein cholesterol contribute substantially to the frequent co-occurrence of these diseases. Aortic valve stenosis (AS) is the most common valvular heart disease. Here the authors identify two new AS loci that also associate with bicuspid aortic valve, aortic root diameter and/or coronary artery disease, implicating both developmental abnormalities and atherosclerosis-like processes in AS.

Journal Article

Protein-altering and regulatory genetic variants near GATA4 implicated in bicuspid aortic valve

by Levasseur, Alexandra , Brummett, Chad M. , Folkersen, Lasse in 13/89 , 45/43 , 631/208/205/2138

2017

Bicuspid aortic valve (BAV) is a heritable congenital heart defect and an important risk factor for valvulopathy and aortopathy. Here we report a genome-wide association scan of 466 BAV cases and 4,660 age, sex and ethnicity-matched controls with replication in up to 1,326 cases and 8,103 controls. We identify association with a noncoding variant 151 kb from the gene encoding the cardiac-specific transcription factor, GATA4, and near-significance for p.Ser377Gly in GATA4. GATA4 was interrupted by CRISPR-Cas9 in induced pluripotent stem cells from healthy donors. The disruption of GATA4 significantly impaired the transition from endothelial cells into mesenchymal cells, a critical step in heart valve development. Bicuspid aortic valve (BAV) is the most common human congenital cardiovascular malformation. Here, the authors perform a genome-wide association study for BAV and identify risk variants in the gene region of cardiac-specific transcription factor GATA4 and implicate GATA4 in heart valve development.

Journal Article

Toward Understanding Protein-DNA Interactions

by Lin, Maoxuan in Biochemistry , Bioinformatics , Biology

2020

Knowledge of protein-DNA interactions has important implications in understanding biological activities and developing therapeutic drugs. Two types of protein-DNA interactions exist: (1) interactions between double-stranded DNA-binding proteins (DSBs) and double-stranded DNA (dsDNA), and (2) those between single-stranded DNA-binding proteins (SSBs) and single-stranded DNA (ssDNA). DSB-dsDNA interactions have been extensively studied but are still not completely understood. In contrast, less attention has been paid to SSB-ssDNA interactions. To expand our knowledge of DSB-dsDNA interactions, we investigated the roles of individual DNA strands and protein secondary structure types in specific DSB-dsDNA recognition based on side chain-base hydrogen bonds. By comparing the contribution of each DNA strand to the overall binding specificity, we found that highly specific DSBs show balanced hydrogen bonding with each of the two DNA strands, while multispecific DSBs are generally biased towards one strand. In addition, amino acids involved in side chain-base hydrogen bonds in these two groups of proteins favor different secondary structure types. To advance our understanding of SSB-ssDNA interactions, we performed a comparative structural analysis on known SSB-ssDNA complex structures. Structural features such as DNA binding propensities and secondary structure types of amino acids involved in SSB-ssDNA interactions, protein-DNA contact area, residue-base contacts, protein-ssDNA hydrogen bonding and π-π interactions, were analyzed and compared between specific and non-specific ssDNA-binding proteins. Our results suggest that side chain-base hydrogen bonds play major roles in protein-ssDNA binding specificity, while protein-ssDNA π-π interactions may contribute to binding affinity. In addition, bound and unbound conformations of the same ssDNA-binding domains were compared to investigate the conformational changes upon ssDNA binding, and the results indicate that conformational changes of ssDNA-binding proteins might not be a major contributor in conferring binding specificity. These studies provide new insights into the mechanisms of specific protein-DNA interactions and can help therapeutic drug design.

Dissertation

A comparative study of protein–ssDNA interactions

by Guo, Jun-tao , Lin, Maoxuan , Malik, Fareeha K in DNA biosynthesis , DNA repair , DNA-binding protein

2021

Single-stranded DNA-binding proteins (SSBs) play crucial roles in DNA replication, recombination and repair, and serve as key players in the maintenance of genomic stability. While a number of SSBs bind single-stranded DNA (ssDNA) non-specifically, the others recognize and bind specific ssDNA sequences. The mechanisms underlying this binding discrepancy, however, are largely unknown. Here, we present a comparative study of protein–ssDNA interactions by annotating specific and non-specific SSBs and comparing structural features such as DNA-binding propensities and secondary structure types of residues in SSB–ssDNA interactions, protein–ssDNA hydrogen bonding and π–π interactions between specific and non-specific SSBs. Our results suggest that protein side chain-DNA base hydrogen bonds are the major contributors to protein–ssDNA binding specificity, while π–π interactions may mainly contribute to binding affinity. We also found an enrichment of aspartate in the specific SSBs, a key feature in specific protein–double-stranded DNA (dsDNA) interactions as reported in our previous study. In addition, no significant differences between specific and non-specific groups with respect to conformational changes upon ssDNA binding were found, suggesting that the flexibility of SSBs plays a lesser role than that of dsDNA-binding proteins in conferring binding specificity.

Journal Article

Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies

by Vandehaar, Peter , Dey, Rounak , Nielsen, Jonas B in Computer applications , Genome-wide association studies , Genomes

2018

In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly, producing large type I error rates, in the analysis of phenotypes with unbalanced case-control ratios. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation (SPA) to calibrate the distribution of score test statistics. This method, SAIGE, provides accurate p-values even when case-control ratios are extremely unbalanced. It utilizes state-of-the-art optimization strategies to reduce the computational time and memory cost of the generalized mixed model. The computation cost depends linearly on sample size, and hence the method is applicable to GWAS for thousands of phenotypes by large biobanks. Through the analysis of UK Biobank data of 408,961 white British European-ancestry samples, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness. Footnotes: Numerical stability and convergence for the numerical and asymptotic approximations that we use to achieve the computational scalability have been evaluated and the details are now added to the supplementary material. We have added a more detailed derivation of the algorithm and a discussion on the heritability estimation.

Paper
