Catalogue Search | MBRL

CADD-Splice—improving genome-wide variant effect prediction using deep learning-derived splice scores

by Kircher, Martin; Shendure, Jay; Rentzsch, Philipp in Alternative splicing, Analysis, Base Sequence

2021

Background: Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies. Methods: It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants. Results: We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance. Conclusions: While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.

Journal Article

A systematic evaluation of the design and context dependencies of massively parallel reporter assays

by Kircher, Martin; Inoue, Fumitaka; Shendure, Jay in 631/208/200, 631/208/212, 631/208/212/2019

2020

Massively parallel reporter assays (MPRAs) functionally screen thousands of sequences for regulatory activity in parallel. To date, few studies have systematically compared differences in MPRA design. Here, we screen a library of 2,440 candidate liver enhancers and controls for regulatory activity in HepG2 cells using nine different MPRA designs. We identify subtle but significant differences that correlate with epigenetic and sequence-level features, as well as differences in dynamic range and reproducibility. We also validate that enhancer activity is largely independent of orientation, at least for our library and designs. Finally, we assemble and test the same enhancers as 192-mers, 354-mers and 678-mers and observe sizable differences. This work provides a framework for the experimental design of high-throughput reporter assays, suggesting that the extended sequence context of tested elements, and to a lesser degree the precise assay, influence MPRA results. Massively parallel reporter assays (MPRAs) enable high-throughput assessments of regulatory elements in single experiments. This work compares nine MPRA designs and reports how differences in reporter assays influence the results of MPRAs.

Journal Article

Addressing challenges in the production and analysis of illumina sequencing data

by Heyn, Patricia; Kircher, Martin; Kelso, Janet in Animal Genetics and Genomics, Biomedical and Life Sciences, Correspondence

2011

Advances in DNA sequencing technologies have made it possible to generate large amounts of sequence data very rapidly and at substantially lower cost than capillary sequencing. These new technologies have specific characteristics and limitations that must either be considered during project design or be addressed during data analysis. Specialist skills, both at the laboratory and the computational stages of project design and analysis, are crucial to the generation of high quality data from these new platforms. The Illumina sequencers (including the Genome Analyzers I/II/IIe/IIx and the new HiScan and HiSeq) represent a widely used platform providing parallel readout of several hundred million immobilized sequences using fluorescent-dye reversible-terminator chemistry. Sequencing library quality, sample handling, instrument settings and sequencing chemistry have a strong impact on sequencing run quality. The presence of adapter chimeras and adapter sequences at the end of short-insert molecules, as well as increased error rates and short read lengths, complicate many computational analyses. We discuss here some of the factors that influence the frequency and severity of these problems and provide solutions for circumventing them. Further, we present a set of general principles for good analysis practice that enable problems with sequencing runs to be identified and dealt with.

Journal Article

Characterization of ancient and modern genomes by SNP detection and phylogenomic and metagenomic analysis using PALEOMIX

by Kircher, Martin; Schaefer, Robert; McCue, Molly in 631/1647/48, 631/181/757, 631/208/212/2142

2014

Next-generation sequencing technologies have revolutionized the field of paleogenomics, allowing the reconstruction of complete ancient genomes and their comparison with modern references. However, this requires the processing of vast amounts of data and involves a large number of steps that use a variety of computational tools. Here we present PALEOMIX (http://geogenetics.ku.dk/publications/paleomix), a flexible and user-friendly pipeline applicable to both modern and ancient genomes, which largely automates the in silico analyses behind whole-genome resequencing. Starting with next-generation sequencing reads, PALEOMIX carries out adapter removal, mapping against reference genomes, PCR duplicate removal, characterization of and compensation for postmortem damage, SNP calling and maximum-likelihood phylogenomic inference, and it profiles the metagenomic contents of the samples. As such, PALEOMIX allows for a series of potential applications in paleogenomics, comparative genomics and metagenomics. Applying the PALEOMIX pipeline to the three ancient and seven modern Phytophthora infestans genomes as described here takes 5 d using a 16-core server.

Journal Article

Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution

by Kircher, Martin; Inoue, Fumitaka; Shendure, Jay in 38/109, 38/44, 38/47

2019

The majority of common variants associated with common diseases, as well as an unknown proportion of causal mutations for rare diseases, fall in noncoding regions of the genome. Although catalogs of noncoding regulatory elements are steadily improving, we have a limited understanding of the functional effects of mutations within them. Here, we perform saturation mutagenesis in conjunction with massively parallel reporter assays on 20 disease-associated gene promoters and enhancers, generating functional measurements for over 30,000 single nucleotide substitutions and deletions. We find that the density of putative transcription factor binding sites varies widely between regulatory elements, as does the extent to which evolutionary conservation or integrative scores predict functional effects. These data provide a powerful resource for interpreting the pathogenicity of clinically observed mutations in these disease-associated regulatory elements, and comprise a rich dataset for the further development of algorithms that aim to predict the regulatory effects of noncoding mutations. Interpreting genetic variation in the noncoding genome remains challenging, with functional effects difficult to predict. Here, the authors perform saturation mutagenesis combined with massively parallel reporter assays for 20 disease-associated regulatory elements, quantifying the effects of over 30,000 variants.

Journal Article

varCADD: large sets of standing genetic variation enable genome-wide pathogenicity prediction

by Nazaretyan, Lusiné; Rentzsch, Philipp; Kircher, Martin in Annotations, Anopheles, Artificial intelligence

2025

Background: Machine learning and artificial intelligence are increasingly being applied to identify phenotypically causal genetic variation. These data-driven methods require comprehensive training sets to deliver reliable results. However, large unbiased datasets for variant prioritization and effect predictions are rare as most of the available databases do not represent a broad ensemble of variant effects and are often biased towards the protein-coding genome, or even towards few well-studied genes. Methods: To overcome these issues, we propose several alternative training sets derived from subsets of human standing variation. Specifically, we use variants identified from whole-genome sequences of 71,156 individuals contained in gnomAD v3.0 and approximate the benign set with frequent standing variation and the deleterious set with rare or singleton variation. We apply the Combined Annotation Dependent Depletion framework (CADD) and train several alternative models using CADD v1.6. Results: Using the NCBI ClinVar validation set, we demonstrate that the alternative models have state-of-the-art accuracy, globally on par with deleteriousness scores of CADD v1.6 and v1.7, but also outperforming them in certain genomic regions. Being larger than conventional training datasets, including the evolutionary-derived training dataset of about 30 million variants in CADD, standing variation datasets cover a broader range of genomic regions and rare instances of the applied annotations. For example, they cover more recent evolutionary changes common in gene regulatory regions, which are more challenging to assess with conventional tools. Conclusions: Standing variation allows us to directly train state-of-the-art models for genome-wide variant prioritization or to augment evolutionary-derived variants in training. The proposed datasets have several advantages, like being substantially larger and potentially less biased. Datasets derived from standing variation represent natural allelic changes in the human genome and do not require extensive simulations and adaptations to annotations of evolutionary-derived sequence alterations used for CADD training. We provide datasets as well as trained models to the community for further development and application.
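
The labelling idea in this abstract (frequent standing variation as a proxy for benign, rare or singleton variation as a proxy for deleterious) can be illustrated with a minimal Python sketch. The abstract does not state the frequency cutoffs, so the function name `proxy_label` and the threshold `common_af` are hypothetical choices made only for illustration; this is not the authors' actual procedure.

```python
def proxy_label(allele_count, allele_number, common_af=0.001):
    """Assign a proxy training label from allele counts.

    Assumptions (not from the abstract): `common_af` is a hypothetical
    cutoff for "frequent" variation, and variants between singleton and
    that cutoff are simply left unlabeled in this sketch.
    """
    if allele_number <= 0 or allele_count <= 0:
        return None  # variant not observed, nothing to label
    if allele_count == 1:
        return "proxy_deleterious"  # singleton: seen on one chromosome only
    allele_frequency = allele_count / allele_number
    if allele_frequency >= common_af:
        return "proxy_benign"  # frequent standing variation
    return None  # intermediate frequency, unlabeled here


# 71,156 diploid genomes give at most 2 * 71,156 autosomal alleles per site.
MAX_ALLELES = 2 * 71_156

if __name__ == "__main__":
    for count in (1, 10, 5000):
        print(count, proxy_label(count, MAX_ALLELES))
```

Treating singletons separately from the frequency cutoff keeps the sketch independent of cohort size; a real training set would also need the annotation step that CADD applies to each variant.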

Journal Article

Deep proteome and transcriptome mapping of a human cancer cell line

by Kircher, Martin; Mann, Matthias; Geiger, Tamar in Base Sequence, Cancer, Cell Line, Tumor

2011

While the number and identity of proteins expressed in a single human cell type is currently unknown, this fundamental question can be addressed by advanced mass spectrometry (MS)‐based proteomics. Online liquid chromatography coupled to high‐resolution MS and MS/MS yielded 166 420 peptides with unique amino‐acid sequence from HeLa cells. These peptides identified 10 255 different human proteins encoded by 9207 human genes, providing a lower limit on the proteome in this cancer cell line. Deep transcriptome sequencing revealed transcripts for nearly all detected proteins. We calculate copy numbers for the expressed proteins and show that the abundances of >90% of them are within a factor of 60 of the median protein expression level. Comparisons of the proteome and the transcriptome, and analysis of protein complex databases and GO categories, suggest that we achieved deep coverage of the functional transcriptome and the proteome of a single cell type. More than 10 000 proteins were identified by high‐resolution mass spectrometry in a human cancer cell line. The data cover most of the functional proteome as judged by RNA‐seq data and reveal the expression range of different protein classes.

Journal Article

Massively parallel characterization of transcriptional regulatory elements

by Kircher, Martin; Shendure, Jay; Ahituv, Nadav in 38/91, 45/15, 45/23

2025

The human genome contains millions of candidate cis-regulatory elements (cCREs) with cell-type-specific activities that shape both health and many disease states. However, we lack a functional understanding of the sequence features that control the activity and cell-type-specific features of these cCREs. Here we used lentivirus-based massively parallel reporter assays (lentiMPRAs) to test the regulatory activity of more than 680,000 sequences, representing an extensive set of annotated cCREs among three cell types (HepG2, K562 and WTC11), and found that 41.7% of these sequences were active. By testing sequences in both orientations, we find promoters to have strand-orientation biases and their 200-nucleotide cores to function as non-cell-type-specific ‘on switches’ that provide similar expression levels to their associated gene. By contrast, enhancers have weaker orientation biases, but increased tissue-specific characteristics. Utilizing our lentiMPRA data, we develop sequence-based models to predict cCRE function and variant effects with high accuracy, delineate regulatory motifs and model their combinatorial effects. Testing a lentiMPRA library encompassing 60,000 cCREs in all three cell types further identified factors that determine cell-type specificity. Collectively, our work provides an extensive catalogue of functional CREs in three widely used cell lines and showcases how large-scale functional measurements can be used to dissect regulatory grammar. Lentivirus-based reporter assays for 680,000 regulatory sequences from three cell lines coupled to machine-learning models lead to insights into the grammar of cis-regulatory elements.

Journal Article

The evolution of gene expression levels in mammalian organs

by Harrigan, Patrick; Kircher, Martin; Julien, Philippe in 631/181/2474, 631/208/199, 631/208/212/2019

2011

Changes in gene expression are thought to underlie many of the phenotypic differences between species. However, large-scale analyses of gene expression evolution were until recently prevented by technological limitations. Here we report the sequencing of polyadenylated RNA from six organs across ten species that represent all major mammalian lineages (placentals, marsupials and monotremes) and birds (the evolutionary outgroup), with the goal of understanding the dynamics of mammalian transcriptome evolution. We show that the rate of gene expression evolution varies among organs, lineages and chromosomes, owing to differences in selective pressures: transcriptome change was slow in nervous tissues and rapid in testes, slower in rodents than in apes and monotremes, and rapid for the X chromosome right after its formation. Although gene expression evolution in mammals was strongly shaped by purifying selection, we identify numerous potentially selectively driven expression switches, which occurred at different rates across lineages and tissues and which probably contributed to the specific organ biology of various mammals. Gene expression and species difference: Genome analyses can uncover protein-coding changes that potentially underlie the differences between species, but many of the phenotypic differences between species are the result of regulatory mutations affecting gene expression. Brawand et al. use high-throughput RNA sequencing to study the evolutionary dynamics of mammalian transcriptomes in six major tissues (cortex, cerebellum, heart, kidney, liver and testis) from ten species from all major mammalian lineages. Among the findings is the extent of transcriptome variation between organs and species, as well as the identification of potentially selectively driven expression switches that may have shaped specific organ biology.

Journal Article

Using individual barcodes to increase quantification power of massively parallel reporter assays

by Göbel-Knapp, Angelina; Keukeleire, Pia; Kircher, Martin in Algorithms, Bar codes, Binomial distribution

2025

Background: Massively parallel reporter assays (MPRAs) are an experimental technology for measuring the activity of thousands of candidate regulatory sequences or their variants in parallel, where the activity of individual sequences is measured from pools of sequence-tagged reporter genes. Activity is derived from the ratio of transcribed RNA to input DNA counts of associated tag sequences in each reporter construct, so-called barcodes. Recently, tools specifically designed to analyze MPRA data were developed that attempt to model the count data, accounting for its inherent variation. Of these tools, MPRAnalyze and mpralm are most widely used. MPRAnalyze models barcode counts to estimate the transcription rate of each sequence. While it has increased statistical power and robustness against outliers compared to mpralm, it is slow and has a high false discovery rate. Mpralm, a tool built on the R package Limma, estimates log fold-changes between different sequences. As opposed to MPRAnalyze, it is fast and has a low false discovery rate but is susceptible to outliers and has less statistical power. Results: We propose BCalm, an MPRA analysis framework aimed at addressing the limitations of the existing tools. BCalm is an adaptation of mpralm, but models individual barcode counts instead of aggregating counts per sequence. Leaving out the aggregation step increases statistical power and improves robustness to outliers, while being fast and precise. We show the improved performance over existing methods on both simulated MPRA data and a lentiviral MPRA library of 166,508 target sequences, including 82,258 allelic variants. Further, BCalm adds functionality beyond the existing mpralm package, such as preparing count input files from MPRAsnakeflow, as well as an option to test for sequences with enhancing or repressing activity. Its built-in plotting functionalities allow for easy interpretation of the results. Conclusions: With BCalm, we provide a new tool for analyzing MPRA data which is robust and accurate on real MPRA datasets. The package is available at https://github.com/kircherlab/BCalm.
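
The difference between per-barcode and per-sequence aggregated activity estimates described in this abstract can be illustrated with a minimal Python sketch. It only computes log2 RNA/DNA ratios and is not the statistical model used by BCalm, mpralm or MPRAnalyze; the function names, the record layout and the `pseudocount` parameter are assumptions for illustration, and sequencing-depth normalization is omitted.

```python
import math
from collections import defaultdict


def per_barcode_log_ratios(records, pseudocount=1.0):
    """Return {sequence_id: [log2(RNA/DNA) per barcode]}.

    records: iterable of (sequence_id, barcode, dna_count, rna_count).
    The pseudocount avoids division by zero and log of zero.
    """
    ratios = defaultdict(list)
    for seq_id, _barcode, dna, rna in records:
        ratios[seq_id].append(
            math.log2((rna + pseudocount) / (dna + pseudocount))
        )
    return dict(ratios)


def aggregated_log_ratios(records, pseudocount=1.0):
    """Return {sequence_id: log2(sum RNA / sum DNA)} after pooling barcodes."""
    dna_sum = defaultdict(float)
    rna_sum = defaultdict(float)
    for seq_id, _barcode, dna, rna in records:
        dna_sum[seq_id] += dna
        rna_sum[seq_id] += rna
    return {
        seq_id: math.log2(
            (rna_sum[seq_id] + pseudocount) / (dna_sum[seq_id] + pseudocount)
        )
        for seq_id in dna_sum
    }


if __name__ == "__main__":
    # Hypothetical counts: two barcodes for one enhancer, one for a control.
    example = [
        ("enh1", "AAA", 10, 40),
        ("enh1", "CCC", 20, 20),
        ("ctrl", "GGG", 7, 7),
    ]
    print(per_barcode_log_ratios(example))
    print(aggregated_log_ratios(example))
```

Keeping one value per barcode preserves the spread between barcodes of the same sequence, which pooling collapses into a single number; that spread is the information a barcode-level model can use.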

Journal Article
