Catalogue Search | MBRL

Applications of machine learning in drug discovery and development

by Zhao, Shanrong , Dunham, Ian , Li, Bin in Algorithms , Bioinformatics , Biomarkers

2019

Drug discovery and development pipelines are long, complex and depend on numerous factors. Machine learning (ML) approaches provide a set of tools that can improve discovery and decision making for well-specified questions with abundant, high-quality data. Opportunities to apply ML occur in all stages of drug discovery. Examples include target validation, identification of prognostic biomarkers and analysis of digital pathology data in clinical trials. Applications have ranged in context and methodology, with some approaches yielding accurate predictions and insights. The challenges of applying ML lie primarily with the lack of interpretability and repeatability of ML-generated results, which may limit their application. In all areas, systematic and comprehensive high-dimensional data still need to be generated. With ongoing efforts to tackle these issues, as well as increasing awareness of the factors needed to validate ML approaches, the application of ML can promote data-driven decision making and has the potential to speed up the process and reduce failure rates in drug discovery and development. Machine learning has been applied to numerous stages in the drug discovery pipeline. Here, Vamathevan and colleagues discuss the most useful techniques and how machine learning can promote data-driven decision making in drug discovery and development. They highlight major hurdles in the field, such as the required data characteristics for applying machine learning, which will need to be solved as machine learning matures.

Journal Article

Single-cell analyses of Crohn’s disease tissues reveal intestinal intraepithelial T cells heterogeneity and altered subset distributions

by Bugatti, Mattia , Wynn, Thomas A. , Di Luccia, Blanda in 13/1 , 13/106 , 13/21

2021

Crohn’s disease (CD) is a chronic transmural inflammation of intestinal segments caused by dysregulated interaction between microbiome and gut immune system. Here, we profile, via multiple single-cell technologies, T cells purified from the intestinal epithelium and lamina propria (LP) from terminal ileum resections of adult severe CD cases. We find that intraepithelial lymphocytes (IEL) contain several unique T cell subsets, including NKp30+ γδT cells expressing RORγt and producing IL-26 upon NKp30 engagement. Further analyses comparing tissues from non-inflamed and inflamed regions of patients with CD versus healthy controls show increased activated TH17 but decreased CD8+ T, γδT, TFH and Treg cells in inflamed tissues. Similar analyses of LP find increased CD8+, as well as reduced CD4+ T cells with an elevated TH17 over Treg/TFH ratio. Our analyses of CD tissues thus suggest a potential link, pending additional validations, between transmural inflammation, reduced IEL γδT cells and altered spatial distribution of IEL and LP T cell subsets. Crohn’s disease results from transmural inflammation in the gut, but analyses of local immune populations are still lacking. Here, the authors show, by combining multiple single-cell approaches, that intraepithelial and lamina propria T cells are heterogenous, show unique phenotypes, and exhibit altered subsets upon inflammation.

Journal Article

Assessment of the Impact of Using a Reference Transcriptome in Mapping Short RNA-Seq Reads

by Zhao, Shanrong in Algorithms , Alignment , Analysis

2014

RNA-Seq has become increasingly popular in transcriptome profiling. The major challenge in RNA-Seq data analysis is the accurate mapping of junction reads to their genomic origins. To detect splicing sites in short reads, many RNA-Seq aligners use a reference transcriptome to inform placement of junction reads. However, no systematic evaluation has been performed to assess or quantify the benefits of incorporating a reference transcriptome in mapping RNA-Seq reads. In this paper, we have studied the impact of a reference transcriptome on mapping RNA-Seq reads, especially junction reads. The same datasets were analysed with and without the RefGene transcriptome, respectively. Then a Perl script was developed to analyse and compare the mapping results. It was found that about 50-55% of junction reads can be mapped to the same genomic regions regardless of the usage of the RefGene model. More than one-third of reads fail to be mapped without the help of a reference transcriptome. For "Alternatively" mapped reads, i.e., those reads mapped differently with and without the RefGene model, the mappings without the RefGene model are usually worse than their corresponding alignments with the RefGene model. Junction reads that span more than two exons are less likely to be aligned correctly without the assistance of a reference transcriptome. As the sequencing technology evolves, the read length is becoming longer and longer. When reads become longer, they are more likely to span multiple exons, and thus the mapping of long junction reads is actually becoming more and more challenging without the assistance of a reference transcriptome. Therefore, the advantages of using a reference transcriptome in the mapping demonstrated in this study are becoming more evident for longer reads. In addition, the effect of the completeness of the reference transcriptome on mapping of RNA-Seq reads is discussed.

Journal Article

Comparison of RNA-Seq and Microarray in Transcriptome Profiling of Activated T Cells

by Fung-Leung, Wai-Ping , Bittner, Anton , Ngo, Karen in Accuracy , Analysis of Variance , Annotations

2014

To demonstrate the benefits of RNA-Seq over microarray in transcriptome profiling, both RNA-Seq and microarray analyses were performed on RNA samples from a human T cell activation experiment. In contrast to other reports, our analyses focused on the difference, rather than similarity, between RNA-Seq and microarray technologies in transcriptome profiling. A comparison of data sets derived from RNA-Seq and Affymetrix platforms using the same set of samples showed a high correlation between gene expression profiles generated by the two platforms. However, it also demonstrated that RNA-Seq was superior in detecting low abundance transcripts, differentiating biologically critical isoforms, and allowing the identification of genetic variants. RNA-Seq also demonstrated a broader dynamic range than microarray, which allowed for the detection of more differentially expressed genes with higher fold-change. Analysis of the two datasets also showed the benefit derived from avoidance of technical issues inherent to microarray probe performance such as cross-hybridization, non-specific hybridization and limited detection range of individual probes. Because RNA-Seq does not rely on a pre-designed complement sequence detection probe, it is devoid of issues associated with probe redundancy and annotation, which simplified interpretation of the data. Despite the superior benefits of RNA-Seq, microarrays are still the more common choice of researchers when conducting transcriptional profiling experiments. This is likely because RNA-Seq sequencing technology is new to most researchers, more expensive than microarray, data storage is more challenging and analysis is more complex. We expect that once these barriers are overcome, the RNA-Seq platform will become the predominant tool for transcriptome analysis.

Journal Article

Evaluation of two main RNA-seq approaches for gene quantification in clinical RNA sequencing: polyA+ selection versus rRNA depletion

by von Schack, David , Zhang, Ying , Zhang, Baohong in 631/114/2163 , 631/114/2785 , 631/1647/48

2018

To allow efficient transcript/gene detection, highly abundant ribosomal RNAs (rRNA) are generally removed from total RNA either by positive polyA+ selection or by rRNA depletion (negative selection) before sequencing. Comparisons between the two methods have been carried out by various groups, but the assessments have relied largely on non-clinical samples. In this study, we evaluated these two RNA sequencing approaches using human blood and colon tissue samples. Our analyses showed that rRNA depletion captured more unique transcriptome features, whereas polyA+ selection outperformed rRNA depletion with higher exonic coverage and better accuracy of gene quantification. For blood- and colon-derived RNAs, we found that 220% and 50% more reads, respectively, would have to be sequenced to achieve the same level of exonic coverage in the rRNA depletion method compared with the polyA+ selection method. Therefore, in most cases we strongly recommend polyA+ selection over rRNA depletion for gene quantification in clinical RNA sequencing. Our evaluation revealed that a small number of lncRNAs and small RNAs made up a large fraction of the reads in the rRNA depletion RNA sequencing data. Thus, we recommend that these RNAs are specifically depleted to improve the sequencing depth of the remaining RNAs.

Journal Article

Evaluation and comparison of computational tools for RNA-seq isoform quantification

by Zhang, Chi , Zhang, Baohong , Zhao, Shanrong in Accuracy , Algorithms , Alternative splicing

2017

Background: Alternatively spliced transcript isoforms are commonly observed in higher eukaryotes. The expression levels of these isoforms are key for understanding normal functions in healthy tissues and the progression of disease states. However, accurate quantification of expression at the transcript level is limited with current RNA-seq technologies because of, for example, limited read length and the cost of deep sequencing. Results: A large number of tools have been developed to tackle this problem, and we performed a comprehensive evaluation of these tools using both experimental and simulated RNA-seq datasets. We found that recently developed alignment-free tools are both fast and accurate. The accuracy of all methods was mainly influenced by the complexity of gene structures, and caution must be taken when interpreting quantification results for short transcripts. Using TP53 gene simulation, we discovered that both sequencing depth and the relative abundance of different isoforms affect quantification accuracy. Conclusions: Our comprehensive evaluation helps data analysts to make an informed choice when selecting computational tools for isoform quantification.

Journal Article

Standard machine learning approaches outperform deep representation learning on phenotype prediction from transcriptomics data

by Ziemek, Daniel , Henstock, Peter , Mu, Xinmeng Jasmine in Algorithms , Atopic dermatitis , Benchmarking

2020

Background: The ability to confidently predict health outcomes from gene expression would catalyze a revolution in molecular diagnostics. Yet, the goal of developing actionable, robust, and reproducible predictive signatures of phenotypes such as clinical outcome has not been attained in almost any disease area. Here, we report a comprehensive analysis spanning prediction tasks from ulcerative colitis, atopic dermatitis, diabetes, to many cancer subtypes for a total of 24 binary and multiclass prediction problems and 26 survival analysis tasks. We systematically investigate the influence of gene subsets, normalization methods and prediction algorithms. Crucially, we also explore the novel use of deep representation learning methods on large transcriptomics compendia, such as GTEx and TCGA, to boost the performance of state-of-the-art methods. The resources and findings in this work should serve as both an up-to-date reference on attainable performance, and as a benchmarking resource for further research. Results: Approaches that combine large numbers of genes outperformed single gene methods consistently and with a significant margin, but neither unsupervised nor semi-supervised representation learning techniques yielded consistent improvements in out-of-sample performance across datasets. Our findings suggest that using l2-regularized regression methods applied to centered log-ratio transformed transcript abundances provides the best predictive analyses overall. Conclusions: Transcriptomics-based phenotype prediction benefits from proper normalization techniques and state-of-the-art regularized regression approaches. In our view, breakthrough performance is likely contingent on factors which are independent of normalization and general modeling techniques; these factors might include reduction of systematic errors in sequencing data, incorporation of other data types such as single-cell sequencing and proteomics, and improved use of prior knowledge.
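The centered log-ratio (CLR) transformation mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `clr`, the pseudocount of 1.0, and the sample values are all assumptions made here for the example. It covers only the transformation step, not the regularized regression.

```python
import math


def clr(abundances, pseudocount=1.0):
    """Centered log-ratio transform of one sample's transcript abundances.

    Each value is log-transformed (after adding a pseudocount so zeros are
    defined), then the mean of the logs for that sample is subtracted.
    The pseudocount is an assumption of this sketch.
    """
    if not abundances:
        raise ValueError("abundances must not be empty")
    logs = [math.log(a + pseudocount) for a in abundances]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Hypothetical abundances for three transcripts in a single sample.
sample = [0.0, 9.0, 99.0]
transformed = clr(sample)
```

By construction the transformed values of a sample sum to zero, so each value expresses a transcript's abundance relative to the geometric mean of that sample.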

Journal Article

RORγt and RORα signature genes in human Th17 cells

by Fung-Leung, Wai-Ping , Blevitt, Jonathan , Zhao, Shanrong in Arthritis , Autoimmune diseases , Biology and life sciences

2017

RORγt and RORα are transcription factors of the RAR-related orphan nuclear receptor (ROR) family. They are expressed in Th17 cells and have been suggested to play a role in Th17 differentiation. Although RORγt signature genes have been characterized in mouse Th17 cells, detailed information on its transcriptional control in human Th17 cells is limited, and even less is known about RORα signature genes, which have not been reported in either human or mouse T cells. In this study, global gene expression of human CD4 T cells activated under Th17 skewing conditions was profiled by RNA sequencing. RORγt and RORα signature genes were identified in these Th17 cells treated with specific siRNAs to knock down RORγt or RORα expression. We have generated selective small molecule RORγt modulators and they were also utilized as pharmacological tools in RORγt signature gene identification. Our results showed that RORγt controlled the expression of a very selective number of genes in Th17 cells, and most of them were regulated by RORα as well, albeit with a weaker influence. Key Th17 genes including IL-17A, IL-17F, IL-23R, CCL20 and CCR6 were shown to be regulated by both RORγt and RORα. Our results demonstrated an overlapping role of RORγt and RORα in human Th17 cell differentiation through regulation of a defined common set of Th17 genes. RORγt as a drug target for treatment of Th17 mediated autoimmune diseases such as psoriasis has been demonstrated recently in clinical trials. Our results suggest that RORα could be involved in the same disease mechanisms, and gene signatures identified in this report could be valuable biomarkers for tracking the pharmacodynamic effects of compounds that modulate RORγt or RORα activities in patients.

Journal Article

Comparison of stranded and non-stranded RNA-seq transcriptome profiling and investigation of gene overlap

by von Schack, David , Gordon, William , Xi, Hualin in Animal Genetics and Genomics , Biomedical and Life Sciences , Blood

2015

Background: While RNA-sequencing (RNA-seq) is becoming a powerful technology in transcriptome profiling, one significant shortcoming of the first-generation RNA-seq protocol is that it does not retain the strand specificity of origin for each transcript. Without strand information it is difficult and sometimes impossible to accurately quantify gene expression levels for genes with overlapping genomic loci that are transcribed from opposite strands. It has recently become possible to retain the strand information by modifying the RNA-seq protocol, known as strand-specific or stranded RNA-seq. Here, we evaluated the advantages of stranded RNA-seq in transcriptome profiling of whole blood RNA samples compared with non-stranded RNA-seq, and investigated the influence of gene overlaps on gene expression profiling results based on practical RNA-seq datasets and also from a theoretical perspective. Results: Our results demonstrated a substantial impact of stranded RNA-seq on transcriptome profiling and gene expression measurements. As many as 1751 genes in Gencode Release 19 were identified to be differentially expressed when comparing stranded and non-stranded RNA-seq whole blood samples. Antisense and pseudogenes were significantly enriched in differential expression analyses. Because stranded RNA-seq retains strand information of a read, we can resolve read ambiguity in overlapping genes transcribed from opposite strands, which provides a more accurate quantification of gene expression levels compared with traditional non-stranded RNA-seq. In the human genome, it is not uncommon to find genomic loci where both strands encode distinct genes. Among the over 57,800 annotated genes in Gencode release 19, there are an estimated 19% (about 11,000) of overlapping genes transcribed from the opposite strands. Based on our whole blood mRNA-seq datasets, the fraction of overlapping nucleotide bases on the same and opposite strands were estimated at 2.94% and 3.1%, respectively. The corresponding theoretical estimations are 3% and 3.6%, well in agreement with our own findings. Conclusions: Stranded RNA-seq provides a more accurate estimate of transcript expression compared with non-stranded RNA-seq, and is therefore the recommended RNA-seq approach for future mRNA-seq studies.

Journal Article

Union Exon Based Approach for RNA-Seq Gene Quantification: To Be or Not to Be?

by Xi, Li , Zhao, Shanrong , Zhang, Baohong in Accuracy , Algorithms , Annotations

2015

In recent years, RNA-seq has been emerging as a powerful technology in estimation of gene and/or transcript expression, and RPKM (Reads Per Kilobase per Million reads) is widely used to represent the relative abundance of mRNAs for a gene. In general, the methods for gene quantification can be largely divided into two categories: the transcript-based approach and the 'union exon'-based approach. The transcript-based approach is intrinsically more difficult because different isoforms of the gene typically have a high proportion of genomic overlap. On the other hand, the 'union exon'-based approach is much simpler and thus widely used in RNA-seq gene quantification. Biologically, a gene is expressed in one or more transcript isoforms. Therefore, the transcript-based approach is logistically more meaningful than the 'union exon'-based approach. Despite the fact that gene quantification is a fundamental task in most RNA-seq studies, it remains unclear whether the 'union exon'-based approach for RNA-seq gene quantification is a good practice or not. In this paper, we carried out a side-by-side comparison of the 'union exon'-based approach and the transcript-based method in RNA-seq gene quantification. It was found that the gene expression levels are significantly underestimated by the 'union exon'-based approach, and the average of RPKM from the 'union exon'-based method is less than 50% of the mean expression obtained from the transcript-based approach. The difference between the two approaches is primarily affected by the number of transcripts in a gene. We performed differential analysis at both gene and transcript levels, respectively, and found that more insights, such as isoform switches, are gained from isoform differential analysis. The accuracy of isoform quantification would improve if the read coverage pattern and exon-exon spanning reads are taken into account and incorporated into the EM (Expectation Maximization) algorithm. Our investigation discourages the use of the 'union exon'-based approach in gene quantification despite its simplicity.
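The RPKM measure defined in the abstract above can be made concrete with a short sketch. The function name `rpkm` and all counts and lengths below are hypothetical values chosen for illustration; they are not taken from the paper, and the example does not reproduce its analysis.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads.

    RPKM = reads / (length in kilobases * total mapped reads in millions),
    which simplifies to reads * 1e9 / (length in bp * total mapped reads).
    """
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads in a library of 10 million mapped reads.
reads = 500
library_size = 10_000_000

# Assumed lengths: the union of all exons is at least as long as any
# single isoform, so it gives the larger denominator.
union_exon_length = 2500   # bp, hypothetical
isoform_length = 1000      # bp, hypothetical

rpkm_union = rpkm(reads, union_exon_length, library_size)
rpkm_isoform = rpkm(reads, isoform_length, library_size)
```

For a fixed read count, a longer effective length yields a lower RPKM, which is one way to see why the choice of length definition matters when comparing the two quantification approaches.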

Journal Article
