Catalogue Search | MBRL

Assessing and mitigating batch effects in large-scale omics studies

by Yu, Ying , Zheng, Yuanting , Mai, Yuanbang in Algorithms , Animal Genetics and Genomics , Bioinformatics

2024

Batch effects in omics data are notoriously common technical variations unrelated to study objectives, and may result in misleading outcomes if uncorrected, or hinder biomedical discovery if over-corrected. Assessing and mitigating batch effects is crucial for ensuring the reliability and reproducibility of omics data and minimizing the impact of technical variations on biological interpretation. In this review, we highlight the profound negative impact of batch effects and the urgent need to address this challenging problem in large-scale omics studies. We summarize potential sources of batch effects, current progress in evaluating and correcting them, and consortium efforts aiming to tackle them.

Journal Article

Optimized CRISPR guide RNA design for two high-fidelity Cas9 variants by deep learning

by Wang, Bei , Wang, Yongming , Zhou, Yan in 13/1 , 13/109 , 42/41

2019

Highly specific Cas9 nucleases derived from SpCas9 are valuable tools for genome editing, but their wide applications are hampered by a lack of knowledge governing guide RNA (gRNA) activity. Here, we perform a genome-scale screen to measure gRNA activity for two highly specific SpCas9 variants (eSpCas9(1.1) and SpCas9-HF1) and wild-type SpCas9 (WT-SpCas9) in human cells, and obtain indel rates of over 50,000 gRNAs for each nuclease, covering ~20,000 genes. We evaluate the contribution of 1,031 features to gRNA activity and develop models for activity prediction. Our data reveal that a combination of RNN with important biological features outperforms other models for activity prediction. We further demonstrate that our model outperforms other popular gRNA design tools. Finally, we develop an online design tool DeepHF for the three Cas9 nucleases. The database, as well as the designer tool, is freely accessible via a web server, http://www.DeepHF.com/ . Application of highly specific Cas9 variants can be restricted by the design of the guide RNA. Here the authors present DeepHF, a gRNA activity prediction tool built from genome-scale screens of 50,000 guides covering 20,000 genes.

Journal Article

A Comprehensive Mouse Transcriptomic BodyMap across 17 Tissues by RNA-seq

by Gondo, Yoichi , Qing, Tao , Li, Bin in 631/114/2402 , 631/114/2403 , 631/114/2404

2017

The mouse has been widely used as a model organism for studying human diseases and for evaluating drug safety and efficacy. Many diseases and drug effects exhibit tissue specificity that may be reflected by tissue-specific gene-expression profiles. Here we construct a comprehensive mouse transcriptomic BodyMap across 17 tissues of six-week-old C57BL/6JJcl mice using RNA-seq. We find different expression patterns between protein-coding and non-coding genes. Liver expressed the least complex transcriptome, that is, the smallest number of genes was detected in liver across all 17 tissues, whereas testis and ovary harbor more complex transcriptomes than other tissues. We report a comprehensive list of tissue-specific genes across 17 tissues, along with a list of 4,781 housekeeping genes in mouse. In addition, we propose a list of 27 consistently and highly expressed genes that can be used as reference controls in expression-profiling analysis. Our study provides a unique resource of mouse gene-expression profiles, which is helpful for further biomedical research.

Journal Article

Correcting batch effects in large-scale multiomics studies using a reference-material-based ratio method

by Yang, Jingcheng , Chen, Qingwang , Hong, Huixiao in Algorithms , Animal Genetics and Genomics , Base Composition

2023

Background: Batch effects are notoriously common technical variations in multiomics data and may result in misleading outcomes if uncorrected or over-corrected. A plethora of batch-effect correction algorithms are proposed to facilitate data integration. However, their respective advantages and limitations are not adequately assessed in terms of omics types, the performance metrics, and the application scenarios. Results: As part of the Quartet Project for quality control and data integration of multiomics profiling, we comprehensively assess the performance of seven batch-effect correction algorithms based on different performance metrics of clinical relevance, i.e., the accuracy of identifying differentially expressed features, the robustness of predictive models, and the ability of accurately clustering cross-batch samples into their own donors. The ratio-based method, i.e., by scaling absolute feature values of study samples relative to those of concurrently profiled reference material(s), is found to be much more effective and broadly applicable than others, especially when batch effects are completely confounded with biological factors of study interests. We further provide practical guidelines for implementing the ratio-based approach in increasingly large-scale multiomics studies. Conclusions: Multiomics measurements are prone to batch effects, which can be effectively corrected using ratio-based scaling of the multiomics data. Our study lays the foundation for eliminating batch effects at a ratio scale.
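The ratio-based idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `ratio_correct`, the sample identifiers, and the toy values are all hypothetical, and the sketch assumes a purely multiplicative batch shift and a per-feature mean over the reference samples profiled in the same batch.

```python
from statistics import mean


def ratio_correct(batch, reference_ids):
    """Scale every sample in one batch by the reference material of that batch.

    batch: dict mapping sample id -> list of feature values (one batch only).
    reference_ids: ids of the reference-material samples profiled in this batch.
    Returns a dict mapping sample id -> list of ratios to the per-feature
    mean of the reference samples.
    """
    ref_rows = [batch[s] for s in reference_ids]
    # Per-feature mean across the reference samples of this batch.
    ref_mean = [mean(col) for col in zip(*ref_rows)]
    return {
        s: [v / r for v, r in zip(vals, ref_mean)]
        for s, vals in batch.items()
    }


# Hypothetical toy data: batch 2 carries a 2x technical shift on every feature.
batch1 = {"ref": [10.0, 20.0], "s1": [20.0, 10.0]}
batch2 = {"ref": [20.0, 40.0], "s1": [40.0, 20.0]}

c1 = ratio_correct(batch1, ["ref"])
c2 = ratio_correct(batch2, ["ref"])
print(c1["s1"], c2["s1"])  # the study sample now agrees across batches
```

Because the reference material is measured alongside the study samples in each batch, any shift shared by all samples in a batch cancels in the ratio, which is why the approach can still work when batch and biological group are confounded.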

Journal Article

AI-powered omics-based drug pair discovery for pyroptosis therapy targeting triple-negative breast cancer

by Shen, Shun , Chen, Qingwang , Su, Xiaomin in 13/1 , 13/31 , 14/19

2024

Due to low success rates and long cycles of traditional drug development, the clinical tendency is to apply omics techniques to reveal patient-level disease characteristics and individualized responses to treatment. However, the heterogeneous form of data and uneven distribution of targets make drug discovery and precision medicine a non-trivial task. This study takes pyroptosis therapy for triple-negative breast cancer (TNBC) as a paradigm and uses data mining of a large TNBC cohort and drug databases to establish a biofactor-regulated neural network for rapidly screening and optimizing compound pyroptosis drug pairs. Subsequently, biomimetic nanococrystals are prepared using the preferred combination of mitoxantrone and gambogic acid for rational drug delivery. The unique mechanism of obtained nanococrystals regulating pyroptosis genes through ribosomal stress and triggering pyroptosis cascade immune effects are revealed in TNBC models. In this work, a target omics-based intelligent compound drug discovery framework explores an innovative drug development paradigm, which repurposes existing drugs and enables precise treatment of refractory diseases. Cancer-targeted drug discovery can be achieved by transcriptomics screening on patients. Here this group reports a drug target screening model built upon triple-negative breast cancer (TNBC) cohort and drug database with the selected drug pair exhibiting effective pyroptosis induction and TNBC tumor growth inhibition.

Journal Article

High quality 3C de novo assembly and annotation of a multidrug resistant ST-111 Pseudomonas aeruginosa genome: Benchmark of hybrid and non-hybrid assemblers

by Campos-Sánchez, Rebeca , García, Fernando , Molina-Mora, José Arturo in 631/114 , 631/114/2785 , 631/114/2785/2302

2020

Genotyping methods and genome sequencing are indispensable to reveal the genomic structure of bacterial species displaying a high level of genome plasticity. However, reconstruction of genome or assembly is not straightforward due to data complexity, including repeats, mobile and accessory genetic elements of bacterial genomes. Moreover, since the solution to this problem is strongly influenced by sequencing technology, bioinformatics pipelines, and selection criteria to assess assemblers, there is no systematic way to select a priori the optimal assembler and parameter settings. To assemble the genome of Pseudomonas aeruginosa strain AG1 (PaeAG1), short reads (Illumina) and long reads (Oxford Nanopore) sequencing data were used in 13 different non-hybrid and hybrid approaches. PaeAG1 is a multiresistant high-risk sequence type 111 (ST-111) clone that was isolated from a Costa Rican hospital and it was the first report of an isolate of P. aeruginosa carrying both blaVIM-2 and blaIMP-18 genes encoding for metallo-β-lactamases (MBL) enzymes. To assess the assemblies, multiple metrics regarding contiguity, correctness and completeness (3C criterion, as we define here) were used for benchmarking the 13 approaches and selecting a definitive assembly. In addition, annotation was done to identify genes (coding and RNA regions) and to describe the genomic content of PaeAG1. Whereas long reads and hybrid approaches showed better performances in terms of contiguity, higher correctness and completeness metrics were obtained for short read only and hybrid approaches. A manually curated and polished hybrid assembly gave rise to a single circular sequence with 100% of core genes and known regions identified, >98% of reads mapped back, no gaps, and uniform coverage. The strategy followed to obtain this high-quality 3C assembly is detailed in the manuscript and we provide readers with an all-in-one script to replicate our results or to apply it to other troublesome cases.
The final 3C assembly revealed that the PaeAG1 genome has 7,190,208 bp, a 65.7% GC content and 6,709 genes (6,620 coding sequences), many of which are included in multiple mobile genomic elements, such as 57 genomic islands, six prophages, and two complete integrons with blaVIM-2 and blaIMP-18 MBL genes. Up to 250 and 60 of the predicted genes are anticipated to play a role in virulence (adherence, quorum sensing and secretion) or antibiotic resistance (β-lactamases, efflux pumps, etc). Altogether, the assembly and annotation of the PaeAG1 genome provide new perspectives to continue studying the genomic diversity and gene content of this important human pathogen.

Journal Article

Genomic and immune profiling of pre-invasive lung adenocarcinoma

by Bao, Ding , Freeman, Samuel S. , Zhang, Yawei in 45/23 , 45/91 , 631/208

2019

Adenocarcinoma in situ and minimally invasive adenocarcinoma are the pre-invasive forms of lung adenocarcinoma. The genomic and immune profiles of these lesions are poorly understood. Here we report exome and transcriptome sequencing of 98 lung adenocarcinoma precursor lesions and 99 invasive adenocarcinomas. We have identified EGFR, RBM10, BRAF, ERBB2, TP53, KRAS, MAP2K1 and MET as significantly mutated genes in the pre/minimally invasive group. Classes of genome alterations that increase in frequency during the progression to malignancy are revealed. These include mutations in TP53, arm-level copy number alterations, and HLA loss of heterozygosity. Immune infiltration is correlated with copy number alterations of chromosome arm 6p, suggesting a link between arm-level events and the tumor immune environment. The genomic and immune landscape of pre-invasive lung adenocarcinoma is poorly understood. Here, the authors perform exome and transcriptome sequencing on precursor lesions and invasive lung adenocarcinomas, identifying recurrently mutated genes in pre/minimally invasive cases, and arm-level alteration events linked to immune infiltration.

Journal Article

Prediction of base editor off-targets by deep learning

by Yang, Jingcheng , Wang, Yongming , Yang, Yuan in 49/23 , 631/114/1305 , 631/208/4041/3196

2023

Due to the tolerance of mismatches between gRNA and targeting sequence, base editors frequently induce unwanted Cas9-dependent off-target mutations. Here, to develop models to predict such off-targets, we design gRNA-off-target pairs for adenine base editors (ABEs) and cytosine base editors (CBEs) and stably integrate them into the human cells. After five days of editing, we obtain valid efficiency datasets of 54,663 and 55,727 off-targets for ABEs and CBEs, respectively. We use the datasets to train deep learning models, resulting in ABEdeepoff and CBEdeepoff, which can predict off-target sites. We use these tools to predict off-targets for a panel of endogenous loci and achieve Spearman correlation values varying from 0.710 to 0.859. Finally, we develop an integrated tool that is freely accessible via an online web server http://www.deephf.com/#/bedeep/bedeepoff . These tools could facilitate minimizing the off-target effects of base editing. Base editors can induce unwanted off-target effects. Here the authors design libraries of gRNA-off-target pairs and perform a screen to obtain editing efficiencies for ABE and CBE: they use the datasets to train DL models (ABEdeepoff and CBEdeepoff) which can predict mutation tolerance at potential off-targets.

Journal Article

A real-world multi-center RNA-seq benchmarking study using the Quartet and MAQC reference materials

by Chen, Qingwang , Liu, Cong , Shi, Leming in 38/91 , 631/61/514/1949 , 692/700/139/1420

2024

Translating RNA-seq into clinical diagnostics requires ensuring the reliability and cross-laboratory consistency of detecting clinically relevant subtle differential expressions, such as those between different disease subtypes or stages. As part of the Quartet project, we present an RNA-seq benchmarking study across 45 laboratories using the Quartet and MAQC reference samples spiked with ERCC controls. Based on multiple types of ‘ground truth’, we systematically assess the real-world RNA-seq performance and investigate the influencing factors involved in 26 experimental processes and 140 bioinformatics pipelines. Here we show greater inter-laboratory variations in detecting subtle differential expressions among the Quartet samples. Experimental factors including mRNA enrichment and strandedness, and each bioinformatics step, emerge as primary sources of variations in gene expression. We underscore the profound influence of experimental execution, and provide best practice recommendations for experimental designs, strategies for filtering low-expression genes, and the optimal gene annotation and analysis pipelines. In summary, this study lays the foundation for developing and quality control of RNA-seq for clinical diagnostic purposes. Here the authors report on an RNA-seq benchmarking study that demonstrates greater inter-lab variations in detecting subtle differential expression. The study reveals the impact of experimental execution, experimental designs, low-expression gene filtering, and analysis tool selection.

Journal Article

Similarities and differences between variants called with human reference genome HG19 or HG38

by Liu, Zhichao , Guo, Wenjing , Hong, Huixiao in Algorithms , Alleles , BASIC BIOLOGICAL SCIENCES

2019

Background: Reference genome selection is a prerequisite for successful analysis of next generation sequencing (NGS) data. Current practice employs one of the two most recent human reference genome versions: HG19 or HG38. To date, the impact of genome version on SNV identification has not been rigorously assessed. Methods: We conducted analysis comparing the SNVs identified based on HG19 vs HG38, leveraging whole genome sequencing (WGS) data from the genome-in-a-bottle (GIAB) project. First, SNVs were called using 26 different bioinformatics pipelines with either HG19 or HG38. Next, two tools were used to convert the called SNVs between HG19 and HG38. Lastly, we calculated conversion rates, analyzed discordant rates between SNVs called with HG19 or HG38, and characterized the discordant SNVs. Results: The conversion rates from HG38 to HG19 (average 95%) were lower than the conversion rates from HG19 to HG38 (average 99%). The conversion rates varied slightly among the various calling pipelines. Around 1.5% of SNVs were discordantly converted between HG19 and HG38. The conversions from HG38 to HG19 had more SNVs which failed conversion and more discordant SNVs than the opposite conversion (HG19 to HG38). Most of the discordant SNVs had low read depth, were low confidence SNVs as defined by GIAB, and/or were predominated by G/C alleles (52% observed versus 42% expected). Conclusion: A significant number of SNVs could not be converted between HG19 and HG38. Based on careful review of our comparisons, we recommend HG38 (the newer version) for NGS SNV analysis. To summarize, our findings suggest caution when translating identified SNVs between different versions of the human reference genome.

Journal Article
