Catalogue Search | MBRL

Towards pan-genome read alignment to improve variation calling

by Norri, Tuukka , Välimäki, Niko , Pitkänen, Esa in Access to Information , Analysis , Animal Genetics and Genomics

2018

Background: A typical human genome differs from the reference genome at 4-5 million sites. This diversity is increasingly catalogued in repositories such as ExAC/gnomAD, consisting of >15,000 whole genomes and >126,000 exome sequences from different individuals. Despite this enormous diversity, resequencing data workflows are still based on a single human reference genome. Identification and genotyping of genetic variants is typically carried out on short-read data aligned to a single reference, disregarding the underlying variation. Results: We propose a new unified framework for variant calling with short-read data utilizing a representation of human genetic variation – a pan-genomic reference. We provide a modular pipeline that can be seamlessly incorporated into existing sequencing data analysis workflows. Our tool is open source and available online: https://gitlab.com/dvalenzu/PanVC. Conclusions: Our experiments show that by replacing a standard human reference with a pan-genomic one we achieve an improvement in single-nucleotide variant calling accuracy and in short indel calling accuracy over the widely adopted Genome Analysis Toolkit (GATK) in difficult genomic regions.

Journal Article

SVNN: an efficient PacBio-specific pipeline for structural variations calling using neural networks

by Hadadian Nejad Yousefi, Mostafa , Goudarzi, Maziar , Akbarinejad, Shaya in Algorithms , Bioinformatics , Biomedical and Life Sciences

2021

Background: Once aligned, long reads can be a useful source of information to identify the type and position of structural variations. However, due to the high sequencing error rate of long reads, long-read structural variation detection methods are far from precise in low-coverage cases. To be accurate, they need to use high-coverage data, which in turn results in an extremely time-consuming pipeline, especially in the alignment phase. Therefore, it is of utmost importance to have a structural variation calling pipeline that is both fast and precise for low-coverage data. Results: In this paper, we present SVNN, a fast yet accurate structural variation calling pipeline for PacBio long reads that takes raw reads as input and detects structural variants larger than 50 bp. Our pipeline uses state-of-the-art long-read aligners (NGMLR and Minimap2) and structural variation callers (Sniffles and SVIM). We found that by using a neural network, we can extract features from Minimap2 output to detect a subset of reads that provide useful information for structural variation detection. By mapping only this subset with NGMLR, which is far slower than Minimap2 but better serves downstream structural variation detection, we can increase the sensitivity in an efficient way. As a result of using multiple tools intelligently, SVNN achieves up to 20 percentage points of sensitivity improvement in comparison with state-of-the-art methods and is three times faster than a naive combination of state-of-the-art tools that achieves almost the same accuracy. Conclusion: Since the prohibitive costs of using high-coverage data have impeded long-read applications, with SVNN we provide users with a much faster structural variation detection platform for PacBio reads with high precision and sensitivity in low-coverage scenarios.

Journal Article

Development and validation of a pharmacogenomics reporting workflow based on the Illumina global screening array chip

by Dela, Fitria , Pitaloka, Tessalonika Damaris Ayu , Irwanto, Astrid in Accuracy , Alleles , Cell lines

2024

Background: Microarrays are a well-established and widely adopted technology capable of interrogating hundreds of thousands of loci across the human genome. Combined with imputation to cover common variants not included in the chip design, they offer a cost-effective solution for large-scale genetic studies. Beyond research applications, this technology can be applied to pharmacogenomic and nutrigenetic testing and to complex disease risk prediction. However, establishing clinical reporting workflows requires a thorough evaluation of the assay’s performance, which is achieved through validation studies. In this study, we performed pre-clinical validation of a genetic testing workflow based on the Illumina Global Screening Array for 25 pharmacogenomic-related genes. Methods: To evaluate the accuracy of our workflow, we conducted multiple pre-clinical validation studies. Here, we present the results of accuracy and precision assessments, involving a total of 73 cell lines. These assessments encompass reference materials from the Genome-In-A-Bottle (GIAB) and the Genetic Testing Reference Material Coordination Program (GeT-RM) projects, as well as additional samples from the 1000 Genomes project (1KGP). We conducted an accuracy assessment of genotype calls for target loci in each indication against established truth sets. Results: In our per-sample analysis, we observed a mean analytical sensitivity of 99.39% and a specificity of 99.98%. We further assessed the accuracy of star-allele calls by relying on established diplotypes in the GeT-RM catalogue or calls made based on 1KGP genotyping. On average, we detected a diplotype concordance rate of 96.47% across 14 pharmacogenomic-related genes with star-allele calls. Lastly, we evaluated the reproducibility of our findings across replicates and observed 99.48% diplotype and 100% phenotype inter-run concordance. Conclusion: Our comprehensive validation study demonstrates the robustness and reliability of the developed workflow, supporting its readiness for further development for applied testing.
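
The accuracy measures named in this abstract (analytical sensitivity, specificity, and diplotype concordance) can be illustrated with a minimal Python sketch. The function names, the sample identifiers, and all counts below are hypothetical and chosen only for illustration; they are not the study's data, and the study's own evaluation code may differ.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true variant genotypes that the assay called correctly."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of true reference genotypes that the assay called as reference."""
    return tn / (tn + fp)


def normalize_diplotype(diplotype: str) -> str:
    """Treat '*2/*1' and '*1/*2' as the same diplotype by sorting the alleles."""
    return "/".join(sorted(diplotype.split("/")))


def diplotype_concordance(called: dict[str, str], truth: dict[str, str]) -> float:
    """Share of samples present in both sets whose diplotypes agree."""
    shared = [sample for sample in truth if sample in called]
    matches = sum(
        normalize_diplotype(called[s]) == normalize_diplotype(truth[s])
        for s in shared
    )
    return matches / len(shared)


if __name__ == "__main__":
    # Hypothetical confusion counts for one sample.
    print(f"sensitivity: {sensitivity(tp=990, fn=10):.4f}")
    print(f"specificity: {specificity(tn=9998, fp=2):.4f}")

    # Hypothetical star-allele calls versus a truth set for one gene.
    called = {"sample_a": "*1/*2", "sample_b": "*1/*1"}
    truth = {"sample_a": "*2/*1", "sample_b": "*1/*4"}
    print(f"concordance: {diplotype_concordance(called, truth):.2f}")
```

A per-sample mean, as reported in the abstract, would average these values over all samples.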

Journal Article

Recombination-aware alignment of diploid individuals

by Mäkinen, Veli , Valenzuela, Daniel in Algorithms , Animal Genetics and Genomics , Biomedical and Life Sciences

2014

Background: Traditionally, biological similarity search has been studied under the abstraction of a single string to represent each genome. The more realistic representation of diploid genomes, with two strings defining the genome, has so far been largely omitted in this context. With the development of sequencing techniques and better phasing routines through haplotype assembly algorithms, we are not far from the situation when individual diploid genomes could be represented in their full complexity with a pair-wise alignment defining the genome. Results: We propose a generalization of global alignment that is designed to measure similarity between phased predictions of individual diploid genomes. This generalization takes into account that individual diploid genomes evolve through a mutation and recombination process, and that predictions may be erroneous in both dimensions. Even though our model is generic, we focus on the case where one wants to measure only the similarity of genome content allowing free recombination. This results in efficient algorithms for direct application in (i) evaluation of variation calling predictions and (ii) progressive multiple alignments based on labeled directed acyclic graphs (DAGs) to represent profiles. The latter may be of more general interest, in connection to covering alignment of DAGs. Extensions of our model and algorithms can be foreseen to have applications in evaluating phasing algorithms, as well as a more fundamental role in phasing a child genome based on parent genomes.

Journal Article

Characterization of autosomal copy-number variation in African Americans: the HyperGEN Study

by Wojczynski, Mary K , Broeckel, Ulrich , Pajewski, Nicholas M in 631/114/2415 , 631/208/205/2138 , 631/208/457/649

2011

African Americans are a genetically diverse population with a high burden of many common heritable diseases. However, our understanding of genetic variation in African Americans is substandard because of a lack of published population-based genetic studies. We report the distribution of copy-number variation (CNV) in African Americans collected as part of the Hypertension Genetic Epidemiology Network (HyperGEN) using the Affymetrix 6.0 array and the CNV calling algorithms Birdsuite and PennCNV. We present population estimates of CNV from 446 unrelated African-American subjects randomly selected from the 451 families collected within HyperGEN. Although the majority of CNVs discovered were individually rare, we found the frequency of CNVs to be collectively high. We identified a total of 11,070 CNVs greater than 10 kb passing quality control criteria that were called by both algorithms – leading to an average of 24.8 CNVs per person covering 2214 kb (median). We identified 1541 unique copy-number variable regions, 309 of which did not overlap with the Database of Genomic Variants. These results provide further insight into the distribution of CNV in African Americans.
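
The per-person average quoted in this abstract follows directly from the reported totals, and the idea of keeping only CNVs called by both algorithms can be sketched as an interval-overlap filter. The sketch below is a minimal illustration in Python: the 50% reciprocal-overlap threshold, the function names, and the example intervals are assumptions, since the abstract does not state how agreement between Birdsuite and PennCNV was defined.

```python
# A CNV call is represented as (chromosome, start, end), end exclusive.
Call = tuple[str, int, int]


def reciprocal_overlap(a: Call, b: Call) -> float:
    """Smaller of the two fractions of each call covered by their intersection."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def consensus(calls_a: list[Call], calls_b: list[Call], min_frac: float = 0.5) -> list[Call]:
    """Keep calls from the first set that are supported by some call in the second.

    min_frac is an assumed threshold, not a value taken from the study.
    """
    return [
        a for a in calls_a
        if any(reciprocal_overlap(a, b) >= min_frac for b in calls_b)
    ]


# Totals reported in the abstract.
TOTAL_CNVS = 11070
SUBJECTS = 446
mean_cnvs_per_person = TOTAL_CNVS / SUBJECTS

if __name__ == "__main__":
    print(f"mean CNVs per person: {mean_cnvs_per_person:.1f}")

    # Hypothetical calls from two algorithms on one subject.
    calls_a = [("chr1", 100_000, 150_000), ("chr2", 500_000, 520_000)]
    calls_b = [("chr1", 110_000, 155_000), ("chr3", 500_000, 520_000)]
    print(consensus(calls_a, calls_b))
```

A quadratic scan over the two call sets is enough for a handful of calls per subject; larger sets would normally be sorted or indexed first.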

Journal Article

Extensive sequence duplication in Arabidopsis revealed by pseudo-heterozygosity

by Pisupati, Rahul , Burns, Robin , Rabanal, Fernando A. in Animal Genetics and Genomics , Arabidopsis , Arabidopsis - genetics

2023

Background: It is apparent that genomes harbor much structural variation that is largely undetected for technical reasons. Such variation can cause artifacts when short-read sequencing data are mapped to a reference genome. Spurious SNPs may result from mapping of reads to unrecognized duplicated regions. Calling SNPs using the raw reads of the 1001 Arabidopsis Genomes Project, we identified 3.3 million (44%) heterozygous SNPs. Given that Arabidopsis thaliana (A. thaliana) is highly selfing, and that extensively heterozygous individuals have been removed, we hypothesize that these SNPs reflect cryptic copy number variation. Results: The heterozygosity we observe consists of particular SNPs being heterozygous across individuals in a manner that strongly suggests it reflects shared segregating duplications rather than random tracts of residual heterozygosity due to occasional outcrossing. Focusing on such pseudo-heterozygosity in annotated genes, we use genome-wide association to map the position of the duplicates. We identify 2500 putatively duplicated genes and validate them using de novo genome assemblies from six lines. Specific examples included an annotated gene and nearby transposon that transpose together. We also demonstrate that cryptic structural variation produces highly inaccurate estimates of DNA methylation polymorphism. Conclusions: Our study confirms that most heterozygous SNP calls in A. thaliana are artifacts and suggests that great caution is needed when analyzing SNP data from short-read sequencing. The finding that 10% of annotated genes exhibit copy-number variation, and the realization that neither gene- nor transposon-annotation necessarily tells us what is actually mobile in the genome, suggest that future analyses based on independently assembled genomes will be very informative.

Journal Article

SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies

by Jiao, Wen-Biao , Sun, Hequan , Schneeberger, Korbinian in Animal Genetics and Genomics , Animals , Arabidopsis

2019

Genomic differences range from single nucleotide differences to complex structural variations. Current methods typically annotate sequence differences ranging from SNPs to large indels accurately but do not unravel the full complexity of structural rearrangements, including inversions, translocations, and duplications, where highly similar sequence changes in location, orientation, or copy number. Here, we present SyRI, a pairwise whole-genome comparison tool for chromosome-level assemblies. SyRI starts by finding rearranged regions and then searches for differences in the sequences, which are classified by whether they reside in syntenic or rearranged regions. This distinction is important as rearranged regions are inherited differently compared to syntenic regions.

Journal Article

Paragraph: a graph-based structural variant genotyper for short-read sequence data

by Chen, Sai , Sedlazeck, Fritz J. , Krusche, Peter in Accuracy , Algorithms , ancestry

2019

Accurate detection and genotyping of structural variations (SVs) from short-read data is a long-standing area of development in genomics research and clinical sequencing pipelines. We introduce Paragraph, an accurate genotyper that models SVs using sequence graphs and SV annotations. We demonstrate the accuracy of Paragraph on whole-genome sequence data from three samples using long-read SV calls as the truth set, and then apply Paragraph at scale to a cohort of 100 short-read sequenced samples of diverse ancestry. Our analysis shows that Paragraph has better accuracy than other existing genotypers and can be applied to population-scale studies.

Journal Article

An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar

by Quick, Joshua , Gangavarapu, Karthik , Brackney, Doug E. in Accuracy , Amplicon sequencing , Animal Genetics and Genomics

2019

How viruses evolve within hosts can dictate infection outcomes; however, reconstructing this process is challenging. We evaluate our multiplexed amplicon approach, PrimalSeq, to demonstrate how virus concentration, sequencing coverage, primer mismatches, and replicates influence the accuracy of measuring intrahost virus diversity. We develop an experimental protocol and computational tool, iVar, for using PrimalSeq to measure virus diversity using Illumina and compare the results to Oxford Nanopore sequencing. We demonstrate the utility of PrimalSeq by measuring Zika and West Nile virus diversity from varied sample types and show that the accumulation of genetic diversity is influenced by experimental and biological systems.

Journal Article

Structural variant analysis of a cancer reference cell line sample using multiple sequencing technologies

by Bombardi, Robin , Wang, Limin , Zhao, Yongmei in Algorithms , Animal Genetics and Genomics , BASIC BIOLOGICAL SCIENCES

2022

Background: The cancer genome is commonly altered with thousands of structural rearrangements including insertions, deletions, translocations, inversions, duplications, and copy number variations. Thus, structural variant (SV) characterization plays a paramount role in cancer target identification, oncology diagnostics, and personalized medicine. As part of the SEQC2 Consortium effort, the present study established and evaluated a consensus SV call set using a breast cancer reference cell line and matched normal control derived from the same donor, which were used in our companion benchmarking studies as reference samples. Results: We systematically investigated somatic SVs in the reference cancer cell line by comparing to a matched normal cell line using multiple NGS platforms including Illumina short-read, 10X Genomics linked reads, PacBio long reads, Oxford Nanopore long reads, and high-throughput chromosome conformation capture (Hi-C). We established a consensus SV call set of a total of 1788 SVs including 717 deletions, 230 duplications, 551 insertions, 133 inversions, 146 translocations, and 11 breakends for the reference cancer cell line. To independently evaluate and cross-validate the accuracy of our consensus SV call set, we used orthogonal methods including PCR-based validation, Affymetrix arrays, Bionano optical mapping, and identification of fusion genes detected from RNA-seq. We evaluated the strengths and weaknesses of each NGS technology for SV determination, and our findings provide an actionable guide to improve cancer genome SV detection sensitivity and accuracy. Conclusions: A high-confidence consensus SV call set was established for the reference cancer cell line. A large subset of the variants identified was validated by multiple orthogonal methods.
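
As a quick consistency check on the figures in this abstract, the per-type SV counts can be tallied against the stated total of 1788. The short Python sketch below uses only the numbers reported above; the variable names are illustrative.

```python
# Per-type counts of the consensus SV call set, as reported in the abstract.
sv_counts = {
    "deletions": 717,
    "duplications": 230,
    "insertions": 551,
    "inversions": 133,
    "translocations": 146,
    "breakends": 11,
}

REPORTED_TOTAL = 1788
computed_total = sum(sv_counts.values())

# Share of each SV type within the call set, as a percentage.
sv_shares = {
    sv_type: 100 * count / computed_total
    for sv_type, count in sv_counts.items()
}

if __name__ == "__main__":
    print(f"computed total: {computed_total} (reported: {REPORTED_TOTAL})")
    for sv_type, share in sv_shares.items():
        print(f"{sv_type}: {share:.1f}%")
```

The per-type counts sum to the reported total, so the breakdown in the abstract is internally consistent.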

Journal Article
