Catalogue Search | MBRL

Automated assembly of centromeres from ultra-long error-prone reads

by Pevzner, Pavel A.; Bzikadze, Andrey V. in 631/114/2785/2302, 631/208, Agriculture

2020

Centromeric variation has been linked to cancer and infertility, but centromere sequences contain multiple tandem repeats and can only be assembled manually from long error-prone reads. Here we describe the centroFlye algorithm for centromere assembly using long error-prone reads, and apply it to assemble human centromeres on chromosomes 6 and X. Our analyses reveal putative breakpoints in the manual reconstruction of the human X centromere, demonstrate that human X chromosome is partitioned into repeat subfamilies and provide initial insights into centromere evolution. We anticipate that centroFlye could be applied to automatically close remaining multimegabase gaps in the reference human genome. CentroFlye resolves tandem repeats to assemble human centromeres from nanopore reads.

Journal Article

A scalable model for simulating multi-round antibody evolution and benchmarking of clonal tree reconstruction methods

by Safonova, Yana; Zhang, Chao; Bzikadze, Andrey V. in Algorithms, Antibodies, antibody evolution

2022

Affinity maturation (AM) of B cells through somatic hypermutations (SHMs) enables the immune system to evolve to recognize diverse pathogens. The accumulation of SHMs leads to the formation of clonal lineages of antibody-secreting B cells that have evolved from a common naïve B cell. Advances in high-throughput sequencing have enabled deep scans of B cell receptor repertoires, paving the way for reconstructing clonal trees. However, it is not clear if clonal trees, which capture microevolutionary time scales, can be reconstructed using traditional phylogenetic reconstruction methods with adequate accuracy. In fact, several clonal tree reconstruction methods have been developed to fix supposed shortcomings of phylogenetic methods. Nevertheless, no consensus has been reached regarding the relative accuracy of these methods, partially because evaluation is challenging. Benchmarking the performance of existing methods and developing better methods would both benefit from realistic models of clonal lineage evolution specifically designed for emulating B cell evolution. In this paper, we propose a model for simulating B cell clonal lineage evolution and use this model to benchmark several existing clonal tree reconstruction methods. Our model, designed to be extensible, has several features: by evolving the clonal tree and sequences simultaneously, it allows modeling selective pressure due to changes in affinity binding; it enables scalable simulations of large numbers of cells; it enables several rounds of infection by an evolving pathogen; and it models building of memory. In addition, we suggest a set of metrics for comparing clonal trees and measuring their properties. Our results show that while maximum likelihood phylogenetic reconstruction methods can fail to capture key features of clonal tree expansion if applied naively, a simple post-processing of their results, where short branches are contracted, leads to inferences that are better than alternative methods.
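
The post-processing step mentioned in the abstract, contracting short branches, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tree representation (nested dictionaries with hypothetical keys `name`, `length` and `children`), the `min_length` threshold, and the choice to contract only internal branches are all assumptions made for illustration.

```python
def contract_short_branches(node, min_length):
    """Return a copy of the tree in which short internal branches are removed.

    Each node is a dict: {"name": str, "length": float, "children": [nodes]}.
    An internal child whose branch is shorter than min_length is deleted and
    its children are attached directly to the current node (assumed scheme).
    """
    new_children = []
    for child in node["children"]:
        contracted = contract_short_branches(child, min_length)
        if contracted["children"] and contracted["length"] < min_length:
            # Splice the grandchildren onto this node, preserving path length.
            for grandchild in contracted["children"]:
                merged = dict(grandchild)
                merged["length"] = grandchild["length"] + contracted["length"]
                new_children.append(merged)
        else:
            new_children.append(contracted)
    return {"name": node["name"], "length": node["length"], "children": new_children}


def leaf(name, length):
    """Helper for building a childless node."""
    return {"name": name, "length": length, "children": []}


tree = {
    "name": "root",
    "length": 0.0,
    "children": [
        {"name": "X", "length": 0.001, "children": [leaf("A", 0.1), leaf("B", 0.2)]},
        leaf("C", 0.3),
    ],
}
result = contract_short_branches(tree, min_length=0.01)
child_names = [child["name"] for child in result["children"]]
```

In this toy tree the near-zero branch leading to `X` is removed, so `A`, `B` and `C` all become direct children of the root, turning a weakly supported bifurcation into a multifurcation.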

Journal Article

Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads

by Antipov, Dmitry; Pevzner, Pavel A.; Kolmogorov, Mikhail in 631/114/2785/2302, 631/61/212/2302, Accuracy

2022

Although most existing genome assemblers are based on de Bruijn graphs, the construction of these graphs for large genomes and large k-mer sizes has remained elusive. This algorithmic challenge has become particularly pressing with the emergence of long, high-fidelity (HiFi) reads that have been recently used to generate a semi-manual telomere-to-telomere assembly of the human genome. To enable automated assemblies of long, HiFi reads, we present the La Jolla Assembler (LJA), a fast algorithm using the Bloom filter, sparse de Bruijn graphs and disjointig generation. LJA reduces the error rate in HiFi reads by three orders of magnitude, constructs the de Bruijn graph for large genomes and large k-mer sizes and transforms it into a multiplex de Bruijn graph with varying k-mer sizes. Compared to state-of-the-art assemblers, our algorithm not only achieves five-fold fewer misassemblies but also generates more contiguous assemblies. We demonstrate the utility of LJA via the automated assembly of a human genome that completely assembled six chromosomes. A multiplex de Bruijn graph algorithm allows high-accuracy genome assembly from long, high-fidelity reads.
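
For readers unfamiliar with the data structure, the following minimal Python sketch builds a plain de Bruijn graph from a set of reads. It is a textbook construction shown only for orientation and is not how LJA works: the Bloom filter, sparse graphs, disjointigs and multiplex transformation named in the abstract are not modeled, and the function name and toy reads are illustrative assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Returns a dict mapping each (k-1)-mer prefix to the set of (k-1)-mer
    suffixes that follow it in at least one read.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two overlapping toy reads; with k = 3 the nodes are 2-mers.
graph = de_bruijn_graph(["ACGTAC", "GTACG"], k=3)
```

Holding every k-mer in memory, as this sketch does, is exactly what becomes impractical for large genomes and large k, which is the problem the abstract says LJA addresses.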

Journal Article

UniAligner: a parameter-free framework for fast sequence alignment

by Pevzner, Pavel A.; Bzikadze, Andrey V. in 631/114/2785, 631/1647/794, 631/208/212/748

2023

Even though the recent advances in ‘complete genomics’ revealed the previously inaccessible genomic regions, analysis of variations in centromeres and other extra-long tandem repeats (ETRs) faces an algorithmic challenge since there are currently no tools for accurate sequence comparison of ETRs. Counterintuitively, the classical alignment approaches, such as the Smith–Waterman algorithm, fail to construct biologically adequate alignments of ETRs. We present UniAligner—the parameter-free sequence alignment algorithm with sequence-dependent alignment scoring that automatically changes for any pair of compared sequences. UniAligner prioritizes matches of rare substrings that are more likely to be relevant to the evolutionary relationship between two sequences. We apply UniAligner to estimate the mutation rates in human centromeres, and quantify the extremely high rate of large duplications and deletions in centromeres. This high rate suggests that centromeres may represent some of the most rapidly evolving regions of the human genome with respect to their structural organization. Compared to other sequences, extra-long tandem repeats, such as centromeres and immunoglobulin loci, are more difficult to align. This study presents UniAligner, a computational method for efficiently and accurately aligning extra-long tandem repeats, facilitating analysis of their variation and evolution.
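
The idea of prioritizing rare substrings can be illustrated with a short Python sketch that finds k-mers occurring exactly once in each of two sequences. This is only a simplified stand-in for the concept, not UniAligner's algorithm or scoring; the function name, the fixed k and the toy sequences are assumptions.

```python
from collections import Counter


def shared_unique_kmers(a, b, k):
    """Return (pos_in_a, pos_in_b, kmer) for k-mers unique in both sequences."""
    kmers_a = [a[i:i + k] for i in range(len(a) - k + 1)]
    kmers_b = [b[j:j + k] for j in range(len(b) - k + 1)]
    count_a = Counter(kmers_a)
    count_b = Counter(kmers_b)
    # For a k-mer that occurs once in b, this records its only position.
    pos_b = {kmer: j for j, kmer in enumerate(kmers_b)}
    anchors = []
    for i, kmer in enumerate(kmers_a):
        if count_a[kmer] == 1 and count_b[kmer] == 1:
            anchors.append((i, pos_b[kmer], kmer))
    return anchors


# Toy tandem repeats of "AC" with a "GT" insertion at different offsets.
anchors = shared_unique_kmers("ACACACGTACAC", "ACACGTACACAC", k=4)
anchor_kmers = {kmer for _, _, kmer in anchors}
```

In the toy input the repeated k-mer `ACAC` is ignored because it occurs several times in both sequences, while k-mers spanning the `GT` insertion are unique and can serve as candidate anchors.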

Journal Article

The structure, function and evolution of a complete human chromosome 8

by Sorensen, Melanie; Mikheenko, Alla; Jain, Chirag in 13/106, 14/19, 14/32

2021

The complete assembly of each human chromosome is essential for understanding human biology and evolution [1, 2]. Here we use complementary long-read sequencing technologies to complete the linear assembly of human chromosome 8. Our assembly resolves the sequence of five previously long-standing gaps, including a 2.08-Mb centromeric α-satellite array, a 644-kb copy number polymorphism in the β-defensin gene cluster that is important for disease risk, and an 863-kb variable number tandem repeat at chromosome 8q21.2 that can function as a neocentromere. We show that the centromeric α-satellite array is generally methylated except for a 73-kb hypomethylated region of diverse higher-order α-satellites enriched with CENP-A nucleosomes, consistent with the location of the kinetochore. In addition, we confirm the overall organization and methylation pattern of the centromere in a diploid human genome. Using a dual long-read sequencing approach, we complete high-quality draft assemblies of the orthologous centromere from chromosome 8 in chimpanzee, orangutan and macaque to reconstruct its evolutionary history. Comparative and phylogenetic analyses show that the higher-order α-satellite structure evolved in the great ape ancestor with a layered symmetry, in which more ancient higher-order repeats locate peripherally to monomeric α-satellites. We estimate that the mutation rate of centromeric satellite DNA is accelerated by more than 2.2-fold compared to the unique portions of the genome, and this acceleration extends into the flanking sequence. The complete assembly of human chromosome 8 resolves previous gaps and reveals hidden complex forms of genetic variation, enabling functional and evolutionary characterization of primate centromeres.

Journal Article

Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies

by Sović, Ivan; Koren, Sergey; Wood, Jonathan M. D. in 631/114/2785, 631/1647/794, 631/208/212/2302

2022

Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first telomere-to-telomere human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Although derived from highly accurate sequences, evaluation revealed evidence of small errors and structural misassemblies in the initial draft assembly. To correct these errors, we designed a new repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly quality value from 70.2 to 73.9 measured from PacBio high-fidelity and Illumina k-mers. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both high-fidelity and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies. The work describes the validation and polishing strategies developed by the telomere-to-telomere consortium for evaluating and improving the first complete human genome assembly.
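
To put the quality values in context: assembly QV is conventionally reported on the Phred scale, QV = -10 * log10(p), where p is the estimated per-base error probability. The short Python sketch below converts between the two, assuming that convention applies to the figures quoted in the abstract; the function names are illustrative.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


# Estimated per-base error probabilities before and after polishing.
before = qv_to_error_rate(70.2)
after = qv_to_error_rate(73.9)
```

Under this convention, QV 70 corresponds to roughly one error per ten million bases, so both values in the abstract describe fewer than one error per ten million bases.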

Journal Article

The complete sequence of a human Y chromosome

by Sedlazeck, Fritz J.; Watwood, Allison C.; Grady, Patrick G. S. in 119/118, 14/32, 45/23

2023

The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications [1–3]. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished [4, 5]. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region. We have combined T2T-Y with a previous assembly of the CHM13 genome [4] and mapped available population variation, clinical variants and functional genomics data to produce a complete and comprehensive reference sequence for all 24 human chromosomes. We present the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference.

Journal Article

Human Centromeres: From Initial Assemblies to Structural and Evolutionary Analysis

by Bzikadze, Andrey V in Bioinformatics, Systematic biology

2022

Recent advances in long-read sequencing technologies allowed generation of the first complete assembly of a human genome. They revealed previously inaccessible sequences of human centromeres and allowed analysis of their structure and evolution. We introduce centroFlye—the first algorithm for automated assembly of centromeres from error-prone long reads. We then describe TandemTools and VerityMap algorithms for quality assessment of the newly assembled regions. Afterwards, we present StringDecomposer, CentromereArchitect, and HORmon algorithms for structural and evolutionary analysis of human centromeres. We introduce LJA—the first de Bruijn-based genome assembler for accurate long reads. Finally, we describe TandemAligner—the first parameter-free sequence alignment algorithm that introduces a sequence-dependent scoring that automatically changes for any pair of compared sequences.

Dissertation

Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies

by menti, Giulio; Koren, Sergey; Jain, Chirag in Genomes, Genomics, Hydatidiform mole

2021

Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first Telomere-to-Telomere (T2T) human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Though derived from highly accurate sequencing, evaluation revealed that the initial T2T draft assembly had evidence of small errors and structural misassemblies. To correct these errors, we designed a novel repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly QV to 73.9. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both PacBio HiFi and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.

Competing Interest Statement: Ivan Sovic is employed by Pacific BioSciences Inc.

Paper

TandemAligner: a new parameter-free framework for fast sequence alignment

by Pevzner, Pavel A; Bzikadze, Andrey V in Algorithms, Bioinformatics, Centromeres

2022

The recent advances in "complete genomics" revealed the previously inaccessible genomic regions (such as centromeres) and enabled analysis of their associations with diseases. However, analysis of variations in centromeres, immunoglobulin loci, and other extra-long tandem repeats (ETRs) faces an algorithmic bottleneck since there are currently no tools for accurate sequence comparison of ETRs. Counterintuitively, the classical alignment approaches, such as the Smith-Waterman algorithm, that work well for most sequences, fail to construct biologically adequate alignments of ETRs. This limitation was overlooked in previous studies since the ETR sequences across multiple genomes only became available in the last year. We present TandemAligner - the first parameter-free sequence alignment algorithm that introduces a sequence-dependent alignment scoring that automatically changes for any pair of compared sequences. We apply TandemAligner to various human centromeres and primate immunoglobulin loci, arrive at the first accurate estimate of the mutation rates in human centromeres, and quantify the extremely high rate of large insertions/duplications in centromeres. This extremely high rate (that the standard alignment algorithms fail to uncover) suggests that centromeres represent the most rapidly evolving regions of the human genome with respect to their structural organization.

Competing Interest Statement: The authors have declared no competing interest.

Footnotes
* https://github.com/seryrzu/tandem_aligner
* https://zenodo.org/record/7058133

Paper
