Catalogue Search | MBRL

Resolving the complexity of the human genome using single-molecule sequencing

by Boitano, Matthew , Landolin, Jane M. , Stamatoyannopoulos, John A. in 45/23 , 631/208/212/748 , 631/208/726/649/2157

2015

Single-molecule, real-time DNA sequencing is used to analyse a haploid human genome (CHM1), thus closing or extending more than half of the remaining 164 euchromatic gaps in the human genome; the complete sequences of euchromatic structural variants (including inversions, complex insertions and tandem repeats) are resolved at the base-pair level, suggesting that a greater complexity of the human genome can now be accessed.

Deep-sequencing the human genome

The human genome is considered sequenced, yet more than 160 euchromatic gaps remain and many aspects of its structural variation are poorly understood. Evan Eichler and colleagues sequenced and analysed a haploid human genome (CHM1) using single-molecule, real-time (SMRT) DNA sequencing and by doing so closed, or in some cases extended, more than half of the remaining gaps. They also resolved the complete sequence of numerous euchromatic structural variants at the base-pair level, revealing inversions, complex insertions and long tracts of tandem repeats, some of them previously unknown. Thanks to the longer-read sequencing technology applied here, the complexity of the human genome that stems from variation of longer and more complex repetitive DNA can now be largely resolved.

The human genome is arguably the most complete mammalian reference assembly [1, 2, 3], yet more than 160 euchromatic gaps remain [4, 5, 6] and aspects of its structural variation remain poorly understood ten years after its completion [7, 8, 9]. To identify missing sequence and genetic variation, here we sequence and analyse a haploid human genome (CHM1) using single-molecule, real-time DNA sequencing [10]. We close or extend 55% of the remaining interstitial gaps in the human GRCh37 reference genome, 78% of which carried long runs of degenerate short tandem repeats, often several kilobases in length, embedded within (G+C)-rich genomic regions. We resolve the complete sequence of 26,079 euchromatic structural variants at the base-pair level, including inversions, complex insertions and long tracts of tandem repeats. Most have not been previously reported, with the greatest increases in sensitivity occurring for events less than 5 kilobases in size. Compared to the human reference, we find a significant insertional bias (3:1) in regions corresponding to complex insertions and long short tandem repeats. Our results suggest a greater complexity of the human genome in the form of variation of longer and more complex repetitive DNA that can now be largely resolved with the application of this longer-read sequencing technology.

Journal Article

Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome

by Ebler, Jana , Schatz, Michael C , Rank, David R in Assembly , Deoxyribonucleic acid , DNA sequencing

2019

The DNA sequencing technologies in use today produce either highly accurate short reads or less-accurate long reads. We report the optimization of circular consensus sequencing (CCS) to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb). We applied our approach to sequence the well-characterized human HG002/NA24385 genome and obtained precision and recall rates of at least 99.91% for single-nucleotide variants (SNVs), 95.98% for insertions and deletions <50 bp (indels) and 95.99% for structural variants. Our CCS method matches or exceeds the ability of short-read sequencing to detect small variants and structural variants. We estimate that 2,434 discordances are correctable mistakes in the ‘genome in a bottle’ (GIAB) benchmark set. Nearly all (99.64%) variants can be phased into haplotypes, further improving variant detection. De novo genome assembly using CCS reads alone produced a contiguous and accurate genome with a contig N50 of >15 megabases (Mb) and concordance of 99.997%, substantially outperforming assembly with less-accurate long reads.

Journal Article

De novo assembly and phasing of a Korean human genome

by Cao, Han , Shin, Jong-Yeon , Kim, Jongbum in 45/23 , 45/91 , 631/208/726

2016

De novo assembly and phasing of the genome of an individual from Korea using a combination of different sequencing approaches provides a useful population-specific reference genome and represents the most contiguous human genome assembly so far.

A Korean human genome

Jeong-Sun Seo and colleagues report de novo assembly and phasing of the genome of an individual from Korea using a combination of PacBio long-read sequencing, Illumina short-read sequencing, 10X Genomics linked reads, bacterial artificial chromosome (BAC) sequencing and BioNano Genomics optical mapping. This provides a useful population-specific reference genome and represents the most contiguous human genome assembly to date. The authors use this to close gaps in the human reference genome and map structural variation.

Advances in genome assembly and phasing provide an opportunity to investigate the diploid architecture of the human genome and reveal the full range of structural variation across population groups. Here we report the de novo assembly and haplotype phasing of the Korean individual AK1 [1] using single-molecule real-time sequencing [2], next-generation mapping [3], microfluidics-based linked reads [4], and bacterial artificial chromosome (BAC) sequencing approaches. Single-molecule sequencing coupled with next-generation mapping generated a highly contiguous assembly, with a contig N50 size of 17.9 Mb and a scaffold N50 size of 44.8 Mb, resolving 8 chromosomal arms into single scaffolds. The de novo assembly, along with local assemblies and spanning long reads, closes 105 and extends into 72 out of 190 euchromatic gaps in the reference genome, adding 1.03 Mb of previously intractable sequence. High concordance between the assembly and paired-end sequences from 62,758 BAC clones provides strong support for the robustness of the assembly. We identify 18,210 structural variants by direct comparison of the assembly with the human reference, identifying thousands of breakpoints that, to our knowledge, have not been reported before. Many of the insertions are reflected in the transcriptome and are shared across the Asian population. We performed haplotype phasing of the assembly with short reads, long reads and linked reads from whole-genome sequencing and with short reads from 31,719 BAC clones, thereby achieving phased blocks with an N50 size of 11.6 Mb. Haplotigs assembled from single-molecule real-time reads assigned to haplotypes on phased blocks covered 89% of genes. The haplotigs accurately characterized the hypervariable major histocompatibility complex region as well as demonstrating allele configuration in clinically relevant genes such as CYP2D6. This work presents the most contiguous diploid human genome assembly so far, with extensive investigation of unreported and Asian-specific structural variants, and high-quality haplotyping of clinically relevant alleles for precision medicine.

Journal Article

Whole-Genome Shotgun Assembly and Comparison of Human Genome Assemblies

by Bolanos, Randall , Venter, J. Craig , Sutton, Granger G. in Bioinformatics , Biological Sciences , Chromosomes

2004

We report a whole-genome shotgun assembly (called WGSA) of the human genome generated at Celera in 2001. The Celera-generated shotgun data set consisted of 27 million sequencing reads organized in pairs by virtue of end-sequencing 2-kbp, 10-kbp, and 50-kbp inserts from shotgun clone libraries. The quality-trimmed reads covered the genome 5.3 times, and the inserts from which pairs of reads were obtained covered the genome 39 times. With the nearly complete human DNA sequence [National Center for Biotechnology Information (NCBI) Build 34] now available, it is possible to directly assess the quality, accuracy, and completeness of WGSA and of the first reconstructions of the human genome reported in two landmark papers in February 2001 [Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304-1351; International Human Genome Sequencing Consortium (2001) Nature 409, 860-921]. The analysis of WGSA shows 97% order and orientation agreement with NCBI Build 34, where most of the 3% of sequence out of order is due to scaffold placement problems as opposed to assembly errors within the scaffolds themselves. In addition, WGSA fills some of the remaining gaps in NCBI Build 34. The early genome sequences all covered about the same amount of the genome, but they did so in different ways. The Celera results provide more order and orientation, and the consortium sequence provides better coverage of exact and nearly exact repeats.

Journal Article

Rat Transforming Growth Factor Type 1: Structure and Relation to Epidermal Growth Factor

by Todaro, George J. , Hood, Leroy E. , Marquardt, Hans in 3T3 cells , Amino Acid Sequence , Amino acids

1984

The complete amino acid sequence of rat transforming growth factor type 1 has been determined. This growth factor, obtained from retrovirus-transformed fibroblasts, is structurally and functionally related to mouse epidermal growth factor and human urogastrone. Production of this polypeptide by various neoplastic cells might contribute to the continued expression of the transformed phenotype.

Journal Article

Protein Sequence Analysis: Automated Microsequencing

by Hunkapiller, Michael W. , Hood, Leroy E. in Amino Acid Sequence , Amino acids , Autoanalysis - instrumentation

1983

The automated microsequencing of proteins can now be carried out at the 5- to 10-picomole (submicrogram) level on polypeptides obtained directly from one- and two-dimensional gel electrophoresis. The techniques are applicable to polypeptides ranging in size from small peptides (less than 10 residues) to large proteins (more than 1000 residues).

Journal Article

Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads

by Eichler, Evan E , Sulovari, Arvis , Kronenberg, Zev N in Computer applications , Fidelity , Genomes

2019

The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective stand-alone technology for de novo assembly of human genomes.

Paper

Highly-accurate long-read sequencing improves variant detection and assembly of a human genome

by Ebler, Jana , Schatz, Michael C , Rank, David R in DNA sequencing , Genomes , Genomics

2019

The major DNA sequencing technologies in use today produce either highly-accurate short reads or noisy long reads. We developed a protocol based on single-molecule, circular consensus sequencing (CCS) to generate highly-accurate (99.8%) long reads averaging 13.5 kb and applied it to sequence the well-characterized human HG002/NA24385. We optimized existing tools to comprehensively detect variants, achieving precision and recall above 99.91% for SNVs, 95.98% for indels, and 95.99% for structural variants. We estimate that 2,434 discordances are correctable mistakes in the high-quality Genome in a Bottle benchmark. Nearly all (99.64%) variants are phased into haplotypes, which further improves variant detection. De novo assembly produces a highly contiguous and accurate genome with contig N50 above 15 Mb and concordance of 99.998%. CCS reads match short reads for small variant detection, while enabling structural variant detection and de novo assembly at similar contiguity and markedly higher concordance than noisy long reads.

Paper

De novo assembly and phasing of a Korean human genome

by Cao, Han , Shin, Jong-Yeon , Kim, Jongbum in DNA sequencing , Methods

2016

Journal Article
