Catalogue Search | MBRL

Resolving the complexity of the human genome using single-molecule sequencing

by Boitano, Matthew , Landolin, Jane M. , Stamatoyannopoulos, John A. in 45/23 , 631/208/212/748 , 631/208/726/649/2157

2015

Single-molecule, real-time DNA sequencing is used to analyse a haploid human genome (CHM1), thus closing or extending more than half of the remaining 164 euchromatic gaps in the human genome; the complete sequences of euchromatic structural variants (including inversions, complex insertions and tandem repeats) are resolved at the base-pair level, suggesting that a greater complexity of the human genome can now be accessed.

Deep-sequencing the human genome
The human genome is considered sequenced, yet more than 160 euchromatic gaps remain and many aspects of its structural variation are poorly understood. Evan Eichler and colleagues sequenced and analysed a haploid human genome (CHM1) using single-molecule, real-time (SMRT) DNA sequencing and by doing so closed, or in some cases extended, more than half of the remaining gaps. They also resolved the complete sequence of numerous euchromatic structural variants at the base-pair level, revealing inversions, complex insertions and long tracts of tandem repeats, some of them previously unknown. Thanks to the longer-read sequencing technology applied here, the complexity of the human genome that stems from variation of longer and more complex repetitive DNA can now be largely resolved.

The human genome is arguably the most complete mammalian reference assembly [1, 2, 3], yet more than 160 euchromatic gaps remain [4, 5, 6] and aspects of its structural variation remain poorly understood ten years after its completion [7, 8, 9]. To identify missing sequence and genetic variation, here we sequence and analyse a haploid human genome (CHM1) using single-molecule, real-time DNA sequencing [10]. We close or extend 55% of the remaining interstitial gaps in the human GRCh37 reference genome, 78% of which carried long runs of degenerate short tandem repeats, often several kilobases in length, embedded within (G+C)-rich genomic regions. We resolve the complete sequence of 26,079 euchromatic structural variants at the base-pair level, including inversions, complex insertions and long tracts of tandem repeats. Most have not been previously reported, with the greatest increases in sensitivity occurring for events less than 5 kilobases in size. Compared to the human reference, we find a significant insertional bias (3:1) in regions corresponding to complex insertions and long short tandem repeats. Our results suggest a greater complexity of the human genome in the form of variation of longer and more complex repetitive DNA that can now be largely resolved with the application of this longer-read sequencing technology.

Journal Article

Highly accurate long-read HiFi sequencing data for five complex genomes

by Tsai, Yu-Chih , Rank, David R , Kudrna, David in Algorithms , Datasets , Deoxyribonucleic acid

2020

The PacBio® HiFi sequencing method yields highly accurate long-read sequencing datasets with read lengths averaging 10–25 kb and accuracies greater than 99.5%. These accurate long reads can be used to improve results for complex applications such as single nucleotide and structural variant detection, genome assembly, assembly of difficult polyploid or highly repetitive genomes, and assembly of metagenomes. Currently, there is a need for sample data sets to both evaluate the benefits of these long accurate reads as well as for development of bioinformatic tools including genome assemblers, variant callers, and haplotyping algorithms. We present deep coverage HiFi datasets for five complex samples including the two inbred model genomes Mus musculus and Zea mays, as well as two complex genomes, octoploid Fragaria × ananassa and the diploid anuran Rana muscosa. Additionally, we release sequence data from a mock metagenome community. The datasets reported here can be used without restriction to develop new algorithms and explore complex genome structure and evolution. Data were generated on the PacBio Sequel II System.

Measurement(s): DNA • genome • Metagenome
Technology Type(s): DNA sequencing • PacBio Sequel System
Factor Type(s): organism that had its genome sequenced
Sample Characteristic - Organism: Mus musculus • Rana muscosa • Fragaria x ananassa • Zea mays
Machine-accessible metadata file describing the reported data: 10.6084/m9.figshare.12855527

Journal Article

Assembling large genomes with single-molecule sequencing and locality-sensitive hashing

by Koren, Sergey , Berlin, Konstantin , Landolin, Jane M in 45/23 , 631/208/2156 , 631/208/726/2001/1428

2015

An assembly algorithm that overlaps noisy long reads enables accurate and fast assembly of large genomes from single-molecule real-time sequences.

Long-read, single-molecule real-time (SMRT) sequencing is routinely used to finish microbial genomes, but available assembly methods have not scaled well to larger genomes. We introduce the MinHash Alignment Process (MHAP) for overlapping noisy, long reads using probabilistic, locality-sensitive hashing. Integrating MHAP with the Celera Assembler enabled reference-grade de novo assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster and a human hydatidiform mole cell line (CHM1) from SMRT sequencing. The resulting assemblies are highly continuous, include fully resolved chromosome arms and close persistent gaps in these reference genomes. Our assembly of D. melanogaster revealed previously unknown heterochromatic and telomeric transition sequences, and we assembled low-complexity sequences from CHM1 that fill gaps in the human GRCh38 reference. Using MHAP and the Celera Assembler, single-molecule sequencing can produce de novo near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.
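
The abstract above names MinHash, a locality-sensitive hashing scheme, as the basis for finding overlapping reads. The following is a minimal sketch of the general MinHash idea only, not the MHAP implementation: it estimates the Jaccard similarity of two reads' k-mer sets from fixed-size sketches. The function names, the k-mer size (8), the sketch size (128) and the use of BLAKE2b as the hash family are all assumptions made for illustration; the record does not state MHAP's actual parameters or filtering steps.

```python
import hashlib
import random


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def stable_hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function."""
    digest = hashlib.blake2b(f"{seed}:{kmer}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=8, num_hashes=128):
    """Sketch a sequence as the minimum hash value under each of num_hashes functions.

    k and num_hashes are illustrative defaults, not values taken from MHAP.
    """
    kms = kmers(seq, k)
    if not kms:
        raise ValueError("sequence is shorter than k")
    return [min(stable_hash(km, seed) for km in kms) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; this estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(seq_a, seq_b, k=8):
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two hypothetical error-free reads drawn from one random sequence,
    # overlapping by 600 bases.
    rng = random.Random(0)
    genome = "".join(rng.choice("ACGT") for _ in range(2000))
    read_a, read_b = genome[:1200], genome[600:1800]
    est = estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))
    print("estimated:", est, "exact:", exact_jaccard(read_a, read_b))
```

Comparing two fixed-length sketches costs the same regardless of read length, which is why a sketch-based filter can cheaply shortlist candidate overlaps before any alignment is attempted. Real long reads contain errors that lower the shared k-mer fraction, so a practical tool would need thresholds tuned to the error rate; none are assumed here.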

Journal Article

Long-read, whole-genome shotgun sequence data for five model organisms

by Babayan, Primo , Rapicavoli, Nicole A , Rank, David R in 631/1647/334 , 631/1647/514/1948 , 631/61/212

2014

Single molecule, real-time (SMRT) sequencing from Pacific Biosciences is increasingly used in many areas of biological research including de novo genome assembly, structural-variant identification, haplotype phasing, mRNA isoform discovery, and base-modification analyses. High-quality, public datasets of SMRT sequences can spur development of analytic tools that can accommodate unique characteristics of SMRT data (long read lengths, lack of GC or amplification bias, and a random error profile leading to high consensus accuracy). In this paper, we describe eight high-coverage SMRT sequence datasets from five organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster) that have been publicly released to the general scientific community (NCBI Sequence Read Archive ID SRP040522). Data were generated using two sequencing chemistries (P4C2 and P5C3) on the PacBio RS II instrument. The datasets reported here can be used without restriction by the research community to generate whole-genome assemblies, test new algorithms, investigate genome structure and evolution, and identify base modifications in some of the most widely-studied model systems in biological research.

Design Type(s): observation design • genome sequencing • Shotgun Sequencing
Measurement Type(s): DNA sequencing
Technology Type(s): PacBio RS II
Factor Type(s):
Sample Characteristic(s): Escherichia coli str. K-12 substr. MG1655 • Saccharomyces cerevisiae W303 • Neurospora crassa OR74A • Neurospora crassa • Arabidopsis thaliana • Drosophila melanogaster
Machine-accessible metadata file describing the reported data (ISA-Tab format)

Journal Article

The developmental transcriptome of Drosophila melanogaster

by Artieri, Carlo G. , Landolin, Jane M. , Langton, Laura in 631/136/334/1582/715 , 631/208/212/2019 , Alternative Splicing - genetics

2011

Drosophila melanogaster is one of the most well studied genetic model organisms; nonetheless, its genome still contains unannotated coding and non-coding genes, transcripts, exons and RNA editing sites. Full discovery and annotation are pre-requisites for understanding how the regulation of transcription, splicing and RNA editing directs the development of this complex organism. Here we used RNA-Seq, tiling microarrays and cDNA sequencing to explore the transcriptome in 30 distinct developmental stages. We identified 111,195 new elements, including thousands of genes, coding and non-coding transcripts, exons, splicing and editing events, and inferred protein isoforms that previously eluded discovery using established experimental, prediction and conservation-based approaches. These data substantially expand the number of known transcribed elements in the Drosophila genome and provide a high-resolution view of transcriptome dynamics throughout development.

Elements of gene function
Three papers in this issue of Nature report on the modENCODE initiative, which aims to characterize functional DNA elements in the fruitfly Drosophila melanogaster and the roundworm Caenorhabditis elegans. Kharchenko et al. present a genome-wide chromatin landscape of the fruitfly, based on 18 histone modifications. They describe nine prevalent chromatin states. Integrating these analyses with other data types reveals individual characteristics of different genomic elements. Graveley et al. have used RNA-Seq, tiling microarrays and cDNA sequencing to explore the transcriptome in 30 distinct developmental stages of the fruitfly. Among the results are scores of new genes, coding and non-coding transcripts, as well as splicing and editing events. Finally, Nègre et al. have produced a map of the regulatory part of the fruitfly genome, defining a vast array of putative regulatory elements, such as enhancers, promoters, insulators and silencers.

As part of the modENCODE initiative, which aims to characterize functional DNA elements in D. melanogaster and C. elegans, this study uses RNA-Seq, tiling microarrays and cDNA sequencing to explore the transcriptome in 30 distinct developmental stages of the fruitfly. Among the results are scores of new genes, coding and non-coding transcripts, as well as splicing and editing events.

Journal Article

Erratum: Corrigendum: Assembling large genomes with single-molecule sequencing and locality-sensitive hashing

by Koren, Sergey , Berlin, Konstantin , Landolin, Jane M

2015

Journal Article

Correction: Corrigendum: Assembling large genomes with single-molecule sequencing and locality-sensitive hashing

by Koren, Sergey , Berlin, Konstantin , Chin, Chen-Shan in Agriculture , Bioinformatics , Biomedical and Life Sciences

2015

Nat. Biotechnol. 33, 623–630 (2015); published online 25 May 2015; corrected after print 6 October 2015.

In the version of this article initially published, equation 9 appeared incorrectly as: The equation has been corrected in the HTML and PDF versions of the article.

Journal Article

Corrigendum: Assembling large genomes with single-molecule sequencing and locality-sensitive hashing

by Koren, Sergey , Berlin, Konstantin , Landolin, Jane M

2015

Journal Article

Resolving the complexity of the human genome using single-molecule sequencing

by Huddleston, John , Dennis, Megan Y. , Malig, Maika in DNA sequencing , Genetic research , Human genome

2015

Single-molecule, real-time DNA sequencing is used to analyse a haploid human genome (CHM1), thus closing or extending more than half of the remaining 164 euchromatic gaps in the human genome; the complete sequences of euchromatic structural variants (including inversions, complex insertions and tandem repeats) are resolved at the base-pair level, suggesting that a greater complexity of the human genome can now be accessed.

Journal Article

Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing

by Drake, James , Koren, Sergey , Chin, Chen-Shan in Bioinformatics , DNA sequencing , Eukaryotes

2014

We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.

Paper
