Catalogue Search | MBRL
Search Results Heading
Explore the vast range of titles available.
MBRLSearchResults
-
DisciplineDiscipline
-
Is Peer ReviewedIs Peer Reviewed
-
Item TypeItem Type
-
SubjectSubject
-
YearFrom:-To:
-
More FiltersMore FiltersSourceLanguage
Done
Filters
Reset
1,516
result(s) for
"631/114/2785"
Sort by:
Genetic variation and the de novo assembly of human genomes
by
Eichler, Evan E.
,
Wilson, Richard K.
,
Chaisson, Mark J. P.
in
631/114/2785
,
631/114/2785/2302
,
631/208/212/2301
2015
Key Points
Complete
de novo
assembly of a genome is guaranteed to allow assessment of the full range of genetic variation, although the only mammalian genome assemblies completed to date are for human and mouse. Assemblies using massively parallel sequencing (MPS) have increased the diversity of draft genomes that are available but do not completely resolve genomes.
When designing a
de novo
assembly project, the most-suitable assembly approach to use differs depending on the characteristics of the sequencing reads. MPS methods have relied on de Bruijn graphs, whereas single-molecule sequencing (SMS) reads require pairwise overlaps encoded in overlap or string graphs.
A component of 'missing heritability' is missed sequence variation. Approximately 5–40 Mb of sequence are absent from any given human reference genome owing to structural polymorphism, and standard resequencing has missed detection of diseases such as medullary cystic kidney disease type 1, amyotrophic lateral sclerosis and facioscapulohumeral muscular dystrophy.
Single-molecule long-read sequencing is currently driving gains in genome assembly accuracy and completeness, but new technologies are being developed to generate long-range information, such as optical maps and dilution pool sequencing, that may aid in scaffolding complex regions.
The wealth of existing and emerging DNA-sequencing data provides an opportunity for a comprehensive understanding of human genetic variation, including the discovery of disease-causing variants. This Review describes how the limitations of current reference-genome assemblies confound the characterization of genetic variation and how this can be mitigated by important advances in algorithms and sequencing technology that facilitate the
de novo
assembly of genomes.
The discovery of genetic variation and the assembly of genome sequences are both inextricably linked to advances in DNA-sequencing technology. Short-read massively parallel sequencing has revolutionized our ability to discover genetic variation but is insufficient to generate high-quality genome assemblies or resolve most structural variation. Full resolution of variation is only guaranteed by complete
de novo
assembly of a genome. Here, we review approaches to genome assembly, the nature of gaps or missing sequences, and biases in the assembly process. We describe the challenges of generating a complete
de novo
genome assembly using current technologies and the impact that being able to perfectly sequence the genome would have on understanding human disease and evolution. Finally, we summarize recent technological advances that improve both contiguity and accuracy and emphasize the importance of complete
de novo
assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation.
Journal Article
CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning
by
Parks, Donovan H.
,
Tyson, Gene W.
,
Woodcroft, Ben J.
in
631/114/1305
,
631/114/2785
,
631/114/794
2023
Advances in sequencing technologies and bioinformatics tools have dramatically increased the recovery rate of microbial genomes from metagenomic data. Assessing the quality of metagenome-assembled genomes (MAGs) is a critical step before downstream analysis. Here, we present CheckM2, an improved method of predicting genome quality of MAGs using machine learning. Using synthetic and experimental data, we demonstrate that CheckM2 outperforms existing tools in both accuracy and computational speed. In addition, CheckM2’s database can be rapidly updated with new high-quality reference genomes, including taxa represented only by a single genome. We also show that CheckM2 accurately predicts genome quality for MAGs from novel lineages, even for those with reduced genome size (for example, Patescibacteria and the DPANN superphylum). CheckM2 provides accurate genome quality predictions across bacterial and archaeal lineages, giving increased confidence when inferring biological conclusions from MAGs.
This work presents CheckM2, which is a machine learning-based tool to predict genome quality of isolate, single-cell and metagenome-assembled genomes.
Journal Article
CheckV assesses the quality and completeness of metagenome-assembled viral genomes
by
Camargo, Antonio Pedro
,
Schulz, Frederik
,
Kyrpides, Nikos C.
in
631/114/2785
,
631/326/596/2142
,
Agriculture
2021
Millions of new viral sequences have been identified from metagenomes, but the quality and completeness of these sequences vary considerably. Here we present CheckV, an automated pipeline for identifying closed viral genomes, estimating the completeness of genome fragments and removing flanking host regions from integrated proviruses. CheckV estimates completeness by comparing sequences with a large database of complete viral genomes, including 76,262 identified from a systematic search of publicly available metagenomes, metatranscriptomes and metaviromes. After validation on mock datasets and comparison to existing methods, we applied CheckV to large and diverse collections of metagenome-assembled viral sequences, including IMG/VR and the Global Ocean Virome. This revealed 44,652 high-quality viral genomes (that is, >90% complete), although the vast majority of sequences were small fragments, which highlights the challenge of assembling viral genomes from short-read metagenomes. Additionally, we found that removal of host contamination substantially improved the accurate identification of auxiliary metabolic genes and interpretation of viral-encoded functions.
The quality of viral genomes assembled from metagenome data is assessed by CheckV.
Journal Article
Tradeoffs in alignment and assembly-based methods for structural variant detection with long-read sequencing data
2024
Long-read sequencing offers long contiguous DNA fragments, facilitating diploid genome assembly and structural variant (SV) detection. Efficient and robust algorithms for SV identification are crucial with increasing data availability. Alignment-based methods, favored for their computational efficiency and lower coverage requirements, are prominent. Alternative approaches, relying solely on available reads for de novo genome assembly and employing assembly-based tools for SV detection via comparison to a reference genome, demand significantly more computational resources. However, the lack of comprehensive benchmarking constrains our comprehension and hampers further algorithm development. Here we systematically compare 14 read alignment-based SV calling methods (including 4 deep learning-based methods and 1 hybrid method), and 4 assembly-based SV calling methods, alongside 4 upstream aligners and 7 assemblers. Assembly-based tools excel in detecting large SVs, especially insertions, and exhibit robustness to evaluation parameter changes and coverage fluctuations. Conversely, alignment-based tools demonstrate superior genotyping accuracy at low sequencing coverage (5-10×) and excel in detecting complex SVs, like translocations, inversions, and duplications. Our evaluation provides performance insights, highlighting the absence of a universally superior tool. We furnish guidelines across 31 criteria combinations, aiding users in selecting the most suitable tools for diverse scenarios and offering directions for further method development.
Long-read sequencing can greatly improve detection of genomic structural variants (SVs), and numerous methods have been developed to identify SVs using long-read data. Here the authors compare the performance of these methods and provide guidelines to aid users in selecting the most suitable tools for various scenarios.
Journal Article
GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes
by
Ranallo-Benavidez, T. Rhyker
,
Jaron, Kamil S.
,
Schatz, Michael C.
in
631/114
,
631/114/2785
,
631/1647/2217
2020
An important assessment prior to genome assembly and related analyses is genome profiling, where the k-mer frequencies within raw sequencing reads are analyzed to estimate major genome characteristics such as size, heterozygosity, and repetitiveness. Here we introduce GenomeScope 2.0 (
https://github.com/tbenavi1/genomescope2.0
), which applies combinatorial theory to establish a detailed mathematical model of how k-mer frequencies are distributed in heterozygous and polyploid genomes. We describe and evaluate a practical implementation of the polyploid-aware mixture model that quickly and accurately infers genome properties across thousands of simulated and several real datasets spanning a broad range of complexity. We also present a method called Smudgeplot (
https://github.com/KamilSJaron/smudgeplot
) to visualize and estimate the ploidy and genome structure of a genome by analyzing heterozygous k-mer pairs. We successfully apply the approach to systems of known variable ploidy levels in the
Meloidogyne
genus and the extreme case of octoploid
Fragaria
×
ananassa
.
Prior to genome assembly, the raw sequencing reads must be analyzed for assessment of major genome characteristics such as genome size, heterozygosity, and repetitiveness. For this purpose, the authors introduce GenomeScope 2.0, an extension of GenomeScope for polyploid genomes, and Smudgeplot, which can estimate a genome’s ploidy.
Journal Article
Haplotype-resolved assembly of diploid genomes without parental data
by
Jarvis, Erich D.
,
Fedrigo, Olivier
,
Gemmell, Neil J.
in
631/114/2785/2302
,
631/114/794
,
Agriculture
2022
Routine haplotype-resolved genome assembly from single samples remains an unresolved problem. Here we describe an algorithm that combines PacBio HiFi reads and Hi-C chromatin interaction data to produce a haplotype-resolved assembly without the sequencing of parents. Applied to human and other vertebrate samples, our algorithm consistently outperforms existing single-sample assembly pipelines and generates assemblies of similar quality to the best pedigree-based assemblies.
Haplotype-resolved genome assemblies are generated by combining HiFi reads with Hi-C long-range interactions.
Journal Article
AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence
by
Brover, Vyacheslav
,
Pettengill, James B.
,
Klimke, William
in
631/114
,
631/114/129
,
631/114/2785
2021
Antimicrobial resistance (AMR) is a significant public health threat. With the rise of affordable whole genome sequencing, in silico approaches to assessing AMR gene content can be used to detect known resistance mechanisms and potentially identify novel mechanisms. To enable accurate assessment of AMR gene content, as part of a multi-agency collaboration, NCBI developed a comprehensive AMR gene database, the Bacterial Antimicrobial Resistance Reference Gene Database and the AMR gene detection tool AMRFinder. Here, we describe the expansion of the Reference Gene Database, now called the Reference Gene Catalog, to include putative acid, biocide, metal, stress resistance genes, in addition to virulence genes and species-specific point mutations. Genes and point mutations are classified by broad functions, as well as more detailed functions. As we have expanded both the functional repertoire of identified genes and functionality, NCBI released a new version of AMRFinder, known as AMRFinderPlus. This new tool allows users the option to utilize only the core set of AMR elements, or include stress response and virulence genes, too. AMRFinderPlus can detect acquired genes and point mutations in both protein and nucleotide sequence. In addition, the evidence used to identify the gene has been expanded to include whether nucleotide or protein sequence was used, its location in the contig, and presence of an internal stop codon. These database improvements and functional expansions will enable increased precision in identifying AMR genes, linking AMR genotypes and phenotypes, and determining possible relationships between AMR, virulence, and stress response.
Journal Article
Sensitive protein alignments at tree-of-life scale using DIAMOND
by
Reuter, Klaus
,
Drost Hajk-Georg
,
Buchfink Benjamin
in
Amino acid sequence
,
Comparative analysis
,
Diamonds
2021
We are at the beginning of a genomic revolution in which all known species are planned to be sequenced. Accessing such data for comparative analyses is crucial in this new age of data-driven biology. Here, we introduce an improved version of DIAMOND that greatly exceeds previous search performances and harnesses supercomputing to perform tree-of-life scale protein alignments in hours, while matching the sensitivity of the gold standard BLASTP.An updated version of DIAMOND uses improved algorithmic procedures and a customized high-performance computing framework to make seemingly prohibitive large-scale protein sequence alignments feasible.
Journal Article
Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm
2021
Haplotype-resolved de novo assembly is the ultimate solution to the study of sequence variations in a genome. However, existing algorithms either collapse heterozygous alleles into one consensus copy or fail to cleanly separate the haplotypes to produce high-quality phased assemblies. Here we describe hifiasm, a de novo assembler that takes advantage of long high-fidelity sequence reads to faithfully represent the haplotype information in a phased assembly graph. Unlike other graph-based assemblers that only aim to maintain the contiguity of one haplotype, hifiasm strives to preserve the contiguity of all haplotypes. This feature enables the development of a graph trio binning algorithm that greatly advances over standard trio binning. On three human and five nonhuman datasets, including California redwood with a ~30-Gb hexaploid genome, we show that hifiasm frequently delivers better assemblies than existing tools and consistently outperforms others on haplotype-resolved assembly.Hifiasm is a haplotype-resolved de novo genome assembler for long-read high-fidelity sequencing data based on phased assembly graphs.
Journal Article
A joint NCBI and EMBL-EBI transcript set for clinical genomics and research
by
Cunningham, Fiona
,
Joardar, Vinita S.
,
Webb, David
in
631/114/129
,
631/114/2184
,
631/114/2416
2022
Comprehensive genome annotation is essential to understand the impact of clinically relevant variants. However, the absence of a standard for clinical reporting and browser display complicates the process of consistent interpretation and reporting. To address these challenges, Ensembl/GENCODE
1
and RefSeq
2
launched a joint initiative, the Matched Annotation from NCBI and EMBL-EBI (MANE) collaboration, to converge on human gene and transcript annotation and to jointly define a high-value set of transcripts and corresponding proteins. Here, we describe the MANE transcript sets for use as universal standards for variant reporting and browser display. The MANE Select set identifies a representative transcript for each human protein-coding gene, whereas the MANE Plus Clinical set provides additional transcripts at loci where the Select transcripts alone are not sufficient to report all currently known clinical variants. Each MANE transcript represents an exact match between the exonic sequences of an Ensembl/GENCODE transcript and its counterpart in RefSeq such that the identifiers can be used synonymously. We have now released MANE Select transcripts for 97% of human protein-coding genes, including all American College of Medical Genetics and Genomics Secondary Findings list v3.0 (ref.
3
) genes. MANE transcripts are accessible from major genome browsers and key resources. Widespread adoption of these transcript sets will increase the consistency of reporting, facilitate the exchange of data regardless of the annotation source and help to streamline clinical interpretation.
Matched Annotation from NCBI and EMBL-EBI (MANE) delivers joint transcript sets from Ensembl/GENCODE and RefSeq for standardizing variant reporting in clinical genomics and research.
Journal Article