Catalogue Search | MBRL

Symmetry of Siberian Larch Transcriptome

by Birukov, Vladislav V. , Oreshkova, Nataliya V. , Sadovsky, Michael G. in frequency dictionary , Larix , Larix sibirica

2015

The paper presents a novel approach to inferring structure in a set of symbol sequences, such as transcriptome nucleotide sequences. The study investigated the distribution pattern of triplet frequencies in the Siberian larch (Larix sibirica Ledeb.) transcriptome sequences. It found that the larch transcriptome demonstrates a number of unexpected symmetries in its statistical and combinatorial properties.
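
As an illustration of the "triplet frequencies" mentioned in this abstract, the following is a minimal sketch of how a triplet frequency dictionary could be built for one nucleotide sequence. The function name `triplet_frequencies`, the `step` parameter, and the choice to skip windows containing non-ACGT symbols are assumptions made for this example; the abstract does not describe the authors' actual procedure.

```python
from collections import Counter

# Assumption: only the four canonical bases are counted; windows that
# contain any other symbol (e.g. N) are skipped.
VALID_BASES = frozenset("ACGT")


def triplet_frequencies(seq, step=1):
    """Return {triplet: relative frequency} for overlapping windows of length 3.

    `step` is a hypothetical parameter: step=1 reads every overlapping
    triplet, step=3 reads non-overlapping triplets in a single frame.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(0, len(seq) - 2, step):
        triplet = seq[i:i + 3]
        if set(triplet) <= VALID_BASES:
            counts[triplet] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {triplet: n / total for triplet, n in counts.items()}


if __name__ == "__main__":
    # Windows of "ACGACG": ACG, CGA, GAC, ACG
    print(triplet_frequencies("ACGACG"))
```

Dictionaries built this way for different sequences can then be compared with each other, which is one plausible starting point for looking at regularities in triplet usage.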

Journal Article

Development of High-Density Genetic Maps for Barley and Wheat Using a Novel Two-Enzyme Genotyping-by-Sequencing Approach

by Jannink, Jean-Luc , Brown, Patrick J , Sorrells, Mark E in Agriculture , Anchoring , Animal behavior

2012

Advancements in next-generation sequencing technology have enabled whole genome re-sequencing in many species providing unprecedented discovery and characterization of molecular polymorphisms. There are limitations, however, to next-generation sequencing approaches for species with large complex genomes such as barley and wheat. Genotyping-by-sequencing (GBS) has been developed as a tool for association studies and genomics-assisted breeding in a range of species including those with complex genomes. GBS uses restriction enzymes for targeted complexity reduction followed by multiplex sequencing to produce high-quality polymorphism data at a relatively low per sample cost. Here we present a GBS approach for species that currently lack a reference genome sequence. We developed a novel two-enzyme GBS protocol and genotyped bi-parental barley and wheat populations to develop a genetically anchored reference map of identified SNPs and tags. We were able to map over 34,000 SNPs and 240,000 tags onto the Oregon Wolfe Barley reference map, and 20,000 SNPs and 367,000 tags on the Synthetic W9784×Opata85 (SynOpDH) wheat reference map. To further evaluate GBS in wheat, we also constructed a de novo genetic map using only SNP markers from the GBS data. The GBS approach presented here provides a powerful method of developing high-density markers in species without a sequenced genome while providing valuable tools for anchoring and ordering physical maps and whole-genome shotgun sequence. Development of the sequenced reference genome(s) will in turn increase the utility of GBS data enabling physical mapping of genes and haplotype imputation of missing data. Finally, as a result of low per-sample costs, GBS will have broad application in genomics-assisted plant breeding programs.

Journal Article

Highly accurate long reads are crucial for realizing the potential of biodiversity genomics

by Frandsen, Paul B. , Hotaling, Scott , Stewart, Russell J. in Accuracy , Analysis , Animal behavior

2023

Background: Generating the most contiguous, accurate genome assemblies given available sequencing technologies is a long-standing challenge in genome science. With the rise of long-read sequencing, assembly challenges have shifted from merely increasing contiguity to correctly assembling complex, repetitive regions of interest, ideally in a phased manner. At present, researchers largely choose between two types of long-read data: longer, but less accurate sequences, or highly accurate, but shorter reads (i.e., >Q20 or 99% accurate). To better understand how these types of long-read data as well as scale of data (i.e., mean length and sequencing depth) influence genome assembly outcomes, we compared genome assemblies for a caddisfly, Hesperophylax magnus, generated with longer, but less accurate, Oxford Nanopore (ONT) R9.4.1 and highly accurate PacBio HiFi (HiFi) data. Next, we expanded this comparison to consider the influence of highly accurate long-read sequence data on genome assemblies across 6750 plant and animal genomes. For this broader comparison, we used HiFi data as a surrogate for highly accurate long-reads broadly as we could identify when they were used from GenBank metadata.
Results: HiFi reads outperformed ONT reads in all assembly metrics tested for the caddisfly data set and allowed for accurate assembly of the repetitive ~20 Kb H-fibroin gene. Across plants and animals, genome assemblies that incorporated HiFi reads were also more contiguous. For plants, the average HiFi assembly was 501% more contiguous (mean contig N50 = 20.5 Mb) than those generated with any other long-read data (mean contig N50 = 4.1 Mb). For animals, HiFi assemblies were 226% more contiguous (mean contig N50 = 20.9 Mb) versus other long-read assemblies (mean contig N50 = 9.3 Mb). In plants, we also found limited evidence that HiFi may offer a unique solution for overcoming genomic complexity that scales with assembly size.
Conclusions: Highly accurate long-reads generated with HiFi or analogous technologies represent a key tool for maximizing genome assembly quality for a wide swath of plants and animals. This finding is particularly important when resources only allow for one type of sequencing data to be generated. Ultimately, to realize the promise of biodiversity genomics, we call for greater uptake of highly accurate long-reads in future studies.
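
The contiguity figures in this abstract are reported as contig N50. As a minimal sketch of that metric, under the usual definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length), it could be computed as below. The function name `contig_n50` and the sample lengths are illustrative only and do not come from the paper.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the cumulative length first reaches half the total.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Illustrative lengths (arbitrary units), not data from the study.
    print(contig_n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because N50 depends only on the sorted length distribution, a few very long contigs can raise it sharply, which is why it is typically read alongside other assembly metrics.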

Journal Article

A complete telomere-to-telomere assembly of the maize genome

by Schnable, James C. , Li, Tong , Hu, Jiang in 14/32 , 45/15 , 45/23

2023

A complete telomere-to-telomere (T2T) finished genome has been the long pursuit of genomic research. Through generating deep coverage ultralong Oxford Nanopore Technology (ONT) and PacBio HiFi reads, we report here a complete genome assembly of maize with each chromosome entirely traversed in a single contig. The 2,178.6 Mb T2T Mo17 genome with a base accuracy of over 99.99% unveiled the structural features of all repetitive regions of the genome. There were several super-long simple-sequence-repeat arrays having consecutive thymine–adenine–guanine (TAG) tri-nucleotide repeats up to 235 kb. The assembly of the entire nucleolar organizer region of the 26.8 Mb array with 2,974 45S rDNA copies revealed the enormously complex patterns of rDNA duplications and transposon insertions. Additionally, complete assemblies of all ten centromeres enabled us to precisely dissect the repeat compositions of both CentC-rich and CentC-poor centromeres. The complete Mo17 genome represents a major step forward in understanding the complexity of the highly recalcitrant repetitive regions of higher plant genomes. A complete telomere-to-telomere genome assembly of the maize Mo17 inbred line uncovers structural features of the highly complex maize genome.

Journal Article

Mining sequence variations in representative polyploid sugarcane germplasm accessions

by Yang, Xiping , You, Qian , Wang, Jianping in 09 BIOMASS FUELS , Agronomy , Alignment

2017

Background: Sugarcane (Saccharum spp.) is one of the most important economic crops because of its high sugar production and biofuel potential. Due to the high polyploid level and complex genome of sugarcane, it has been a huge challenge to investigate genomic sequence variations, which are critical for identifying alleles contributing to important agronomic traits. In order to mine the genetic variations in sugarcane, genotyping by sequencing (GBS) was used to genotype 14 representative Saccharum complex accessions. GBS is a method to generate a large number of markers, enabled by next generation sequencing (NGS) and genome complexity reduction using restriction enzymes.
Results: To use GBS for high-throughput genotyping of highly polyploid sugarcane, GBS analysis pipelines for the 14 Saccharum complex accessions were established by evaluating different alignment methods, sequence variant callers, and sequence depths for single nucleotide polymorphism (SNP) filtering. By using the established pipeline, a total of 76,251 non-redundant SNPs, 5642 InDels, 6380 presence/absence variants (PAVs), and 826 copy number variations (CNVs) were detected among the 14 accessions. In addition, the non-reference-based Universal Network Enabled Analysis Kit (UNEAK) and Stacks de novo called 34,353 and 109,043 SNPs, respectively. In the 14 accessions, the percentages of single dose SNPs ranged from 38.3% to 62.3% with an average of 49.6%, much more than the portions of multiple dosage SNPs. Concordantly called SNPs were used to evaluate the phylogenetic relationship among the 14 accessions. The results showed that the divergence time between the Erianthus genus and the Saccharum genus was more than 10 million years ago (MYA). The Saccharum species separated from their common ancestors ranging from 0.19 to 1.65 MYA.
Conclusions: The GBS pipelines including the reference sequences, alignment methods, sequence variant callers, and sequence depth were recommended and discussed for the Saccharum complex and other related species. A large number of sequence variations were discovered in the Saccharum complex, including SNPs, InDels, PAVs, and CNVs. Genome-wide SNPs were further used to illustrate sequence features of polyploid species and demonstrated the divergence of different species in the Saccharum complex. The results of this study showed that GBS was an effective NGS-based method to discover genomic sequence variations in highly polyploid and heterozygous species.

Journal Article

A global survey of alternative splicing in allopolyploid cotton

by Fan Liang , Feng Wang , Liuling Pei in allopolyploidy , Alternative Splicing , alternative splicing (AS)

2018

Alternative splicing (AS) is a crucial regulatory mechanism in eukaryotes, which acts by greatly increasing transcriptome diversity. The extent and complexity of AS has been revealed in model plants using high-throughput next-generation sequencing. However, this technique is less effective in accurately identifying transcript isoforms in polyploid species because of the high sequence similarity between coexisting subgenomes. Here we characterize AS in the polyploid species cotton. Using Pacific Biosciences single-molecule long-read isoform sequencing (Iso-Seq), we developed an integrated pipeline for Iso-Seq transcriptome data analysis (https://github.com/Nextomics/pipeline-for-isoseq). We identified 176 849 full-length transcript isoforms from 44 968 gene models and updated gene annotation. These data led us to identify 15 102 fibre-specific AS events and estimate that c. 51.4% of homoeologous genes produce divergent isoforms in each subgenome. We reveal that AS allows differential regulation of the same gene by miRNAs at the isoform level. We also show that nucleosome occupancy and DNA methylation play a role in defining exons at the chromatin level. This study provides new insights into the complexity and regulation of AS, and will enhance our understanding of AS in polyploid species. Our methodology for Iso-Seq data analysis will be a useful reference for the study of AS in other species.

Journal Article

Probing the physical limits of reliable DNA data retrieval

by Dumas Ang, Siena , Strauss, Karin , Organick, Lee in 631/61/338 , 639/705/117 , Base Sequence

2020

Synthetic DNA is gaining momentum as a potential storage medium for archival data storage. In this process, digital information is translated into sequences of nucleotides and the resulting synthetic DNA strands are then stored for later retrieval. Here, we demonstrate reliable file recovery with PCR-based random access when as few as ten copies per sequence are stored, on average. This results in a density of about 17 exabytes/gram, nearly two orders of magnitude greater than prior work has shown. We successfully retrieve the same data in a complex pool of over 10^10 unique sequences per microliter with no evidence that we have begun to approach complexity limits. Finally, we also investigate the effects of file size and sequencing coverage on successful file retrieval and look for systematic DNA strand drop out. These findings substantiate the robustness and high data density of the process examined here. The physical limits and reliability of PCR-based random access of DNA encoded data are unknown. Here the authors demonstrate reliable file recovery from as few as ten copies per sequence, providing a data density limit of 17 exabytes per gram.

Journal Article

Whole-genome and targeted sequencing of drug-resistant Mycobacterium tuberculosis on the iSeq100 and MiSeq: A performance, ease-of-use, and cost evaluation

by Rodwell, Timothy C. , Colman, Rebecca E. , Mace, Aurélien in Antitubercular agents , Bioinformatics , Capital costs

2019

Accurate, comprehensive, and timely detection of drug-resistant tuberculosis (TB) is essential to inform patient treatment and enable public health surveillance. This is crucial for effective control of TB globally. Whole-genome sequencing (WGS) and targeted next-generation sequencing (NGS) approaches have potential as rapid in vitro diagnostics (IVDs), but the complexity of workflows, interpretation of results, high costs, and vulnerability of instrumentation have been barriers to broad uptake outside of reference laboratories, especially in low- and middle-income countries. A new, solid-state, tabletop sequencing instrument, Illumina iSeq100, has the potential to decentralize NGS for individual patient care. In this study, we evaluated WGS and targeted NGS for TB on both the new iSeq100 and the widely used MiSeq (both manufactured by Illumina) and compared sequencing performance, costs, and usability. We utilized DNA libraries produced from Mycobacterium tuberculosis clinical isolates for the evaluation. We conducted WGS on three strains and observed equivalent uniform genome coverage with both platforms and found the depth of coverage obtained was consistent with the expected data output. Utilizing the standardized, cloud-based ReSeqTB bioinformatics pipeline for variant analysis, we found the two platforms to have 94.0% (CI 93.1%-94.8%) agreement, in comparison to 97.6% (CI 97%-98.1%) agreement for the same libraries on two MiSeq instruments. For the targeted NGS approach, 46 M. tuberculosis-specific amplicon libraries had 99.6% (CI 98.0%-99.9%) agreement between the iSeq100 and MiSeq data sets in drug resistance-associated SNPs. The upfront capital costs are almost 5-fold lower for the iSeq100 ($19,900 USD) platform in comparison to the MiSeq ($99,000 USD); however, because of differences in batching capabilities, the price per sample for WGS was higher on the iSeq100. For WGS of M. tuberculosis at the minimum depth of coverage of 30x, the cost per sample on the iSeq100 was $69.44 USD versus $28.21 USD on the MiSeq, assuming a 2 × 150 bp run on a v3 kit. In terms of ease of use, the sequencing workflow of iSeq100 has been optimized to only require 27 minutes total of hands-on time pre- and post-run, and the maintenance is simplified by a single-use cartridge-based fluidic system. As these are the first sequencing attempts on the iSeq100 for M. tuberculosis, the sequencing pool loading concentration still needs optimization, which will affect sequencing error and depth of coverage. Additionally, the costs are based on current equipment and reagent costs, which are subject to change. The iSeq100 instrument is capable of running existing TB WGS and targeted NGS library preparations with comparable accuracy to the MiSeq. The iSeq100 has reduced sequencing workflow hands-on time and is able to deliver sequencing results in <24 hours. Reduced capital and maintenance costs and lower-throughput capabilities also give the iSeq100 an advantage over MiSeq in settings of individualized care but not in high-throughput settings such as reference laboratories, where sample batching can be optimized to minimize cost at the expense of workflow complexity and time.

Journal Article

Monkeypox virus genomic accordion strategies

by García-Sastre, Adolfo , Sánchez-Seco, Maripaz P. , Vidal-Freire, Santiago in 45/23 , 631/181/735 , 631/326/596/1746

2024

The 2023 monkeypox (mpox) epidemic was caused by a subclade IIb descendant of a monkeypox virus (MPXV) lineage traced back to Nigeria in 1971. Person-to-person transmission appears higher than for clade I or subclade IIa MPXV, possibly caused by genomic changes in subclade IIb MPXV. Key genomic changes could occur in the genome’s low-complexity regions (LCRs), which are challenging to sequence and are often dismissed as uninformative. Here, using a combination of highly sensitive techniques, we determine a high-quality MPXV genome sequence of a representative of the current epidemic with LCRs resolved at unprecedented accuracy. This reveals significant variation in short tandem repeats within LCRs. We demonstrate that LCR entropy in the MPXV genome is significantly higher than that of single-nucleotide polymorphisms (SNPs) and that LCRs are not randomly distributed. In silico analyses indicate that expression, translation, stability, or function of MPXV orthologous poxvirus genes (OPGs), including OPG153, OPG204, and OPG208, could be affected in a manner consistent with the established “genomic accordion” evolutionary strategies of orthopoxviruses. We posit that genomic studies focusing on phenotypic MPXV differences should consider LCR variability. The 2023 monkeypox outbreak was caused by a subclade IIb monkeypox virus (MPXV). Here, using advanced sequencing techniques, the authors identify variations in low-complexity regions of the MPXV genome and describe their potential as evolutionary drivers.

Journal Article

Improved Algorithmic Complexity for the 3SEQ Recombination Detection Algorithm

by Lam, Ha Minh , Boni, Maciej F , Ratmann, Oliver in Algorithms , Biology , Complexity

2018

Identifying recombinant sequences in an era of large genomic databases is challenging as it requires an efficient algorithm to identify candidate recombinants and parents, as well as appropriate statistical methods to correct for the large number of comparisons performed. In 2007, a computation was introduced for an exact nonparametric mosaicism statistic that gave high-precision P values for putative recombinants. This exact computation meant that multiple-comparisons corrected P values also had high precision, which is crucial when performing millions or billions of tests in large databases. Here, we introduce an improvement to the algorithmic complexity of this computation from O(mn^3) to O(mn^2), where m and n are the numbers of recombination-informative sites in the candidate recombinant. This new computation allows for recombination analysis to be performed in alignments with thousands of polymorphic sites. Benchmark runs are presented on viral genome sequence alignments, new features are introduced, and applications outside recombination analysis are discussed.

Journal Article
