Catalogue Search | MBRL

SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies

by Jiao, Wen-Biao , Sun, Hequan , Schneeberger, Korbinian in Animal Genetics and Genomics , Animals , Arabidopsis

2019

Genomic differences range from single nucleotide differences to complex structural variations. Current methods typically annotate sequence differences ranging from SNPs to large indels accurately but do not unravel the full complexity of structural rearrangements, including inversions, translocations, and duplications, where highly similar sequence changes in location, orientation, or copy number. Here, we present SyRI, a pairwise whole-genome comparison tool for chromosome-level assemblies. SyRI starts by finding rearranged regions and then searches for differences in the sequences, which are distinguished for residing in syntenic or rearranged regions. This distinction is important as rearranged regions are inherited differently compared to syntenic regions.

Journal Article

AnchorWave

by Song, Baoxing , Johnson, Lynn , Buckler, Edward S. in Alignment , Binding sites , Biological Sciences

2022

Millions of species are currently being sequenced, and their genomes are being compared. Many of them have more complex genomes than model systems and raise novel challenges for genome alignment. Widely used local alignment strategies often produce limited or incongruous results when applied to genomes with dispersed repeats, long indels, and highly diverse sequences. Moreover, alignment using many-to-many or reciprocal best hit approaches conflicts with well-studied patterns between species with different rounds of whole-genome duplication. Here, we introduce Anchored Wavefront alignment (AnchorWave), which performs whole-genome duplication–informed collinear anchor identification between genomes and performs base pair–resolved global alignment for collinear blocks using a two-piece affine gap cost strategy. This strategy enables AnchorWave to precisely identify multikilobase indels generated by transposable element (TE) presence/absence variants (PAVs). When aligning two maize genomes, AnchorWave successfully recalled 87% of previously reported TE PAVs. By contrast, other genome alignment tools showed low power for TE PAV recall. AnchorWave precisely aligns up to three times more of the genome as position matches or indels than the closest competitive approach when comparing diverse genomes. Moreover, AnchorWave recalls transcription factor–binding sites at a rate of 1.05- to 74.85-fold higher than other tools with significantly lower false-positive alignments. AnchorWave complements available genome alignment tools by showing obvious improvement when applied to genomes with dispersed repeats, active TEs, high sequence diversity, and whole-genome duplication variation.
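The two-piece affine gap cost named in the abstract can be illustrated with a minimal sketch. It assumes the common formulation in which a gap of length k costs the smaller of two affine penalties, one tuned for short gaps and one for long gaps. The function name and the parameter values below are hypothetical, chosen only for illustration, and are not taken from AnchorWave.

```python
def two_piece_gap_cost(k, open1=4, ext1=2, open2=24, ext2=1):
    """Cost of a gap of length k under a two-piece affine model.

    Piece 1 (open1, ext1) is cheap to open but expensive to extend, so it
    wins for short gaps. Piece 2 (open2, ext2) is expensive to open but
    cheap to extend, so it wins for long gaps.
    All parameter values are illustrative assumptions.
    """
    if k <= 0:
        return 0
    return min(open1 + k * ext1, open2 + k * ext2)


if __name__ == "__main__":
    for k in (1, 20, 100, 5000):
        print(k, two_piece_gap_cost(k))
```

With these illustrative values the two pieces cross at k = 20. Beyond that length each extra gap base costs only the lower extension penalty, which is why a model of this kind can keep a multikilobase indel as one long gap instead of breaking the alignment.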

Journal Article

Assembly and comparison of two closely related Brassica napus genomes

by Institut de biologie systémique et synthétique (ISSB) ; Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche Scientifique (CNRS) , The University of Western Australia (UWA) , Yuan, Yuxuan in Analysis , Annotations , Artefacts

2017

As an increasing number of plant genome sequences become available, it is clear that gene content varies between individuals, and the challenge arises to predict the gene content of a species. However, genome comparison is often confounded by variation in assembly and annotation. Differentiating between true gene absence and variation in assembly or annotation is essential for the accurate identification of conserved and variable genes in a species. Here, we present the de novo assembly of the B. napus cultivar Tapidor and comparison with an improved assembly of the Brassica napus cultivar Darmor-bzh. Both cultivars were annotated using the same method to allow comparison of gene content. We identified genes unique to each cultivar and differentiated these from artefacts due to variation in the assembly and annotation. We demonstrate that using a common annotation pipeline can result in different gene predictions, even for closely related cultivars, and that repeat regions which collapse during assembly impact whole-genome comparison. After accounting for differences in assembly and annotation, we demonstrate that the genome of Darmor-bzh contains a greater number of genes than the genome of Tapidor. Our results are the first step towards comparison of the true differences between B. napus genomes and highlight the potential sources of error in future production of a B. napus pangenome.

Journal Article

ACMGA: a reference-free multiple-genome alignment pipeline for plant species

by Song, Baoxing , Zhou, Huafeng , Su, Xiaoquan in Algorithms , Alignment , Analysis

2024

Background: The short-read whole-genome sequencing (WGS) approach has been widely applied to investigate the genomic variation in the natural populations of many plant species. With the rapid advancements in long-read sequencing and genome assembly technologies, high-quality genome sequences are available for a group of varieties for many plant species. These genome sequences are expected to help researchers comprehensively investigate any type of genomic variants that are missed by the WGS technology. However, multiple genome alignment (MGA) tools designed by the human genome research community might be unsuitable for plant genomes. Results: To fill this gap, we developed the AnchorWave-Cactus Multiple Genome Alignment (ACMGA) pipeline, which improved the alignment of repeat elements and could identify long (> 50 bp) deletions or insertions (INDELs). We conducted MGA using ACMGA and Cactus for 8 Arabidopsis (Arabidopsis thaliana) and 26 maize (Zea mays) de novo assembled genome sequences and compared them with the previously published short-read variant calling results. MGA identified more single nucleotide variants (SNVs) and long INDELs than did previously published WGS variant callings. Additionally, ACMGA detected significantly more SNVs and long INDELs in repetitive regions and the whole genome than did Cactus. Compared with the results of Cactus, the results of ACMGA were more similar to the previously published variants called using short reads. These two MGA pipelines identified numerous multi-allelic variants that were missed by the WGS variant calling pipeline. Conclusions: Aligning de novo assembled genome sequences could identify more SNVs and INDELs than mapping short reads. ACMGA combines the advantages of AnchorWave and Cactus and offers a practical solution for plant MGA by integrating global alignment, a 2-piece-affine-gap cost strategy, and the progressive MGA algorithm.

Journal Article

Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions

by Kim, Sung-Hou , Sims, Gregory E , Wu, Guohong A in Alphabets , Biological Sciences , Genes

2009

For comparison of whole-genome (genic + nongenic) sequences, multiple sequence alignment of a few selected genes is not appropriate. One approach is to use an alignment-free method in which feature (or l-mer) frequency profiles (FFP) of whole genomes are used for comparison--a variation of a text or book comparison method, using word frequency profiles. In this approach it is critical to identify the optimal resolution range of l-mers for the given set of genomes compared. The optimum FFP method is applicable for comparing whole genomes or large genomic regions even when there are no common genes with high homology. We outline the method in 3 stages: (i) We first show how the optimal resolution range can be determined with English books which have been transformed into long character strings by removing all punctuation and spaces. (ii) Next, we test the robustness of the optimized FFP method at the nucleotide level, using a mutation model with a wide range of base substitutions and rearrangements. (iii) Finally, to illustrate the utility of the method, phylogenies are reconstructed from concatenated mammalian intronic genomes; the FFP derived intronic genome topologies for each l within the optimal range are all very similar. The topology agrees with the established mammalian phylogeny revealing that intron regions contain a similar level of phylogenic signal as do coding regions.
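The idea of comparing sequences through l-mer frequency profiles can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the use of the Jensen-Shannon divergence as the distance between two profiles is an assumption made for this sketch.

```python
from collections import Counter
from math import log2


def ffp(text, l):
    """Return the normalized frequency profile of all overlapping l-mers."""
    counts = Counter(text[i:i + l] for i in range(len(text) - l + 1))
    total = sum(counts.values())
    return {mer: n / total for mer, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two frequency profiles.

    The result lies between 0 (identical profiles) and 1 (no shared l-mers).
    """
    d = 0.0
    for mer in set(p) | set(q):
        pk, qk = p.get(mer, 0.0), q.get(mer, 0.0)
        m = (pk + qk) / 2
        if pk > 0:
            d += 0.5 * pk * log2(pk / m)
        if qk > 0:
            d += 0.5 * qk * log2(qk / m)
    return d


if __name__ == "__main__":
    a = "ACGTACGTACGTTTGA"
    b = "ACGTACGAACGTTTGA"
    # Scan a range of l values; the abstract stresses that the useful
    # range of l has to be determined for the genomes being compared.
    for l in range(1, 6):
        print(l, round(js_divergence(ffp(a, l), ffp(b, l)), 4))
```

Scanning several values of l, as in the usage loop, mirrors the abstract's point that the resolution has to be chosen for the given set of sequences. No alignment is computed at any step.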

Journal Article

Sequencing and Analysis of Complete Chloroplast Genomes Provide Insight into the Evolution and Phylogeny of Chinese Kale (Brassica oleracea var. alboglabra)

by Li, Mengyao , Sun, Bo , Zhang, Chenlu in Amino acids , Analysis , Biosynthesis

2023

Chinese kale is a widely cultivated plant in the genus Brassica in the family Brassicaceae. The origin of Brassica has been studied extensively, but the origin of Chinese kale remains unclear. In contrast to Brassica oleracea, which originated in the Mediterranean region, Chinese kale originated in southern China. The chloroplast genome is often used for phylogenetic analysis because it is highly conserved. Fifteen pairs of universal primers were used to amplify the chloroplast genomes of white-flower Chinese kale (Brassica oleracea var. alboglabra cv. Sijicutiao (SJCT)) and yellow-flower Chinese kale (Brassica oleracea var. alboglabra cv. Fuzhouhuanghua (FZHH)) via PCR. The lengths of the chloroplast genomes were 153,365 bp (SJCT) and 153,420 bp (FZHH), and both contained 87 protein-coding genes and eight rRNA genes. There were 36 tRNA genes in SJCT and 35 tRNA genes in FZHH. The chloroplast genomes of both Chinese kale varieties, along with eight other Brassicaceae, were analyzed. Simple sequence repeats, long repeats, and variable regions of DNA barcodes were identified. An analysis of inverted repeat boundaries, relative synonymous codon usage, and synteny revealed high similarity among the ten species, albeit with slight differences. The Ka/Ks ratios and phylogenetic analysis suggest that Chinese kale is a variant of B. oleracea. The phylogenetic tree shows that both Chinese kale varieties and B. oleracea var. oleracea were clustered in a single group. The results of this study suggest that white and yellow flower Chinese kale comprise a monophyletic group and that their differences in flower color arose late in the process of artificial cultivation. Our results also provide data that will aid future research on genetics, evolution, and germplasm resources of Brassicaceae.

Journal Article

Investigating the impact of reference assembly choice on genomic analyses in a cattle breed

by Bhati, Meenu , Fries, Ruedi , Lloret-Villas, Audald in Accuracy , Alignment quality , Animal Genetics and Genomics

2021

Background: Reference-guided read alignment and variant genotyping are prone to reference allele bias, particularly for samples that are greatly divergent from the reference genome. A Hereford-based assembly is the widely accepted bovine reference genome. Haplotype-resolved genomes that exceed the current bovine reference genome in quality and continuity have been assembled for different breeds of cattle. Using whole genome sequencing data of 161 Brown Swiss cattle, we compared the accuracy of read mapping and sequence variant genotyping as well as downstream genomic analyses between the bovine reference genome (ARS-UCD1.2) and a highly continuous Angus-based assembly (UOA_Angus_1). Results: Read mapping accuracy did not differ notably between the ARS-UCD1.2 and UOA_Angus_1 assemblies. We discovered 22,744,517 and 22,559,675 high-quality variants from ARS-UCD1.2 and UOA_Angus_1, respectively. The concordance between sequence- and array-called genotypes was high and the number of variants deviating from Hardy-Weinberg proportions was low at segregating sites for both assemblies. More artefactual INDELs were genotyped from UOA_Angus_1 than ARS-UCD1.2 alignments. Using the composite likelihood ratio test, we detected 40 and 33 signatures of selection from ARS-UCD1.2 and UOA_Angus_1, respectively, but the overlap between both assemblies was low. Using the 161 sequenced Brown Swiss cattle as a reference panel, we imputed sequence variant genotypes into a mapping cohort of 30,499 cattle that had microarray-derived genotypes using a two-step imputation approach. The accuracy of imputation (Beagle R²) was very high (0.87) for both assemblies. Genome-wide association studies between imputed sequence variant genotypes and six dairy traits as well as stature produced almost identical results from both assemblies. Conclusions: The ARS-UCD1.2 and UOA_Angus_1 assemblies are suitable for reference-guided genome analyses in Brown Swiss cattle. Although differences in read mapping and genotyping accuracy between both assemblies are negligible, the choice of the reference genome has a large impact on detecting signatures of selection that already reached fixation using the composite likelihood ratio test. We developed a workflow that can be adapted and reused to compare the impact of reference genomes on genome analyses in various breeds, populations and species.

Journal Article

The Complete Chloroplast Genome of a Key Ancestor of Modern Roses, Rosa chinensis var. spontanea, and a Comparison with Congeneric Species

by Wang, Qi-Gang , Jian, Hong-Ying , Zhang, Shu-Dong in Base Composition , Biological Evolution , China

2018

Rosa chinensis var. spontanea, an endemic and endangered plant of China, is one of the key ancestors of modern roses and a source for famous traditional Chinese medicines against female diseases, such as irregular menses and dysmenorrhea. In this study, the complete chloroplast (cp) genome of R. chinensis var. spontanea was sequenced, analyzed, and compared to congeneric species. The cp genome of R. chinensis var. spontanea is a typical quadripartite circular molecule of 156,590 bp in length, including one large single copy (LSC) region of 85,910 bp and one small single copy (SSC) region of 18,762 bp, separated by two inverted repeat (IR) regions of 25,959 bp. The GC content of the whole genome is 37.2%, while that of LSC, SSC, and IR is 42.8%, 35.2% and 31.2%, respectively. The genome encodes 129 genes, including 84 protein-coding genes (PCGs), 37 transfer RNA (tRNA) genes, and eight ribosomal RNA (rRNA) genes. Seventeen genes in the IR regions were found to be duplicated. Thirty-three forward and five inverted repeats were detected in the cp genome of R. chinensis var. spontanea. The genome is rich in SSRs. In total, 85 SSRs were detected. A genome comparison revealed that IR contraction might be the reason for the relatively smaller cp genome size of R. chinensis var. spontanea compared to other congeneric species. Sequence analysis revealed that the LSC and SSC regions were more divergent than the IR regions within the genus Rosa and that a higher divergence occurred in non-coding regions than in coding regions. A phylogenetic analysis showed that the sampled species of the genus Rosa formed a monophyletic clade and that R. chinensis var. spontanea shared a more recent ancestor with R. lichiangensis of the section Synstylae than with R. odorata var. gigantea of the section Chinenses. This information will be useful for the conservation genetics of R. chinensis var. spontanea and for the phylogenetic study of the genus Rosa, and it might also facilitate the genetics and breeding of modern roses.

Journal Article

Comparative genomics uncovers the prolific and distinctive metabolic potential of the cyanobacterial genus Moorea

by Gerwick, William H. , Allen, Eric E. , Leao, Tiago in Bacteria , Biological Sciences , Comparative analysis

2017

Cyanobacteria are major sources of oxygen, nitrogen, and carbon in nature. In addition to the importance of their primary metabolism, some cyanobacteria are prolific producers of unique and bioactive secondary metabolites. Chemical investigations of the cyanobacterial genus Moorea have resulted in the isolation of over 190 compounds in the last two decades. However, preliminary genomic analysis has suggested that genome-guided approaches can enable the discovery of novel compounds from even well-studied Moorea strains, highlighting the importance of obtaining complete genomes. We report a complete genome of a filamentous tropical marine cyanobacterium, Moorea producens PAL, which reveals that about one-fifth of its genome is devoted to production of secondary metabolites, an impressive four times the cyanobacterial average. Moreover, possession of the complete PAL genome has allowed improvement to the assembly of three other Moorea draft genomes. Comparative genomics revealed that they are remarkably similar to one another, despite their differences in geography, morphology, and secondary metabolite profiles. Gene cluster networking highlights that this genus is distinctive among cyanobacteria, not only in the number of secondary metabolite pathways but also in the content of many pathways, which are potentially distinct from all other bacterial gene clusters to date. These findings portend that future genome-guided secondary metabolite discovery and isolation efforts should be highly productive.

Journal Article

Genome Mining of Pseudomonas Species: Diversity and Evolution of Metabolic and Biosynthetic Potential

by Zhang, Youming , Li, Ruijuan , Yu, Guangle in Algorithms , Antibiotics , Antimicrobial agents

2021

Microbial genome sequencing has uncovered a myriad of natural products (NPs) that have yet to be explored. Bacteria in the genus Pseudomonas serve as pathogens, plant growth promoters, and therapeutically, industrially, and environmentally important microorganisms. Though most species of Pseudomonas have a large number of NP biosynthetic gene clusters (BGCs) in their genomes, it is difficult to link many of these BGCs with products under current laboratory conditions. In order to gain new insights into the diversity, distribution, and evolution of these BGCs in Pseudomonas for the discovery of unexplored NPs, we applied several bioinformatic programming approaches to characterize BGCs from Pseudomonas reference genome sequences available in public databases along with phylogenetic and genomic comparison. Our research revealed that most BGCs in the genomes of Pseudomonas species have a high diversity for NPs at the species and subspecies levels and built the correlation of species with BGC taxonomic ranges. These data will pave the way for the algorithmic detection of species- and subspecies-specific pathways for NP development.

Journal Article
