Catalogue Search | MBRL
741 result(s) for "Sequence comparison"
Benchmarking of alignment-free sequence comparison methods
by Zielezinski, Andrzej; Bernard, Guillaume; Kim, Sung-Hou
in Algorithms, Alignment-free, Amino acid sequence
2019
Background
Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment.
Results
Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events.
Conclusion
The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.
Journal Article
Progress in quickly finding orthologs as reciprocal best hits: comparing blast, last, diamond and MMseqs2
by Hernández-Salmerón, Julie E.; Moreno-Hagelsieb, Gabriel
in Algorithms, Amino Acid Sequence, Animal Genetics and Genomics
2020
Background
Finding orthologs remains an important bottleneck in comparative genomics analyses. While the authors of software for the quick comparison of protein sequences evaluate the speed of their software and compare their results against the most usual software for the task, it is not common for them to evaluate their software for more particular uses, such as finding orthologs as reciprocal best hits (RBH). Here we compared RBH results obtained using software that runs faster than blastp, namely lastal, diamond, and MMseqs2.
Results
We found that lastal required the least time to produce results. However, it yielded fewer results than any other program when comparing the proteins encoded by evolutionarily distant genomes. The program producing the most similar number of RBH to blastp was diamond run with the “ultra-sensitive” option. However, this option was diamond’s slowest, with the “very-sensitive” option offering the best balance between speed and RBH results. The speeding up of the programs was much more evident when dealing with eukaryotic genomes, which code for more numerous proteins. For example, lastal took a median of approx. 1.5% of the blastp time to run with bacterial proteomes and 0.6% with eukaryotic ones, while diamond with the very-sensitive option took 7.4% and 5.2%, respectively. Though estimated error rates were very similar among the RBH obtained with all programs, RBH obtained with MMseqs2 had the lowest error rates among the programs tested.
Conclusions
The fast algorithms for pairwise protein comparison produced results very similar to blast in a fraction of the time, with diamond offering the best compromise in speed, sensitivity and quality, as long as a sensitivity option, other than the default, was chosen.
Journal Article
Kssd: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis
by Lin, Chengqi; Jin, Wenfei; Lin, Yanling
in Accuracy, Algorithms, Animal Genetics and Genomics
2021
Here, we develop k-mer substring space decomposition (Kssd), a sketching technique which is significantly faster and more accurate than current sketching methods. We show that it is the only method that can be used for large-scale dataset comparisons at population resolution on simulated and real data. Using Kssd, we prioritize references for all 1,019,179 bacteria whole genome sequencing (WGS) runs from NCBI Sequence Read Archive and find misidentification or contamination in 6164 of these. Additionally, we analyze WGS and exome runs of samples from the 1000 Genomes Project.
Journal Article
CGRclust: Chaos Game Representation for twin contrastive clustering of unlabelled DNA sequences
by Alipour, Fatemeh; Kari, Lila; Hill, Kathleen A.
in Algorithms, Alignment, Alignment-free DNA sequence comparison
2024
Background
Traditional supervised learning methods applied to DNA sequence taxonomic classification rely on the labor-intensive and time-consuming step of labelling the primary DNA sequences. Additionally, standard DNA classification/clustering methods involve time-intensive multiple sequence alignments, which impacts their applicability to large genomic datasets or distantly related organisms. These limitations indicate a need for robust, efficient, and scalable unsupervised DNA sequence clustering methods that do not depend on sequence labels or alignment.
Results
This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences, with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. Moreover, CGRclust also consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in terms of sequence length, number of genomes, number of clusters, and level of taxonomy, demonstrates its robustness, scalability, and versatility.
Conclusion
CGRclust is a novel, scalable, alignment-free DNA sequence clustering method that uses CGR images of DNA sequences and CNNs for twin contrastive clustering of unlabelled primary DNA sequences, achieving superior or comparable accuracy and performance over current approaches. CGRclust demonstrated enhanced reliability, by consistently achieving over 80% accuracy in more than 90% of the datasets analyzed. In particular, CGRclust performed especially well in clustering viral DNA datasets, where it consistently outperformed all competing methods.
Journal Article
Characterization and Classification of LMW-GS Genes at the Glu-3 Locus of Bread Wheat
2025
Low Molecular Weight Glutenin Subunits (LMW-GS) proteins have great effects on the end-use quality of bread wheat and are difficult to differentiate directly. It is very important to characterize and classify LMW-GS genes systematically. In this paper, 692 complete Glu-3 gene sequences were retrieved from GenBank and were grouped based on their sequence characters and variations. Based on the characters of their N-terminal sequences, these genes were classified into two types, LMW-m and LMW-i, of which LMW-m genes were further classified into three sub-types based on their first amino acid (AA) (LMW-M, LMW-V and LMW-I). Based on the first seven or eight AA variations in the N-terminal sequence, LMW-GS Glu-3 genes were classified into 16 types, namely LMW-N1 to LMW-N16. Based on the last 10 AA variations in the C-terminal, the Glu-3 genes were classified into 22 types, designated as LMW-C1 to LMW-C22. Based on the number and distribution of cysteines, the Glu-3 genes classified into 22 types included 7 conventional types with eight cysteines and 15 variant types with seven or nine cysteines. In addition, two new Glu-A3 genes (GluA-10 and GluA-11) were identified based on their sequence homology, and the connection between different classification methods was analyzed briefly. The results provide insight into the nature of the Glu-3 gene family and are valuable for molecular marker-assisted selection of end-use quality traits in wheat improvement.
Journal Article
Biochemical Property Based Positional Matrix: A New Approach Towards Genome Sequence Comparison
2023
The growth of the genome sequence has become one of the emerging areas in the study of bioinformatics. It has led to an excessive demand for researchers to develop advanced methodologies for evolutionary relationships among species. The alignment-free methods have been proved to be more efficient and appropriate related to time and space than existing alignment-based methods for sequence analysis. In this study, a new alignment-free genome sequence comparison technique is proposed based on the biochemical properties of nucleotides. Each genome sequence can be distributed in four parameters to represent a 21-dimensional numerical descriptor using the Positional Matrix. To substantiate the proposed method, phylogenetic trees are constructed on the viral and mammalian datasets by applying the UPGMA/NJ clustering method. Further, the results of this method are compared with the results of the Feature Frequency Profiles method, the Positional Correlation Natural Vector method, the Graph-theoretic method, the Multiple Encoding Vector method, and the Fuzzy Integral Similarity method. In most cases, it is found that the present method produces more accurate results than the prior methods. Also, in the present method, the execution time for computation is comparatively small.
Journal Article
S-conLSH: alignment-free gapped mapping of noisy long reads
by Chakraborty, Angana; Bandyopadhyay, Sanghamitra; Morgenstern, Burkhard
in Algorithms, Alignment, Alignment-free sequence comparison
2021
Background
The advancement of SMRT technology has unfolded new opportunities of genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, the state-of-the-art aligners often fail to identify distant homologies due to lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate.
Results
We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the recently developed method lordFAST. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option of generating aligned output as SAM-file, if it is required for any downstream processing.
Conclusions
S-conLSH is one of the first alignment-free reference genome mapping tools achieving a high level of sensitivity. The spaced-context is especially suitable for extracting distant similarities. The variable-length spaced-seeds or patterns add flexibility to the proposed algorithm by introducing gapped mapping of the noisy long reads. Therefore, S-conLSH may be considered as a prominent direction towards alignment-free sequence analysis.
Journal Article
Assembly, annotation and analysis of the chloroplast genome of the Algarrobo tree Neltuma pallida (subfamily: Caesalpinioideae)
2023
Background
Neltuma pallida is a tree that grows in arid soils in northwestern Peru. As a predominant species of the Equatorial Dry Forest ecoregion, it holds significant economic and ecological value for both people and environment. Despite this, the species is severely threatened and there is a lack of genetic and genomic research, hindering the proposal of evidence-based conservation strategies.
Results
In this work, we conducted the assembly, annotation, analysis and comparison of the chloroplast genome of a N. pallida specimen with those of related species. The assembled chloroplast genome has a length of 162,381 bp with a typical quadripartite structure (LSC-IRA-SSC-IRB). The calculated GC content was 35.97%. However, this is variable between regions, with a higher GC content observed in the IRs. A total of 132 genes were annotated, of which 19 were duplicates and 22 contained at least one intron in their sequence. A substantial number of repetitive sequences of different types were identified in the assembled genome, predominantly tandem repeats (> 300). In particular, 142 microsatellites (SSR) markers were identified. The phylogenetic reconstruction showed that N. pallida grouped with the other Neltuma species and with Prosopis cineraria. The analysis of sequence divergence between the chloroplast genome sequences of N. pallida, N. juliflora, P. farcta and Strombocarpa tamarugo revealed a high degree of similarity.
Conclusions
The N. pallida chloroplast genome was found to be similar to those of closely related species. With a size of 162,831 bp, it had the classical chloroplast quadripartite structure and GC content of 35.97%. Most of the 132 identified genes were protein-coding genes. Additionally, over 800 repetitive sequences were identified, including 142 SSR markers. In the phylogenetic analysis, N. pallida grouped with other Neltuma spp. and P. cineraria. Furthermore, N. pallida chloroplast was highly conserved when compared with genomes of closely related species. These findings can be of great potential for further diversity studies and genetic improvement of N. pallida.
Journal Article
Information Theory in Computational Biology: Where We Stand Today
by Sukumar, Shravan; Van Hemert, John; Chanda, Pritam
in computational biology, entropy, gene expression
2020
“A Mathematical Theory of Communication” was published in 1948 by Claude Shannon to address the problems in the field of data compression and communication over (noisy) communication channels. Since then, the concepts and ideas developed in Shannon’s work have formed the basis of information theory, a cornerstone of statistical learning and inference, and have been playing a key role in disciplines such as physics and thermodynamics, probability and statistics, computational sciences and biological sciences. In this article we review the basic information theory based concepts and describe their key applications in multiple major areas of research in computational biology—gene expression and transcriptomics, alignment-free sequence comparison, sequencing and error correction, genome-wide disease-gene association mapping, metabolic networks and metabolomics, and protein sequence, structure and interaction analysis.
Journal Article
A Sequence Variation in GmBADH2 Enhances Soybean Aroma and Is a Functional Marker for Improving Soybean Flavor
2022
The vegetable soybean (Glycine max L. Merr.) plant is commonly consumed in Southeast Asian countries because of its nutritional value and desirable taste. A “pandan-like” aroma is an important value-added quality trait that is rarely found in commercial vegetable soybean varieties. In this study, three novel aromatic soybean cultivars with a fragrant volatile compound were isolated. We confirmed that the aroma of these cultivars is due to the potent volatile compound 2-acetyl-1-pyrroline (2AP) that was previously identified in soybean. A sequence comparison of GmBADH1/2 (encoding an aminoaldehyde dehydrogenase) between aromatic and non-aromatic soybean varieties revealed a mutation with 10 SNPs and an 11-nucleotide deletion in exon 1 of GmBADH2 in Quxian No. 1 and Xiangdou. Additionally, a 2-bp deletion was detected in exon 10 of GmBADH2 in ZK1754. The mutations resulted in a frame shift and the introduction of premature stop codons. Moreover, genetic analyses indicated that the aromatic trait in these three varieties was inherited according to a single recessive gene model. These results suggested that a mutated GmBADH2 may be responsible for the aroma of these three aromatic soybean cultivars. The expression and function of GmBADH2 in aromatic soybean seeds were confirmed by qRT-PCR and CRISPR/Cas9. A functional marker developed on the basis of the mutated GmBADH2 sequence in Quxian No. 1 and Xiangdou was validated in an F2 population. A perfect association between the marker genotypes and aroma phenotypes implied that GmBADH2 is a major aroma-conferring gene. The results of this study are potentially useful for an in-depth analysis of the molecular basis of 2-AP formation in soybean and the marker-assisted breeding of aromatic vegetable soybean cultivars.
Journal Article