Catalogue Search | MBRL
1,325 result(s) for "DNA sequence clustering"
Review on the Application of Machine Learning Algorithms in the Sequence Data Mining of DNA
by Han, Yang; Zhang, Limin; Yang, Aimin
in Algorithms, Artificial intelligence, Bioengineering and Biotechnology
2020
Deoxyribonucleic acid (DNA) is a biological macromolecule. Its main function is information storage. At present, the advancement of sequencing technology has caused DNA sequence data to grow at an explosive rate, which has also pushed the study of DNA sequences into the wave of big data. Moreover, machine learning is a powerful technique for analyzing large-scale data and learns spontaneously to gain knowledge. It has been widely used in DNA sequence data analysis and has produced many research achievements. Firstly, the review introduces the development process of sequencing technology and expounds on the concepts of DNA sequence data structure and sequence similarity. Then we analyze the basic process of data mining, summarize several major machine learning algorithms, and put forward the challenges faced by machine learning algorithms in the mining of biological sequence data and possible solutions in the future. Then we review four typical applications of machine learning in DNA sequence data: DNA sequence alignment, DNA sequence classification, DNA sequence clustering, and DNA pattern mining. We analyze their corresponding biological application background and significance, and systematically summarize the development and potential problems in the field of DNA sequence data mining in recent years. Finally, we summarize the content of the review and look ahead to some research directions for the next step.
Journal Article
CGRclust: Chaos Game Representation for twin contrastive clustering of unlabelled DNA sequences
by Alipour, Fatemeh; Kari, Lila; Hill, Kathleen A.
in Algorithms, Alignment, Alignment-free DNA sequence comparison
2024
Background
Traditional supervised learning methods applied to DNA sequence taxonomic classification rely on the labor-intensive and time-consuming step of labelling the primary DNA sequences. Additionally, standard DNA classification/clustering methods involve time-intensive multiple sequence alignments, which impacts their applicability to large genomic datasets or distantly related organisms. These limitations indicate a need for robust, efficient, and scalable unsupervised DNA sequence clustering methods that do not depend on sequence labels or alignment.
Results
This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences, with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. Moreover, CGRclust also consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in terms of sequence length, number of genomes, number of clusters, and level of taxonomy, demonstrates its robustness, scalability, and versatility.
Conclusion
CGRclust is a novel, scalable, alignment-free DNA sequence clustering method that uses CGR images of DNA sequences and CNNs for twin contrastive clustering of unlabelled primary DNA sequences, achieving superior or comparable accuracy and performance over current approaches. CGRclust demonstrated enhanced reliability, by consistently achieving over 80% accuracy in more than 90% of the datasets analyzed. In particular, CGRclust performed especially well in clustering viral DNA datasets, where it consistently outperformed all competing methods.
Journal Article
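For readers unfamiliar with the Chaos Game Representation mentioned in the abstract above, the following is a minimal sketch of how a DNA sequence can be turned into a small count image. It is not the CGRclust implementation: the corner assignment, the function names `cgr_points` and `cgr_image`, and the resolution parameter `k` are assumptions chosen for illustration, and the contrastive-learning and CNN stages are not shown.

```python
# Assumed corner layout of the unit square; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence):
    """Return the chaos-game trajectory of a DNA sequence.

    Start at the centre of the unit square and, for each base, move
    halfway towards that base's corner. Characters outside A/C/G/T
    (for example N) are skipped.
    """
    x, y = 0.5, 0.5
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_image(sequence, k=3):
    """Bin the trajectory into a 2^k x 2^k grid of counts.

    Every trajectory point is counted, including the first k - 1 points,
    which do not yet reflect a full k-mer.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(sequence):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for row in cgr_image("ACGTTGCAACGT", k=2):
        print(row)
```

A grid like this can be treated as a grayscale image, which is what makes image-based clustering of sequences possible without alignment.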
A Universal Probe Set for Targeted Sequencing of 353 Nuclear Genes from Any Flowering Plant Designed Using k-Medoids Clustering
by Botigué, Laura R.; Forest, Félix; Maurin, Olivier
in Angiospermae, Angiosperms, Cluster Analysis
2019
Sequencing of target-enriched libraries is an efficient and cost-effective method for obtaining DNA sequence data from hundreds of nuclear loci for phylogeny reconstruction. Much of the cost of developing targeted sequencing approaches is associated with the generation of preliminary data needed for the identification of orthologous loci for probe design. In plants, identifying orthologous loci has proven difficult due to a large number of whole-genome duplication events, especially in the angiosperms (flowering plants). We used multiple sequence alignments from over 600 angiosperms for 353 putatively single-copy protein-coding genes identified by the One Thousand Plant Transcriptomes Initiative to design a set of targeted sequencing probes for phylogenetic studies of any angiosperm group. To maximize the phylogenetic potential of the probes, while minimizing the cost of production, we introduce a k-medoids clustering approach to identify the minimum number of sequences necessary to represent each coding sequence in the final probe set. Using this method, 5–15 representative sequences were selected per orthologous locus, representing the sequence diversity of angiosperms more efficiently than if probes were designed using available sequenced genomes alone. To test our approximately 80,000 probes, we hybridized libraries from 42 species spanning all higher-order groups of angiosperms, with a focus on taxa not present in the sequence alignments used to design the probes. Out of a possible 353 coding sequences, we recovered an average of 283 per species and at least 100 in all species. Differences among taxa in sequence recovery could not be explained by relatedness to the representative taxa selected for probe design, suggesting that there is no phylogenetic bias in the probe set. 
Our probe set, which targeted 260 kbp of coding sequence, achieved a median recovery of 137 kbp per taxon in coding regions, a maximum recovery of 250 kbp, and an additional median of 212 kbp per taxon in flanking non-coding regions across all species. These results suggest that the Angiosperms353 probe set described here is effective for any group of flowering plants and would be useful for phylogenetic studies from the species level to higher-order groups, including the entire angiosperm clade itself.
Journal Article
RESCRIPt: Reproducible sequence taxonomy reference database management
by Robeson, Michael S.; Bokulich, Nicholas A.; Dillon, Matthew R.
in Animals, Biology and Life Sciences, Classification
2021
Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. 
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt .
Journal Article
Genome sequence of the progenitor of wheat A subgenome Triticum urartu
2018
Triticum urartu (diploid, AA) is the progenitor of the A subgenome of tetraploid (Triticum turgidum, AABB) and hexaploid (Triticum aestivum, AABBDD) wheat [1,2]. Genomic studies of T. urartu have been useful for investigating the structure, function and evolution of polyploid wheat genomes. Here we report the generation of a high-quality genome sequence of T. urartu by combining bacterial artificial chromosome (BAC)-by-BAC sequencing, single molecule real-time whole-genome shotgun sequencing [3], linked reads and optical mapping [4,5]. We assembled seven chromosome-scale pseudomolecules and identified protein-coding genes, and we suggest a model for the evolution of T. urartu chromosomes. Comparative analyses with genomes of other grasses showed gene loss and amplification in the numbers of transposable elements in the T. urartu genome. Population genomics analysis of 147 T. urartu accessions from across the Fertile Crescent showed clustering of three groups, with differences in altitude and biostress, such as powdery mildew disease. The T. urartu genome assembly provides a valuable resource for studying genetic variation in wheat and related grasses, and promises to facilitate the discovery of genes that could be useful for wheat improvement.
The genome sequence of Triticum urartu, the progenitor of the A subgenome of hexaploid wheat, provides insight into genome duplication during grass evolution.
Journal Article
Three-dimensional intact-tissue sequencing of single-cell transcriptional states
2018
RNA sequencing samples the entire transcriptome but lacks anatomical information. In situ hybridization, on the other hand, can only profile a small number of transcripts. In situ sequencing technologies address these shortcomings but face a challenge in dense, complex tissue environments. Wang et al. combined an efficient sequencing approach with hydrogel-tissue chemistry to develop a multidisciplinary technology for three-dimensional (3D) intact-tissue RNA sequencing (see the Perspective by Knöpfel). More than 1000 genes were simultaneously mapped in sections of mouse brain at single-cell resolution to define cell types and circuit states and to reveal cell organization principles. Science, this issue p. eaat5691; see also p. 328. Wang et al. describe the development and application of an RNA sequencing technology to define cell types and circuit states in the mouse brain. Retrieving high-content gene-expression information while retaining three-dimensional (3D) positional anatomy at cellular resolution has been difficult, limiting integrative understanding of structure and function in complex biological tissues. We developed and applied a technology for 3D intact-tissue RNA sequencing, termed STARmap (spatially-resolved transcript amplicon readout mapping), which integrates hydrogel-tissue chemistry, targeted signal amplification, and in situ sequencing. The capabilities of STARmap were tested by mapping 160 to 1020 genes simultaneously in sections of mouse brain at single-cell resolution with high efficiency, accuracy, and reproducibility. Moving to thick tissue blocks, we observed a molecularly defined gradient distribution of excitatory-neuron subtypes across cubic millimeter–scale volumes (>30,000 cells) and a short-range 3D self-clustering in many inhibitory-neuron subtypes that could be identified and described with 3D STARmap.
Journal Article
Inferring Phylogenies from RAD Sequence Data
2012
Reduced-representation genome sequencing represents a new source of data for systematics, and its potential utility in interspecific phylogeny reconstruction has not yet been explored. One approach that seems especially promising is the use of inexpensive short-read technologies (e.g., Illumina, SOLiD) to sequence restriction-site associated DNA (RAD)--the regions of the genome that flank the recognition sites of restriction enzymes. In this study, we simulated the collection of RAD sequences from sequenced genomes of different taxa (Drosophila, mammals, and yeasts) and developed a proof-of-concept workflow to test whether informative data could be extracted and used to accurately reconstruct "known" phylogenies of species within each group. The workflow consists of three basic steps: first, sequences are clustered by similarity to estimate orthology; second, clusters are filtered by taxonomic coverage; and third, they are aligned and concatenated for "total evidence" phylogenetic analysis. We evaluated the performance of clustering and filtering parameters by comparing the resulting topologies with well-supported reference trees and we were able to identify conditions under which the reference tree was inferred with high support. For Drosophila, whole genome alignments allowed us to directly evaluate which parameters most consistently recovered orthologous sequences. For the parameter ranges explored, we recovered the best results at the low ends of sequence similarity and taxonomic representation of loci; these generated the largest supermatrices with the highest proportion of missing data. Applications of the method to mammals and yeasts were less successful, which we suggest may be due partly to their much deeper evolutionary divergence times compared to Drosophila (crown ages of approximately 100 and 300 versus 60 Mya, respectively). 
RAD sequences thus appear to hold promise for reconstructing phylogenetic relationships in younger clades in which sufficient numbers of orthologous restriction sites are retained across species.
Journal Article
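The second and third steps of the workflow described in the abstract above (filtering clusters by taxonomic coverage, then concatenating them into a supermatrix) can be sketched as follows. This is an illustrative simplification, not the authors' code: it assumes each cluster is already aligned and stored as a mapping from taxon name to sequence, and the names `filter_by_coverage`, `concatenate`, and the `?` missing-data symbol are assumptions.

```python
def filter_by_coverage(clusters, min_taxa):
    """Keep only clusters (loci) that contain at least min_taxa taxa."""
    return [cluster for cluster in clusters if len(cluster) >= min_taxa]


def concatenate(clusters, taxa, missing="?"):
    """Join aligned loci into one supermatrix row per taxon.

    A taxon absent from a locus is padded with the missing-data symbol,
    so every row ends up with the same length.
    """
    matrix = {taxon: "" for taxon in taxa}
    for cluster in clusters:
        locus_length = len(next(iter(cluster.values())))
        for taxon in taxa:
            matrix[taxon] += cluster.get(taxon, missing * locus_length)
    return matrix


if __name__ == "__main__":
    loci = [
        {"sp1": "ACGT", "sp2": "ACGA", "sp3": "ACGG"},
        {"sp1": "TT", "sp2": "TA"},
        {"sp3": "GG"},
    ]
    kept = filter_by_coverage(loci, min_taxa=2)
    for taxon, row in concatenate(kept, ["sp1", "sp2", "sp3"]).items():
        print(taxon, row)
```

Lowering `min_taxa` keeps more loci but pads more cells with `?`, which mirrors the trade-off the abstract reports between supermatrix size and the proportion of missing data.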
Clustering huge protein sequence sets in linear time
2018
Metagenomic datasets contain billions of protein sequences that could greatly enhance large-scale functional annotation and structure prediction. Utilizing this enormous resource would require reducing its redundancy by similarity clustering. However, clustering hundreds of millions of sequences is impractical using current algorithms because their runtimes scale as the input set size N times the number of clusters K, which is typically of similar order as N, resulting in runtimes that increase almost quadratically with N. We developed Linclust, the first clustering algorithm whose runtime scales as N, independent of K. It can also cluster datasets several times larger than the available main memory. We cluster 1.6 billion metagenomic sequence fragments in 10 h on a single server to 50% sequence identity, >1000 times faster than has been possible before. Linclust will help to unlock the great wealth contained in metagenomic and genomic sequence databases.
Billions of metagenomic and genomic sequences fill up public datasets, which makes similarity clustering an important and time-critical analysis step. Here, the authors develop Linclust, an algorithm with linear time complexity that can cluster over a billion sequences within hours on a single server.
Journal Article
Chromatin-accessibility estimation from single-cell ATAC-seq data with scOpen
by Kramann, Rafael; Zenke, Martin; Costa, Ivan G.
in 631/114/1305, 631/114/2114, 692/4022/1585/3182
2021
A major drawback of single-cell ATAC-seq (scATAC-seq) is its sparsity, i.e., open chromatin regions with no reads due to loss of DNA material during the scATAC-seq protocol. Here, we propose scOpen, a computational method based on regularized non-negative matrix factorization for imputing and quantifying the open chromatin status of regulatory regions from sparse scATAC-seq experiments. We show that scOpen improves crucial downstream analysis steps of scATAC-seq data such as clustering, visualization, cis-regulatory DNA interactions, and delineation of regulatory features. We demonstrate the power of scOpen to dissect regulatory changes in the development of fibrosis in the kidney. This identifies a role of Runx1 and target genes by promoting fibroblast to myofibroblast differentiation driving kidney fibrosis.
scATAC-Seq yields data that is extremely sparse. Here, the authors present a computationally efficient imputation method called scOpen that improves the downstream analyses of scATAC-Seq data and use it to identify transcriptional regulators of kidney fibrosis.
Journal Article
Study of the error correction capability of multiple sequence alignment algorithm (MAFFT) in DNA storage
2023
Synchronization (insertions–deletions) errors are still a major challenge for reliable information retrieval in DNA storage. Unlike traditional error correction codes (ECC) that add redundancy in the stored information, multiple sequence alignment (MSA) solves this problem by searching the conserved subsequences. In this paper, we conduct a comprehensive simulation study on the error correction capability of a typical MSA algorithm, MAFFT. Our results reveal that its capability exhibits a phase transition when there are around 20% errors. Below this critical value, increasing sequencing depth can eventually allow it to approach complete recovery. Otherwise, its performance plateaus at some poor levels. Given a reasonable sequencing depth (≤ 70), MSA could achieve complete recovery in the low error regime, and effectively correct 90% of the errors in the medium error regime. In addition, MSA is robust to imperfect clustering. It could also be combined with other means such as ECC, repeated markers, or any other code constraints. Furthermore, by selecting an appropriate sequencing depth, this strategy could achieve an optimal trade-off between cost and reading speed. MSA could be a competitive alternative for future DNA storage.
Journal Article
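To make concrete how aligned reads can correct errors in DNA storage, as discussed in the abstract above, here is a minimal sketch of a column-wise majority vote over reads that have already been aligned. It does not call MAFFT and is not the study's method: the input is assumed to be a gapped alignment produced elsewhere, and the function name `consensus` and the `-` gap symbol are assumptions.

```python
from collections import Counter


def consensus(aligned_reads, gap="-"):
    """Majority-vote consensus over the columns of a gapped alignment.

    Columns where the gap symbol wins the vote are treated as spurious
    insertions and dropped from the result.
    """
    recovered = []
    for column in zip(*aligned_reads):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap:
            recovered.append(symbol)
    return "".join(recovered)


if __name__ == "__main__":
    reads = [
        "ACGT-A",  # one deletion
        "ACGTTA",  # error-free
        "AC-TTA",  # one deletion
    ]
    print(consensus(reads))
```

With more reads per column (higher sequencing depth), an individual insertion or deletion is more likely to be outvoted, which is consistent with the abstract's observation that depth helps in the low-error regime.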