Catalogue Search | MBRL
Search Results Heading
Explore the vast range of titles available.
MBRLSearchResults
-
DisciplineDiscipline
-
Is Peer ReviewedIs Peer Reviewed
-
Item TypeItem Type
-
SubjectSubject
-
YearFrom:-To:
-
More FiltersMore FiltersSourceLanguage
Done
Filters
Reset
65
result(s) for
"Schmidt, Bertil"
Sort by:
CAREx: context-aware read extension of paired-end sequencing data
2024
Background
Commonly used next generation sequencing machines typically produce large amounts of short reads of a few hundred base-pairs in length. However, many downstream applications would generally benefit from longer reads.
Results
We present CAREx—an algorithm for the generation of pseudo-long reads from paired-end short-read Illumina data based on the concept of repeatedly computing multiple-sequence-alignments to extend a read until its partner is found. Our performance evaluation on both simulated data and real data shows that CAREx is able to connect significantly more read pairs (up to
99
%
for simulated data) and to produce more error-free pseudo-long reads than previous approaches. When used prior to assembly it can achieve superior de novo assembly results. Furthermore, the GPU-accelerated version of CAREx exhibits the fastest execution times among all tested tools.
Conclusion
CAREx is a new MSA-based algorithm and software for producing pseudo-long reads from paired-end short read data. It outperforms other state-of-the-art programs in terms of (i) percentage of connected read pairs, (ii) reduction of error rates of filled gaps, (iii) runtime, and (iv) downstream analysis using de novo assembly. CAREx is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at (
https://github.com/fkallen/CAREx
).
Journal Article
CUDASW++4.0: ultra-fast GPU-based Smith–Waterman protein sequence database search
by
Kallenborn, Felix
,
Hundt, Christian
,
Chacon, Alejandro
in
Algorithms
,
Alignment
,
Amino acid sequence
2024
Background
The maximal sensitivity for local pairwise alignment makes the Smith-Waterman algorithm a popular choice for protein sequence database search. However, its quadratic time complexity makes it compute-intensive. Unfortunately, current state-of-the-art software tools are not able to leverage the massively parallel processing capabilities of modern GPUs with close-to-peak performance. This motivates the need for more efficient implementations.
Results
CUDASW++4.0 is a fast software tool for scanning protein sequence databases with the Smith-Waterman algorithm on CUDA-enabled GPUs. Our approach achieves high efficiency for dynamic programming-based alignment computation by minimizing memory accesses and instructions. We provide both efficient matrix tiling, and sequence database partitioning schemes, and exploit next generation floating point arithmetic and novel DPX instructions. This leads to close-to-peak performance on modern GPU generations (Ampere, Ada, Hopper) with throughput rates of up to 1.94 TCUPS, 5.01 TCUPS, 5.71 TCUPS on an A100, L40S, and H100, respectively. Evaluation on the Swiss-Prot, UniRef50, and TrEMBL databases shows that CUDASW++4.0 gains over an order-of-magnitude performance improvements over previous GPU-based approaches (CUDASW++3.0, ADEPT, SW#DB). In addition, our algorithm demonstrates significant speedups over top-performing CPU-based tools (BLASTP, SWIPE, SWIMM2.0), can exploit multi-GPU nodes with linear scaling, and features an impressive energy efficiency of up to 15.7 GCUPS/Watt.
Conclusion
CUDASW++4.0 changes the standing of GPUs in protein sequence database search with Smith-Waterman alignment by providing close-to-peak performance on modern GPUs. It is freely available at
https://github.com/asbschmidt/CUDASW4
.
Journal Article
CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions
2013
Background
The maximal sensitivity for local alignments makes the Smith-Waterman algorithm a popular choice for protein sequence database search based on pairwise alignment. However, the algorithm is compute-intensive due to a quadratic time complexity. Corresponding runtimes are further compounded by the rapid growth of sequence databases.
Results
We present CUDASW++ 3.0, a fast Smith-Waterman protein database search algorithm, which couples CPU and GPU SIMD instructions and carries out concurrent CPU and GPU computations. For the CPU computation, this algorithm employs SSE-based vector execution units as accelerators. For the GPU computation, we have investigated for the first time a GPU SIMD parallelization, which employs CUDA PTX SIMD video instructions to gain more data parallelism beyond the SIMT execution model. Moreover, sequence alignment workloads are automatically distributed over CPUs and GPUs based on their respective compute capabilities. Evaluation on the Swiss-Prot database shows that CUDASW++ 3.0 gains a performance improvement over CUDASW++ 2.0 up to 2.9 and 3.2, with a maximum performance of 119.0 and 185.6 GCUPS, on a single-GPU GeForce GTX 680 and a dual-GPU GeForce GTX 690 graphics card, respectively. In addition, our algorithm has demonstrated significant speedups over other top-performing tools: SWIPE and BLAST+.
Conclusions
CUDASW++ 3.0 is written in CUDA C++ and PTX assembly languages, targeting GPUs based on the Kepler architecture. This algorithm obtains significant speedups over its predecessor: CUDASW++ 2.0, by benefiting from the use of CPU and GPU SIMD instructions as well as the concurrent execution on CPUs and GPUs. The source code and the simulated data are available at
http://cudasw.sourceforge.net
.
Journal Article
CARE 2.0: reducing false-positive sequencing error corrections using machine learning
2022
Background
Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have negative impact on downstream analysis, such as
k
-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools.
Results
We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations its most significant change is the replacement of CARE 1.0’s hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders-of-magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 is able to achieve high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads CARE 2.0 generates only 1.2M false positives (FPs) (and 801.4M true positives (TPs)) at a highly competitive runtime while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved
k
-mer analysis show the applicability of CARE 2.0 to real-world data.
Conclusion
False-positive corrections can negatively influence down-stream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. Thus, higher-quality datasets are produced which improve
k
-mer analysis and de-novo assembly in real-world datasets which demonstrates the applicability of machine learning techniques in the context of sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at
https://github.com/fkallen/CARE
.
Journal Article
CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units
by
Maskell, Douglas L
,
Schmidt, Bertil
,
Liu, Yongchao
in
Algorithms
,
Analysis
,
Biomedical and Life Sciences
2009
Background
The Smith-Waterman algorithm is one of the most widely used tools for searching biological sequence databases due to its high sensitivity. Unfortunately, the Smith-Waterman algorithm is computationally demanding, which is further compounded by the exponential growth of sequence databases. The recent emergence of many-core architectures, and their associated programming interfaces, provides an opportunity to accelerate sequence database searches using commonly available and inexpensive hardware.
Findings
Our CUDASW++ implementation (benchmarked on a single-GPU NVIDIA GeForce GTX 280 graphics card and a dual-GPU GeForce GTX 295 graphics card) provides a significant performance improvement compared to other publicly available implementations, such as SWPS3, CBESW, SW-CUDA, and NCBI-BLAST. CUDASW++ supports query sequences of length up to 59K and for query sequences ranging in length from 144 to 5,478 in Swiss-Prot release 56.6, the single-GPU version achieves an average performance of 9.509 GCUPS with a lowest performance of 9.039 GCUPS and a highest performance of 9.660 GCUPS, and the dual-GPU version achieves an average performance of 14.484 GCUPS with a lowest performance of 10.660 GCUPS and a highest performance of 16.087 GCUPS.
Conclusion
CUDASW++ is publicly available open-source software. It provides a significant performance improvement for Smith-Waterman-based protein sequence database searches by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs.
Journal Article
RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches
by
Schmidt, Bertil
,
Wei, Yanjie
,
Xu, Xiaoming
in
Algorithms
,
Animal Genetics and Genomics
,
Bacteria
2023
We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. 113,674 complete bacterial genome sequences from RefSeq, 455 GB in FASTA format, can be clustered within less than 6 min and 1,009,738 GenBank assembled bacterial genomes, 4.0 TB in FASTA format, within only 34 min on a 128-core workstation. Our results further identify 1269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database.
Journal Article
CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions
by
Maskell, Douglas L
,
Schmidt, Bertil
,
Liu, Yongchao
in
Algorithms
,
Analysis
,
Biomedical and Life Sciences
2010
Background
Due to its high sensitivity, the Smith-Waterman algorithm is widely used for biological database searches. Unfortunately, the quadratic time complexity of this algorithm makes it highly time-consuming. The exponential growth of biological databases further deteriorates the situation. To accelerate this algorithm, many efforts have been made to develop techniques in high performance architectures, especially the recently emerging many-core architectures and their associated programming models.
Findings
This paper describes the latest release of the CUDASW++ software, CUDASW++ 2.0, which makes new contributions to Smith-Waterman protein database searches using compute unified device architecture (CUDA). A parallel Smith-Waterman algorithm is proposed to further optimize the performance of CUDASW++ 1.0 based on the single instruction, multiple thread (SIMT) abstraction. For the first time, we have investigated a partitioned vectorized Smith-Waterman algorithm using CUDA based on the virtualized single instruction, multiple data (SIMD) abstraction. The optimized SIMT and the partitioned vectorized algorithms were benchmarked, and remarkably, have similar performance characteristics. CUDASW++ 2.0 achieves performance improvement over CUDASW++ 1.0 as much as 1.74 (1.72) times using the optimized SIMT algorithm and up to 1.77 (1.66) times using the partitioned vectorized algorithm, with a performance of up to 17 (30) billion cells update per second (GCUPS) on a single-GPU GeForce GTX 280 (dual-GPU GeForce GTX 295) graphics card.
Conclusions
CUDASW++ 2.0 is publicly available open-source software, written in CUDA and C++ programming languages. It obtains significant performance improvement over CUDASW++ 1.0 using either the optimized SIMT algorithm or the partitioned vectorized algorithm for Smith-Waterman protein database searches by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs.
Journal Article
RainDrop: Rapid activation matrix computation for droplet-based single-cell RNA-seq reads
2020
Background
Obtaining data from single-cell transcriptomic sequencing allows for the investigation of cell-specific gene expression patterns, which could not be addressed a few years ago. With the advancement of droplet-based protocols the number of studied cells continues to increase rapidly. This establishes the need for software tools for efficient processing of the produced large-scale datasets. We address this need by presenting RainDrop for fast gene-cell count matrix computation from single-cell RNA-seq data produced by 10x Genomics Chromium technology.
Results
RainDrop can process single-cell transcriptomic datasets consisting of 784 million reads sequenced from around 8.000 cells in less than 40 minutes on a standard workstation. It significantly outperforms the established Cell Ranger pipeline and the recently introduced Alevin tool in terms of runtime by a maximal (average) speedup of 30.4 (22.6) and 3.5 (2.4), respectively, while keeping high agreements of the generated results.
Conclusions
RainDrop is a software tool for highly efficient processing of large-scale droplet-based single-cell RNA-seq datasets on standard workstations written in C++. It is available at
https://gitlab.rlp.net/stnieble/raindrop
.
Journal Article
CUSHAW3: Sensitive and Accurate Base-Space and Color-Space Short-Read Alignment with Hybrid Seeding
by
Popp, Bernt
,
Schmidt, Bertil
,
Liu, Yongchao
in
Algorithms
,
Alignment
,
Base Sequence - genetics
2014
The majority of next-generation sequencing short-reads can be properly aligned by leading aligners at high speed. However, the alignment quality can still be further improved, since usually not all reads can be correctly aligned to large genomes, such as the human genome, even for simulated data. Moreover, even slight improvements in this area are important but challenging, and usually require significantly more computational endeavor. In this paper, we present CUSHAW3, an open-source parallelized, sensitive and accurate short-read aligner for both base-space and color-space sequences. In this aligner, we have investigated a hybrid seeding approach to improve alignment quality, which incorporates three different seed types, i.e. maximal exact match seeds, exact-match k-mer seeds and variable-length seeds, into the alignment pipeline. Furthermore, three techniques: weighted seed-pairing heuristic, paired-end alignment pair ranking and read mate rescuing have been conceived to facilitate accurate paired-end alignment. For base-space alignment, we have compared CUSHAW3 to Novoalign, CUSHAW2, BWA-MEM, Bowtie2 and GEM, by aligning both simulated and real reads to the human genome. The results show that CUSHAW3 consistently outperforms CUSHAW2, BWA-MEM, Bowtie2 and GEM in terms of single-end and paired-end alignment. Furthermore, our aligner has demonstrated better paired-end alignment performance than Novoalign for short-reads with high error rates. For color-space alignment, CUSHAW3 is consistently one of the best aligners compared to SHRiMP2 and BFAST. The source code of CUSHAW3 and all simulated data are available at http://cushaw3.sourceforge.net.
Journal Article
A big data approach to metagenomics for all-food-sequencing
2020
Background
All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing amount of available reference genomes establishes the need for efficient big data approaches.
Results
We introduce an alignment-free
k
-mer based method for detection and quantification of species composition in food and other complex biological matters. It is orders-of-magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken it is superior in terms of false-positive rate and quantification accuracy. Furthermore, the usage of an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark).
Conclusions
We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at
https://muellan.github.io/metacache/afs.html
(C++ version for a workstation) and
https://github.com/jmabuin/MetaCacheSpark
(Spark version for big data clusters).
Journal Article