Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework

by Al-Ars, Zaid; Hofstee, H. Peter; Ahmad, Tanveer; Ahmed, Nauman

in Algorithms / Animal Genetics and Genomics / Apache Arrow / Big Data / Biomedical and Life Sciences / Computation / Data processing / Datasets / Deoxyribonucleic acid / DNA / Dynamic cell / Dynamic random access memory / Electronic data processing / Format / GATK Best Practices / Gene sequencing / Genomes / Genomics / High-Throughput Nucleotide Sequencing / In-Memory Data / Latency / Life Sciences / Methods / Microarrays / Microbial Genetics and Genomics / Microprocessors / Mutation / Next-generation sequencing / Nucleotide sequence / Parallel processing / Plant Genetics and Genomics / Programming languages / Proteomics / Representations / Resource utilization / Software / Tables (data) / Technology application / UNIX / Volatility / Whole Genome Sequencing / Whole Genome/Exome Sequencing / Workflow

2020

Journal Article

Overview

Background: Immense improvements in sequencing technologies have made it possible to produce large amounts of high-throughput, cost-effective next-generation sequencing (NGS) data. This data needs to be processed efficiently for further downstream analyses. For fast and efficient processing, computing systems need these large amounts of data close to the processor, with low latency. However, existing workflows depend heavily on disk storage and access to process this data, which incurs huge disk I/O overheads. Previously, due to the cost, volatility and other physical constraints of DRAM, it was not feasible to place large working data sets in memory. Recent developments in storage-class memory and non-volatile memory technologies have enabled computing systems to place huge amounts of data in memory and process it there directly, avoiding disk I/O bottlenecks. To exploit the benefits of such memory systems efficiently, data must be placed in memory in a suitable format and accessed with high throughput, avoiding (de)-serialization and copy overheads between processes. For this purpose, we use the newly developed Apache Arrow, a cross-language development framework that provides a language-independent columnar in-memory data format for efficient in-memory big data analytics. This allows genomics applications developed in different programming languages to communicate in memory without having to access disk storage, avoiding (de)-serialization and copy overheads.

Implementation: We integrate an Apache Arrow based in-memory Sequence Alignment/Map (SAM) format and its shared-memory object store library into widely used genomics high-throughput data processing applications such as BWA-MEM, Picard and GATK to allow in-memory communication between these applications. In addition, this allows us to exploit the cache locality of tabular data and parallel processing capabilities through shared memory objects.

Results: Our implementation shows that adopting an in-memory SAM representation in genomics high-throughput data processing applications results in better system resource utilization, a low number of memory accesses due to high cache locality exploitation, and parallel scalability due to shared memory objects. Our implementation focuses on the GATK best practices recommended workflows for germline analysis on whole genome sequencing (WGS) and whole exome sequencing (WES) data sets. We compare against a number of existing in-memory data placement and sharing techniques, namely ramDisk and Unix pipes, to show how the columnar in-memory data representation outperforms both. We achieve speedups of 4.85x and 4.76x for WGS and WES data, respectively, in the overall execution time of variant calling workflows. Similarly, speedups of 1.45x and 1.27x for these data sets, respectively, are achieved compared to the second fastest workflow. In some individual tools, particularly in sorting, duplicates removal and base quality score recalibration, the speedup is even more promising.

Availability: The code and scripts used in our experiments are available in both container and repository form at: https://github.com/abs-tudelft/ArrowSAM.
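To make the idea of a columnar in-memory SAM representation more concrete, the following is a minimal sketch in plain Python. It is not the authors' ArrowSAM code and does not use the Apache Arrow API; the toy records, the choice of five SAM fields, and the helper names (to_columns, coordinate_sort_order, count_flagged) are all assumptions made for illustration. It only shows why storing each SAM field as its own contiguous column lets steps such as coordinate sorting or duplicate counting touch just the columns they need.

```python
from array import array

# Toy alignment records as rows: (QNAME, FLAG, RNAME, POS, MAPQ).
# The values are invented for illustration only.
rows = [
    ("read3", 0, "chr1", 300, 60),
    ("read1", 16, "chr1", 100, 42),
    ("read2", 0, "chr2", 50, 60),
    ("read4", 1024, "chr1", 100, 60),  # 0x400 = marked as duplicate
]


def to_columns(records):
    """Transpose row records into one column per SAM field (hypothetical helper)."""
    qname, flag, rname, pos, mapq = zip(*records)
    return {
        "QNAME": list(qname),
        "FLAG": array("H", flag),  # unsigned short, contiguous buffer
        "RNAME": list(rname),
        "POS": array("I", pos),    # unsigned int, contiguous buffer
        "MAPQ": array("B", mapq),  # unsigned char, contiguous buffer
    }


def coordinate_sort_order(cols):
    """Return row indices ordered by (RNAME, POS); only two columns are read."""
    return sorted(
        range(len(cols["POS"])),
        key=lambda i: (cols["RNAME"][i], cols["POS"][i]),
    )


def count_flagged(cols, bit):
    """Count records whose FLAG has the given bit set; only FLAG is scanned."""
    return sum(1 for f in cols["FLAG"] if f & bit)


cols = to_columns(rows)
order = coordinate_sort_order(cols)
sorted_names = [cols["QNAME"][i] for i in order]
duplicates = count_flagged(cols, 0x400)

# A memoryview exposes the POS buffer without copying it, loosely analogous
# to handing a column to another consumer without (de)-serialization.
pos_view = memoryview(cols["POS"])

print(sorted_names)
print(duplicates)
```

In a real Arrow-based pipeline the columns would live in Arrow buffers inside a shared-memory object store so that separate processes could read them in place; the sketch above stays within a single process and only mirrors the data layout.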

Publisher

BioMed Central, BioMed Central Ltd, Springer Nature B.V, BMC

Subject

Algorithms

/ Animal Genetics and Genomics

/ Apache Arrow

/ Big Data

/ Biomedical and Life Sciences

/ Computation

/ Data processing