23 result(s) for "Linderman, Michael D"
Computational solutions to large-scale data management and analysis
Key Points
• Biological research is becoming ever more information-driven, with individual laboratories now capable of generating terabytes of data in days.
• Supercomputing resources will be increasingly needed to get the most from the big data sets that researchers generate or analyse.
• The big data revolution in biology is matched by a revolution in high-performance computing that is making supercomputing resources available to anyone with an internet connection.
• Large-scale data analysis poses a number of challenges, including data transfer (bringing the data and computational resources together), controlling access to the data, managing the data, standardizing data formats and integrating data of multiple different types to accurately model biological systems.
• New computational solutions that are readily available to all can aid in addressing these challenges, including cloud-based computing and high-speed, low-cost heterogeneous computational environments.
• Taking advantage of these resources requires a thorough understanding of the data and the computational problem. Knowing the parallelism of the analysis algorithms enables a more efficient solution by distributing tasks over many computer processors.
• Parallelism falls into two broad categories: loosely coupled (coarse-grained) and tightly coupled (fine-grained), each benefiting from different types of computational platforms depending on the problem of interest.
• Clusters of computers can be optimized for many different classes of computationally intense applications, such as sequence alignment, genome-wide association tests and reconstruction of Bayesian networks. Cloud computing makes cluster-based computing more accessible and affordable for all.
• The distributed computing paradigm MapReduce was designed for cloud-based computing to solve problems with loosely coupled parallelism, such as mapping raw DNA sequence reads to a reference genome.
• Cloud computing provides a highly flexible, low-cost computational environment, but its costs include sacrificing control of the underlying hardware and requiring that big data sets be transferred into the cloud for processing.
• Heterogeneous multi-core computational systems, such as graphics processing units (GPUs), complement cloud-based computing and operate as low-cost, specialized accelerators that can increase peak arithmetic throughput by 10-fold to 100-fold. These systems are specifically tuned to efficiently solve problems involving massive tightly coupled parallelism.
• Heterogeneous computing provides a low-cost, flexible computational environment that improves performance and efficiency by exposing architectural features to programmers. However, programming applications to run in these environments requires significant informatics expertise.
• Cloud providers such as Microsoft make advanced cloud computing resources freely available to individual researchers through a competitive, peer-reviewed granting process. Other providers, such as Amazon, offer advanced cloud storage and computational resources via an intuitive, simple web interface. Users of Amazon Web Services can now not only upload big data sets and analysis tools to Amazon S3 but also solve problems using MapReduce via a point-and-click interface.

This Review describes the different types of computational environments, such as cloud and heterogeneous computing, that are increasingly being used by life scientists to manage and analyse large multidimensional data sets. Today we can generate hundreds of gigabases of DNA and RNA sequencing data in a week for less than US$5,000. The astonishing rate of data generation by these low-cost, high-throughput technologies in genomics is being matched by that of other technologies, such as real-time imaging and mass spectrometry-based flow cytometry. Success in the life sciences will depend on our ability to properly interpret the large-scale, high-dimensional data sets that these technologies generate, which in turn requires us to adopt advances in informatics. Here we discuss how we can master the different types of computational environments that exist, such as cloud and heterogeneous computing, to successfully tackle our big data problems.
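The loosely coupled, MapReduce-style parallelism described above can be illustrated with a toy example: independent map tasks count k-mers in each sequencing read, and a reduce step merges the partial tallies. This is a minimal, hypothetical sketch in Python (the plain `map` call stands in for work that a cluster or cloud service would distribute across many workers):

```python
from collections import Counter
from functools import reduce

def count_kmers(read, k=2):
    # Map step: each read is processed independently (loosely coupled),
    # so these tasks could run on separate cluster or cloud workers.
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))

def merge_counts(a, b):
    # Reduce step: merge two partial k-mer tallies.
    return a + b

def kmer_counts(reads, k=2):
    partials = map(lambda r: count_kmers(r, k), reads)
    return reduce(merge_counts, partials, Counter())

counts = kmer_counts(["ACGTAC", "ACGT"])
```

Because the map tasks share no state, scaling out is limited mainly by data transfer, which is exactly the cloud-computing trade-off the Review highlights.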
Motivations, concerns and preferences of personal genome sequencing research participants: Baseline findings from the HealthSeq project
Whole exome/genome sequencing (WES/WGS) is increasingly offered to ostensibly healthy individuals. Understanding the motivations and concerns of research participants seeking out personal WGS and their preferences regarding return-of-results and data sharing will help optimize protocols for WES/WGS. Baseline interviews including both qualitative and quantitative components were conducted with research participants (n=35) in the HealthSeq project, a longitudinal cohort study of individuals receiving personal WGS results. Data sharing preferences were recorded during informed consent. In the qualitative interview component, the dominant motivations that emerged were obtaining personal disease risk information, satisfying curiosity, contributing to research, self-exploration and interest in ancestry, and the dominant concern was the potential psychological impact of the results. In the quantitative component, 57% endorsed concerns about privacy. Most wanted to receive all personal WGS results (94%) and their raw data (89%); a third (37%) consented to having their data shared to the Database of Genotypes and Phenotypes (dbGaP). Early adopters of personal WGS in the HealthSeq project express a variety of health- and non-health-related motivations. Almost all want all available findings, while also expressing concerns about the psychological impact and privacy of their results.
DECA: scalable XHMM exome copy-number variant calling with ADAM and Apache Spark
Background XHMM is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing data but can require hours to days to run for large cohorts. A more scalable implementation would reduce the need for specialized computational resources and enable increased exploration of the configuration parameter space to obtain the best possible results. Results DECA is a horizontally scalable implementation of the XHMM algorithm using the ADAM framework and Apache Spark that incorporates novel algorithmic optimizations to eliminate unneeded computation. DECA parallelizes XHMM on both multi-core shared memory computers and large shared-nothing Spark clusters. We performed CNV discovery from the read-depth matrix in 2535 exomes in 9.3 min on a 16-core workstation (35.3× speedup vs. XHMM), 12.7 min using 10 executor cores on a Spark cluster (18.8× speedup vs. XHMM), and 9.8 min using 32 executor cores on Amazon AWS’ Elastic MapReduce. We performed CNV discovery from the original BAM files in 292 min using 640 executor cores on a Spark cluster. Conclusions We describe DECA’s performance, our algorithmic and implementation enhancements to XHMM to obtain that performance, and our lessons learned porting a complex genome analysis application to ADAM and Spark. ADAM and Apache Spark are a performant and productive platform for implementing large-scale genome analyses, but efficiently utilizing large clusters can require algorithmic optimizations and careful attention to Spark’s configuration parameters.
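For context on the algorithm DECA reimplements: XHMM's final step decodes a hidden Markov model over normalized read depth to segment each sample into copy-number states, and each sample can be decoded independently, which is what makes the problem amenable to Spark-style parallelism. A generic Viterbi decoder sketch follows (toy two-state parameters for illustration only, not XHMM's actual deletion/diploid/duplication model):

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    # obs_loglik[t][s]: log-likelihood of observation t under state s.
    # log_trans[p][s]: log transition probability from state p to s.
    # Returns the most likely state sequence (dynamic programming).
    n_states = len(log_init)
    score = [log_init[s] + obs_loglik[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(obs_loglik)):
        prev, ptr, score = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda p: prev[p] + log_trans[p][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s] + obs_loglik[t][s])
        back.append(ptr)
    path = [max(range(n_states), key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])  # Follow back-pointers to the start.
    return path[::-1]

# Toy decode: observations strongly favor state 0 at both positions.
L = math.log(0.5)
path = viterbi([[0.0, -5.0], [0.0, -5.0]], [[L, L], [L, L]], [L, L])
```

Running one such decode per exome is embarrassingly parallel, so distributing samples across Spark executors (as DECA does) scales close to linearly until I/O dominates.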
Predispositional genome sequencing in healthy adults: design, participant characteristics, and early outcomes of the PeopleSeq Consortium
Background Increasing numbers of healthy individuals are undergoing predispositional personal genome sequencing. Here we describe the design and early outcomes of the PeopleSeq Consortium, a multi-cohort collaboration of predispositional genome sequencing projects, which is examining the medical, behavioral, and economic outcomes of returning genomic sequencing information to healthy individuals. Methods Apparently healthy adults who participated in four of the sequencing projects in the Consortium were included. Web-based surveys were administered before and after genomic results disclosure, or in some cases only after results disclosure. Surveys inquired about sociodemographic characteristics, motivations and concerns, behavioral and medical responses to sequencing results, and perceived utility. Results Among 1395 eligible individuals, 658 enrolled in the Consortium when contacted and 543 have completed a survey after receiving their genomic results thus far (mean age 53.0 years, 61.4% male, 91.7% white, 95.5% college graduates). Most participants (98.1%) were motivated to undergo sequencing because of curiosity about their genetic make-up. The most commonly reported concerns prior to pursuing sequencing included how well the results would predict future risk (59.2%) and the complexity of genetic variant interpretation (56.8%), while 47.8% of participants were concerned about the privacy of their genetic information. Half of participants reported discussing their genomic results with a healthcare provider during a median of 8.0 months after receiving the results; 13.5% reported making an additional appointment with a healthcare provider specifically because of their results. Few participants (< 10%) reported making changes to their diet, exercise habits, or insurance coverage because of their results. Many participants (39.5%) reported learning something new to improve their health that they did not know before. 
Reporting regret or harm from the decision to undergo sequencing was rare (< 3.0%). Conclusions Healthy individuals who underwent predispositional sequencing expressed some concern around privacy prior to pursuing sequencing, but were enthusiastic about their experience and not distressed by their results. While reporting value in their health-related results, few participants reported making medical or lifestyle changes.
Analytical validation of whole exome and whole genome sequencing for clinical applications
Background Whole exome and genome sequencing (WES/WGS) is now routinely offered as a clinical test by a growing number of laboratories. As part of the test design process, each laboratory must determine the performance characteristics of the platform, test and informatics pipeline. This report documents one such characterization of WES/WGS. Methods Whole exome and whole genome sequencing was performed on multiple technical replicates of five reference samples using the Illumina HiSeq 2000/2500. The sequencing data were processed with a GATK-based genome analysis pipeline to evaluate: intra-run, inter-run, inter-mode, inter-machine and inter-library consistency; concordance with orthogonal technologies (microarray, Sanger); and sensitivity and accuracy relative to known variant sets. Results Concordance with high-density microarrays consistently exceeds 97% (and typically exceeds 99%), and concordance between sequencing replicates also exceeds 97%, with no observable differences between flow cells, runs, machines or modes. Sensitivity relative to high-density microarray variants exceeds 95%. In a detailed study of a 129 kb region, sensitivity was lower, with some validated single-base insertions and deletions “not called”. Different variants are “not called” in each replicate: of all variants identified in WES data from the NA12878 reference sample, 74% of indels and 89% of SNVs were called in all seven replicates; in NA12878 WGS, 52% of indels and 88% of SNVs were called in all six replicates. Key sources of non-uniformity are variance in depth of coverage and artifactual variants resulting from repetitive regions and larger structural variants. Conclusion We report a comprehensive performance characterization of WES/WGS that will be relevant to offering laboratories, consumers of genome sequencing and others interested in the analytical validity of this technology.
MySeq: privacy-protecting browser-based personal Genome analysis for genomics education and exploration
Background The complexity of genome informatics is a recurring challenge for genome exploration and analysis by students and other non-experts. This complexity creates a barrier to wider implementation of experiential genomics education, even in settings with substantial computational resources and expertise. Reducing the need for specialized software tools will increase access to hands-on genomics pedagogy. Results MySeq is a React.js single-page web application for privacy-protecting interactive personal genome analysis. All analyses are performed entirely in the user’s web browser, eliminating the need to install and use specialized software tools or to upload sensitive data to an external web service. MySeq leverages Tabix indexing to efficiently query whole-genome-scale variant call format (VCF) files stored locally or available remotely via HTTP(S) without loading the entire file. MySeq currently implements variant querying and annotation, physical trait prediction, and pharmacogenomic, polygenic disease risk and ancestry analyses to provide representative pedagogical examples, and can be readily extended with new analysis or visualization components. Conclusions MySeq supports multiple pedagogical approaches, including independent exploration and interactive online tutorials. MySeq has been successfully employed in an undergraduate human genome analysis course, where it reduced the barriers to entry for hands-on human genome analysis.
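The benefit of the Tabix-style coordinate indexing that MySeq relies on can be sketched in miniature: with variant records sorted by genomic position, a binary search locates the requested region without scanning (or loading) the whole file. The following is an illustrative Python stand-in, not MySeq's actual browser-based implementation:

```python
import bisect

def query_region(positions, records, start, end):
    """Return the records whose (sorted) positions fall in [start, end].

    Toy stand-in for an indexed region query: bisect finds the
    overlapping slice in O(log n) rather than scanning every record,
    which is the same idea behind Tabix's index over a sorted VCF.
    """
    lo = bisect.bisect_left(positions, start)
    hi = bisect.bisect_right(positions, end)
    return records[lo:hi]
```

In the real format the index additionally maps genomic bins to byte offsets in the compressed file, so only the matching blocks need to be fetched over HTTP.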
Development and Validation of a Comprehensive Genomics Knowledge Scale
Background: Genomic testing is increasingly employed in clinical, research, educational, and commercial contexts. Genomic literacy is a prerequisite for the effective application of genomic testing, creating a corresponding need for validated tools to assess genomics knowledge. We sought to develop a reliable measure of genomics knowledge that incorporates modern genomic technologies and is informative for individuals with diverse backgrounds, including those with clinical/life sciences training. Methods: We developed the GKnowM Genomics Knowledge Scale to assess the knowledge needed to make an informed decision for genomic testing, appropriately apply genomic technologies and participate in civic decision-making. We administered the 30-item draft measure to a calibration cohort (n = 1,234) and subsequent participants to create a combined validation cohort (n = 2,405). We performed a multistage psychometric calibration and validation using classical test theory and item response theory (IRT) and conducted a post-hoc simulation study to evaluate the suitability of a computerized adaptive testing (CAT) implementation. Results: Based on exploratory factor analysis, we removed 4 of the 30 draft items. The resulting 26-item GKnowM measure has a single dominant factor. The scale internal consistency is α = 0.85, and the IRT 3-PL model demonstrated good overall and item fit. Validity is demonstrated with significant correlation (r = 0.61) with an existing genomics knowledge measure and significantly higher scores for individuals with adequate health literacy and healthcare providers (HCPs), including HCPs who work with genomic testing. The item bank is well suited to CAT, achieving high accuracy (r = 0.97 with the full measure) while administering a mean of 13.5 items. 
Conclusion: GKnowM is an updated, broadly relevant, rigorously validated 26-item measure for assessing genomics knowledge that we anticipate will be useful for assessing population genomic literacy and evaluating the effectiveness of genomics educational interventions.
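For context on the IRT model used above: under the three-parameter logistic (3-PL) model, the probability that an examinee of ability theta answers an item correctly is governed by the item's discrimination a, difficulty b, and pseudo-guessing floor c. A minimal sketch (the parameter values below are illustrative, not the fitted GKnowM item parameters):

```python
import math

def three_pl(theta, a, b, c):
    # 3-PL item response function:
    #   P(correct) = c + (1 - c) / (1 + exp(-a * (theta - b)))
    # theta: ability; a: discrimination; b: difficulty;
    # c: lower asymptote (probability of guessing correctly).
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

Computerized adaptive testing exploits this curve: after each response, the next item is chosen to be most informative near the current ability estimate, which is why the CAT simulation can match the full measure while administering far fewer items.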
Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE
New instruments can measure the presence of >30 molecular markers for massive numbers of single cells, but data analysis algorithms have lagged behind. Qiu et al. describe an approach called SPADE for recovering cellular hierarchies from mass or flow cytometry data. The ability to analyze multiple single-cell parameters is critical for understanding cellular heterogeneity. Despite recent advances in measurement technology, methods for analyzing high-dimensional single-cell data are often subjective, labor intensive and require prior knowledge of the biological system. To objectively uncover cellular heterogeneity from single-cell measurements, we present a versatile computational approach, spanning-tree progression analysis of density-normalized events (SPADE). We applied SPADE to flow cytometry data of mouse bone marrow and to mass cytometry data of human bone marrow. In both cases, SPADE organized cells in a hierarchy of related phenotypes that partially recapitulated well-described patterns of hematopoiesis. We demonstrate that SPADE is robust to measurement noise and to the choice of cellular markers. SPADE facilitates the analysis of cellular heterogeneity, the identification of cell types and comparison of functional markers in response to perturbations.
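At the heart of SPADE is the construction of a minimum spanning tree over cell clusters in marker space, which yields the branching hierarchy described above. A toy Prim's-algorithm sketch follows (real SPADE first applies density-dependent downsampling and agglomerative clustering; the points here stand in for hypothetical cluster centroids):

```python
def minimum_spanning_tree(points):
    # Prim's algorithm over squared Euclidean distances between points.
    # Returns tree edges as (parent_index, child_index) pairs.
    def dist2(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    in_tree = {0}
    edges = []
    while len(in_tree) < len(points):
        # Greedily attach the closest point not yet in the tree.
        i, j = min(
            ((i, j) for i in in_tree
             for j in range(len(points)) if j not in in_tree),
            key=lambda e: dist2(points[e[0]], points[e[1]]),
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges
```

Connecting each cluster to its nearest neighbors in this way produces a tree in which gradual phenotypic progressions (for example, hematopoietic differentiation) appear as branches.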
Impacts of incorporating personal genome sequencing into graduate genomics education: a longitudinal study over three course years
Background To address the need for more effective genomics training, since 2012 the Icahn School of Medicine at Mount Sinai has offered a unique laboratory-style graduate genomics course, “Practical Analysis of Your Personal Genome” (PAPG), in which students optionally sequence and analyze their own whole genome. We hypothesized that incorporating personal genome sequencing (PGS) into the course pedagogy could improve educational outcomes by increasing student motivation and engagement. Here we extend our initial study of the pilot PAPG cohort with a report on student attitudes towards genome sequencing, decision-making, psychological wellbeing, genomics knowledge and pedagogical engagement across three course years. Methods Students enrolled in the 2013, 2014 and 2015 course years completed questionnaires before (T1) and after (T2) a prerequisite workshop (n = 110) and before (T3) and after (T4) PAPG (n = 66). Results Students’ interest in PGS was high; 56 of 59 eligible students chose to sequence their own genome. Decisional conflict decreased significantly after the prerequisite workshop (T2 vs. T1, p < 0.001). Most, but not all, students reported low levels of decision regret and test-related distress post-course (T4). Baseline decisional conflict decreased each year (p < 0.001), suggesting that, as the course became more established, students increasingly made their decision before enrolling in the prerequisite workshop. Students perceived that analyzing their own genome enhanced the genomics pedagogy, self-reporting greater persistence and engagement as a result. More than 90% of respondents reported spending additional time outside of course assignments analyzing their genome. Conclusions Incorporating personal genome sequencing into graduate medical education may improve student motivation and engagement. However, more data will be needed to quantitatively evaluate whether incorporating PGS is more effective than other educational approaches.
Psychological and behavioural impact of returning personal results from whole-genome sequencing: the HealthSeq project
Providing ostensibly healthy individuals with personal results from whole-genome sequencing could lead to improved health and well-being via enhanced disease risk prediction, prevention, and diagnosis, but also poses practical and ethical challenges. Understanding how individuals react psychologically and behaviourally will be key in assessing the potential utility of personal whole-genome sequencing. We conducted an exploratory longitudinal cohort study in which quantitative surveys and in-depth qualitative interviews were conducted before and after personal results were returned to individuals who underwent whole-genome sequencing. The participants were offered a range of interpreted results, including Alzheimer's disease, type 2 diabetes, pharmacogenomics, rare disease-associated variants, and ancestry. They were also offered their raw data. Of the 35 participants at baseline, 29 (82.9%) completed the 6-month follow-up. In the quantitative surveys, test-related distress was low, although it was higher at 1-week than 6-month follow-up (Z=2.68, P=0.007). In the 6-month qualitative interviews, most participants felt happy or relieved about their results. A few were concerned, particularly about rare disease-associated variants and Alzheimer's disease results. Two of the 29 participants had sought clinical follow-up as a direct or indirect consequence of rare disease-associated variants results. Several had mentioned their results to their doctors. Some participants felt having their raw data might be medically useful to them in the future. The majority reported positive reactions to having their genomes sequenced, but there were notable exceptions to this. The impact and value of returning personal results from whole-genome sequencing when implemented on a larger scale remains to be seen.