Catalogue Search | MBRL

Computational methods in biomedical research

by Khattree, Ravindra , Naik, Dayanand N in Medicine -- Research -- Data processing , Biology -- Research -- Data processing , Medicine -- Research -- Statistical methods

Book

Statistical advances in the biomedical sciences

by Datta, Sujay , Biswas, Atanu , Fine, Jason P in Bioinformatics , Biology , Biology -- Research -- Statistical methods

2007,2008

The Most Comprehensive and Cutting-Edge Guide to Statistical Applications in Biomedical Research. With the increasing use of biotechnology in medical research and the sophisticated advances in computing, it has become essential for practitioners in the biomedical sciences to be fully educated on the role statistics plays in ensuring the accurate analysis of research findings. Statistical Advances in the Biomedical Sciences explores the growing value of statistical knowledge in the management and comprehension of medical research and, more specifically, provides an accessible introduction to the contemporary methodologies used to understand complex problems in the four major areas of modern-day biomedical science: clinical trials, epidemiology, survival analysis, and bioinformatics. Composed of contributions from eminent researchers in the field, this volume discusses the application of statistical techniques to various aspects of modern medical research and illustrates how these methods ultimately prove to be an indispensable part of proper data collection and analysis. A structural uniformity is maintained across all chapters, each beginning with an introduction that discusses general concepts and the biomedical problem under focus and is followed by specific details on the associated methods, algorithms, and applications. In addition, each chapter provides a summary of the main ideas and offers a concluding remarks section that presents novel ideas, approaches, and challenges for future research. Complete with detailed references and insight on the future directions of biomedical research, Statistical Advances in the Biomedical Sciences provides vital statistical guidance to practitioners in the biomedical sciences while also introducing statisticians to new, multidisciplinary frontiers of application.
This text is an excellent reference for graduate- and PhD-level courses in various areas of biostatistics and the medical sciences and also serves as a valuable tool for medical researchers, statisticians, public health professionals, and biostatisticians.

eBook

mixOmics: An R package for ‘omics feature selection and multiple data integration

by Gautier, Benoît , Lê Cao, Kim-Anh , Rohart, Florian in Bioinformatics , Biological analysis , Biology

2017

The advent of high throughput technologies has led to a wealth of publicly available 'omics data coming from different sources, such as transcriptomics, proteomics, metabolomics. Combining such large-scale biological data sets can lead to the discovery of important biological insights, provided that relevant information can be extracted in a holistic manner. Current statistical approaches have been focusing on identifying small subsets of molecules (a 'molecular signature') to explain or predict biological conditions, but mainly for a single type of 'omics. In addition, commonly used methods are univariate and consider each biological feature independently. We introduce mixOmics, an R package dedicated to the multivariate analysis of biological data sets with a specific focus on data exploration, dimension reduction and visualisation. By adopting a systems biology approach, the toolkit provides a wide range of methods that statistically integrate several data sets at once to probe relationships between heterogeneous 'omics data sets. Our recent methods extend Projection to Latent Structure (PLS) models for discriminant analysis, for data integration across multiple 'omics data or across independent studies, and for the identification of molecular signatures. We illustrate our latest mixOmics integrative frameworks for the multivariate analyses of 'omics data available from the package.

Journal Article

Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data

by Startek, Michał , Miasojedow, Błażej , Gambin, Anna in Algorithms , Animals , Aquatic habitats

2019

Background: A survey of presences and absences of specific species across multiple biogeographic units (or bioregions) is used in a broad range of biological studies from ecology to microbiology. Using binary presence-absence data, we evaluate species co-occurrences that help elucidate relationships among organisms and environments. To summarize similarity between occurrences of species, we routinely use the Jaccard/Tanimoto coefficient, which is the ratio of their intersection to their union. It is natural, then, to identify statistically significant Jaccard/Tanimoto coefficients, which suggest non-random co-occurrences of species. However, statistical hypothesis testing using this similarity coefficient has been seldom used or studied. Results: We introduce a hypothesis test for similarity for biological presence-absence data, using the Jaccard/Tanimoto coefficient. Several key improvements are presented, including unbiased estimation of expectation and centered Jaccard/Tanimoto coefficients, which account for occurrence probabilities. The exact and asymptotic solutions are derived. To overcome a computational burden due to high-dimensionality, we propose the bootstrap and measurement concentration algorithms to efficiently estimate statistical significance of binary similarity. Comprehensive simulation studies demonstrate that our proposed methods produce accurate p-values and false discovery rates. The proposed estimation methods are orders of magnitude faster than the exact solution, particularly with an increasing dimensionality. We showcase their applications in evaluating co-occurrences of bird species in 28 islands of Vanuatu and fish species in 3347 freshwater habitats in France. The proposed methods are implemented in an open source R package called jaccard (https://cran.r-project.org/package=jaccard).
Conclusion: We introduce a suite of statistical methods for the Jaccard/Tanimoto similarity coefficient for binary data that enable straightforward incorporation of probabilistic measures in analysis for species co-occurrences. Due to their generality, the proposed methods and implementations are applicable to a wide range of binary data arising from genomics, biochemistry, and other areas of science.
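
As a rough illustration of the coefficient described in this abstract (the ratio of intersection to union for two presence-absence vectors), here is a minimal Python sketch. It is not the jaccard R package and does not implement the exact, asymptotic, bootstrap, or measurement concentration methods named above. The significance check is a plain permutation test swapped in for simplicity, and the function names, the example vectors, and the convention of returning 0.0 for an empty union are all assumptions made for illustration.

```python
import random


def jaccard(x, y):
    """Jaccard/Tanimoto coefficient of two equal-length 0/1 sequences."""
    if len(x) != len(y):
        raise ValueError("presence-absence vectors must have equal length")
    intersection = sum(1 for a, b in zip(x, y) if a and b)
    union = sum(1 for a, b in zip(x, y) if a or b)
    # Assumed convention: two all-absent vectors get similarity 0.0.
    return intersection / union if union else 0.0


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """One-sided permutation p-value for the observed coefficient.

    Shuffling y breaks any association with x while keeping the number
    of presences fixed. This is a generic stand-in, not the paper's test.
    """
    rng = random.Random(seed)
    observed = jaccard(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if jaccard(x, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: presence (1) or absence (0) of two species at 8 sites.
species_a = [1, 1, 0, 1, 0, 0, 1, 0]
species_b = [1, 1, 0, 0, 0, 0, 1, 1]
print(jaccard(species_a, species_b))  # 3 shared sites / 5 occupied sites = 0.6
print(permutation_p_value(species_a, species_b))
```

With only eight sites the permutation p-value is coarse; the point is only to show what a significance test on the coefficient is asking.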

Journal Article

Ten quick tips for effective dimensionality reduction

by Nguyen, Lan Huong , Holmes, Susan in Artificial intelligence , Bioinformatics , Biological research

2019

Both a means of denoising and simplification, dimensionality reduction can be beneficial for the majority of modern biological datasets, in which it’s not uncommon to have hundreds or even millions of simultaneous measurements collected for a single sample. Because of “the curse of dimensionality,” many statistical methods lack power when applied to high-dimensional data. Formally, the Marchenko–Pastur distribution asymptotically models the distribution of the singular values of large random matrices. [...] for datasets large in both the number of observations and features, you use a rule of retaining only eigenvalues outside the support of the fitted Marchenko–Pastur distribution; however, remember that this applies only when your data have at least thousands of samples and thousands of features. [...] the height-to-width ratio of a PCA plot should be consistent with the ratio between the corresponding eigenvalues. Because eigenvalues reflect the variance in coordinates of the associated PCs, you only need to ensure that in the plots, one "unit" in direction of one PC has the same length as one "unit" in direction of another PC. Because batch effects can confound the signal of interest, it is a good practice to check for their presence and, if found, to remove them before proceeding with further downstream analysis.
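
The eigenvalue rule mentioned in this abstract can be sketched in a few lines of Python. This is a simplified illustration and not the authors' procedure: it assumes the noise variance is known (here 1.0) instead of fitting the Marchenko–Pastur distribution to the data, it treats the inputs as eigenvalues of a sample covariance matrix of centered data, and the function names and example eigenvalues are hypothetical.

```python
import math


def marchenko_pastur_edges(n_samples, n_features, noise_variance=1.0):
    """Lower and upper edges of the Marchenko-Pastur bulk.

    For pure-noise data with the given variance, sample-covariance
    eigenvalues concentrate in [lower, upper] as both dimensions grow
    with a fixed ratio gamma = n_features / n_samples.
    """
    gamma = n_features / n_samples
    lower = noise_variance * (1.0 - math.sqrt(gamma)) ** 2
    upper = noise_variance * (1.0 + math.sqrt(gamma)) ** 2
    return lower, upper


def eigenvalues_above_noise(eigenvalues, n_samples, n_features, noise_variance=1.0):
    """Keep only eigenvalues above the upper edge of the noise bulk."""
    _, upper = marchenko_pastur_edges(n_samples, n_features, noise_variance)
    return [ev for ev in eigenvalues if ev > upper]


# Hypothetical example: 4000 samples, 1000 features, unit noise variance.
print(marchenko_pastur_edges(4000, 1000))  # gamma = 0.25, so edges are (0.25, 2.25)
print(eigenvalues_above_noise([9.1, 4.0, 2.3, 2.1, 1.0, 0.4], 4000, 1000))
```

In practice the noise variance would have to be estimated, and, as the abstract cautions, the asymptotic rule is only meaningful with thousands of samples and thousands of features.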

Journal Article

Statistical power for cluster analysis

by Dalmaijer, Edwin S. , Nord, Camilla L. , Astle, Duncan E. in Algorithms , Bioinformatics , Biomedical and Life Sciences

2022

Background: Cluster algorithms are gaining in popularity in biomedical research due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream software. While guidelines exist for algorithm selection and outcome evaluation, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we estimated power and classification accuracy for common analysis pipelines through simulation. We systematically varied subgroup size, number, separation (effect size), and covariance structure. We then subjected generated datasets to dimensionality reduction approaches (none, multi-dimensional scaling, or uniform manifold approximation and projection) and cluster algorithms (k-means, agglomerative hierarchical clustering with Ward or average linkage and Euclidean or cosine distance, HDBSCAN). Finally, we directly compared the statistical power of discrete (k-means), “fuzzy” (c-means), and finite mixture modelling approaches (which include latent class analysis and latent profile analysis). Results: We found that clustering outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure. Sufficient statistical power was achieved with relatively small samples (N = 20 per subgroup), provided cluster separation is large (Δ = 4). Finally, we demonstrated that fuzzy clustering can provide a more parsimonious and powerful alternative for identifying separable multivariate normal distributions, particularly those with slightly lower centroid separation (Δ = 3). Conclusions: Traditional intuitions about statistical power only partially apply to cluster analysis: increasing the number of participants above a sufficient sample size did not improve power, but effect size was crucial.
Notably, for the popular dimensionality reduction and clustering algorithms tested here, power was only satisfactory for relatively large effect sizes (clear separation between subgroups). Fuzzy clustering provided higher power in multivariate normal distributions. Overall, we recommend that researchers (1) only apply cluster analysis when large subgroup separation is expected, (2) aim for sample sizes of N = 20 to N = 30 per expected subgroup, (3) use multi-dimensional scaling to improve cluster separation, and (4) use fuzzy clustering or mixture modelling approaches that are more powerful and more parsimonious with partially overlapping multivariate normal distributions.
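
The simulation idea in this abstract (generate subgroups with a chosen separation, cluster them, and score the result) can be illustrated with a heavily simplified Python sketch. It is not the authors' pipeline: it uses a single feature instead of multivariate data, only a two-cluster k-means, no dimensionality reduction, and it reports classification accuracy instead of statistical power. Treating Δ as the difference between subgroup means in standard-deviation units is an assumption made here, and all function names are hypothetical.

```python
import random


def two_means_1d(values, n_iter=50):
    """Lloyd's algorithm with k=2 in one dimension; returns a 0/1 label per value."""
    c0, c1 = min(values), max(values)  # start the centroids at the extremes
    labels = [0] * len(values)
    for _ in range(n_iter):
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        if not group0 or not group1:
            break  # degenerate case, e.g. all values identical
        new_c0 = sum(group0) / len(group0)
        new_c1 = sum(group1) / len(group1)
        if (new_c0, new_c1) == (c0, c1):
            break  # centroids stopped moving
        c0, c1 = new_c0, new_c1
    return labels


def mean_accuracy(delta, n_per_group=20, n_sims=200, seed=0):
    """Average classification accuracy over simulated two-subgroup datasets."""
    rng = random.Random(seed)
    truth = [0] * n_per_group + [1] * n_per_group
    total = 0.0
    for _ in range(n_sims):
        values = ([rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
                  + [rng.gauss(delta, 1.0) for _ in range(n_per_group)])
        labels = two_means_1d(values)
        agree = sum(1 for t, lab in zip(truth, labels) if t == lab) / len(truth)
        total += max(agree, 1.0 - agree)  # cluster labels are arbitrary
    return total / n_sims


for delta in (1.0, 2.0, 4.0):
    print(delta, round(mean_accuracy(delta), 3))
```

Running the loop shows accuracy rising with separation at a fixed N = 20 per subgroup, which is the qualitative pattern the abstract reports.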

Journal Article

The triumphs and limitations of computational methods for scRNA-seq

by Kharchenko, Peter V in Algorithms , Approximation , Biology

2021

The rapid progress of protocols for sequencing single-cell transcriptomes over the past decade has been accompanied by equally impressive advances in the computational methods for analysis of such data. As capacity and accuracy of the experimental techniques grew, the emerging algorithm developments revealed increasingly complex facets of the underlying biology, from cell type composition to gene regulation to developmental dynamics. At the same time, rapid growth has forced continuous reevaluation of the underlying statistical models, experimental aims, and sheer volumes of data processing that are handled by these computational tools. Here, I review key computational steps of single-cell RNA sequencing (scRNA-seq) analysis, examine assumptions made by different approaches, and highlight successes, remaining ambiguities, and limitations that are important to keep in mind as scRNA-seq becomes a mainstream technique for studying biology. This review provides an overview of recent computational developments in scRNA-seq analysis and highlights packages and tools applied in executing these analyses.

Journal Article

PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph

by Burlot, Laura , Planel, Rémi , Bazin, Adelme in Algorithms , Bacteria - classification , Bacteria - genetics

2020

Microorganisms have the greatest biodiversity and evolutionary history on Earth. At the genomic level, it is reflected by a highly variable gene content even among organisms from the same species, which explains the ability of microbes to be pathogenic or to grow in specific environments. We developed a new method called PPanGGOLiN which accurately represents the genomic diversity of a species (i.e. its pangenome) using a compact graph structure. Based on this pangenome graph, we classify genes by a statistical method according to their occurrence in the genomes. This method allowed us to build pangenomes even for uncultivated species at an unprecedented scale. We applied our method to all available genomes in databanks in order to depict the overall diversity of hundreds of species. Overall, our work enables microbiologists to explore and visualize pangenomes like a subway map.

Journal Article

SARTools: A DESeq2- and EdgeR-Based R Pipeline for Comprehensive Differential Analysis of RNA-Seq Data

by Brillet-Guéguen, Loraine , Dillies, Marie-Agnès , Varet, Hugo in Binomial distribution , Bioinformatics , Biology

2016

Several R packages exist for the detection of differentially expressed genes from RNA-Seq data. The analysis process includes three main steps, namely normalization, dispersion estimation and test for differential expression. Quality control steps along this process are recommended but not mandatory, and failing to check the characteristics of the dataset may lead to spurious results. In addition, normalization methods and statistical models are not exchangeable across the packages without adequate transformations that users are often not aware of. Thus, dedicated analysis pipelines are needed to include systematic quality control steps and prevent errors from misusing the proposed methods. SARTools is an R pipeline for differential analysis of RNA-Seq count data. It can handle designs involving two or more conditions of a single biological factor with or without a blocking factor (such as a batch effect or a sample pairing). It is based on DESeq2 and edgeR and is composed of an R package and two R script templates (for DESeq2 and edgeR respectively). By tuning a small number of parameters and executing one of the R scripts, users have access to the full results of the analysis, including lists of differentially expressed genes and an HTML report that (i) displays diagnostic plots for quality control and model hypotheses checking and (ii) keeps track of the whole analysis process, parameter values and versions of the R packages used. SARTools provides systematic quality controls of the dataset as well as diagnostic plots that help to tune the model parameters. It gives access to the main parameters of DESeq2 and edgeR and prevents untrained users from misusing some functionalities of both packages. By keeping track of all the parameters of the analysis process, it fits the requirements of reproducible research.

Journal Article

Count-based differential expression analysis of RNA sequencing data using R and Bioconductor

by Smyth, Gordon K , Huber, Wolfgang , Chen, Yunshun in 38/91 , 631/114/2415 , 631/1647/514/1949

2013

RNA sequencing (RNA-seq) has been rapidly adopted for the profiling of transcriptomes in many areas of biology, including studies into gene regulation, development and disease. Of particular interest is the discovery of differentially expressed genes across different conditions (e.g., tissues, perturbations) while optionally adjusting for other systematic factors that affect the data-collection process. There are a number of subtle yet crucial aspects of these analyses, such as read counting, appropriate treatment of biological variability, quality control checks and appropriate setup of statistical modeling. Several variations have been presented in the literature, and there is a need for guidance on current best practices. This protocol presents a state-of-the-art computational and statistical RNA-seq differential expression analysis workflow largely based on the free open-source R language and Bioconductor software and, in particular, on two widely used tools, DESeq and edgeR. Hands-on time for typical small experiments (e.g., 4–10 samples) can be <1 h, with computation time <1 d using a standard desktop PC.

Journal Article
