Search Results
Available filters: Discipline, Is Peer Reviewed, Reading Level, Content Type, Year (From/To), and More Filters (Item Type, Is Full-Text Available, Subject, Publisher, Source, Donor, Language, Place of Publication, Contributors, Location).
7,419 result(s) for "Data Preparation"
A data scientist's guide to acquiring, cleaning and managing data in R
The only how-to guide offering a unified, systematic approach to acquiring, cleaning, and managing data in R. Every experienced practitioner knows that preparing data for modeling is a painstaking, time-consuming process. Adding to the difficulty is that most modelers learn the steps involved in cleaning and managing data piecemeal, often on the fly, or they develop their own ad hoc methods. This book helps simplify their task by providing a unified, systematic approach to acquiring, modeling, manipulating, cleaning, and maintaining data in R. Starting with the very basics, data scientists Samuel E. Buttrey and Lyn R. Whitaker walk readers through the entire process. From what data looks like and what it should look like, they progress through all the steps involved in getting data ready for modeling. They describe best practices for acquiring data from numerous sources; explore key issues in data handling, including text/regular expressions, big data, parallel processing, merging, matching, and checking for duplicates; and outline highly efficient and reliable techniques for documenting data and recordkeeping, including audit trails, getting data back out of R, and more. The only single-source guide to R data and its preparation, this book:
  • Describes best practices for acquiring, manipulating, cleaning, and maintaining data
  • Begins with the basics and walks readers through all the steps necessary to get data ready for the modeling process
  • Provides expert guidance on how to document the processes described so that they are reproducible
  • Is written by seasoned professionals and provides both introductory and advanced techniques
  • Features case studies with supporting data and R code, hosted on a companion website
A Data Scientist's Guide to Acquiring, Cleaning and Managing Data in R is a valuable working resource/bench manual for practitioners who collect and analyze data, lab scientists and research associates of all levels of experience, and graduate-level data mining students.
Data Cleaning Pocket Primer
As part of the best-selling Pocket Primer series, this book is an effort to give programmers sufficient knowledge of data cleaning to be able to work on their own projects. It is designed as a practical introduction to using flexible, powerful (and free) Unix/Linux shell commands to perform common data cleaning tasks. The book is packed with realistic examples and numerous commands that illustrate both the syntax and how the commands work together. Companion files with source code are available for downloading from the publisher.
Features:
  • A practical introduction to using flexible, powerful (and free) Unix/Linux shell commands to perform common data cleaning tasks
  • Includes the concept of piping data between commands, regular expression substitution, and the sed and awk commands
  • Packed with realistic examples and numerous commands that illustrate both the syntax and how the commands work together
  • Assumes the reader has no prior experience, but covers the topic comprehensively enough to teach a pro some new tricks
  • Includes companion files with all of the source code examples (download from the publisher)
Agent-based computational economics using NetLogo
This e-book explores how researchers can create, use, and implement multi-agent computational models in economics using the NetLogo software platform. Problems in economic science can be tackled with multi-agent modelling (MAM). The 11 models presented in this text simulate the simultaneous operations of several agents in an attempt to recreate and predict complex economic phenomena.
Multi‐time‐point data preparation robustly reveals MCI and dementia risk factors
Introduction: Conflicting results on dementia risk factors have been reported across studies. We hypothesize that variation in data preparation methods may partially contribute to this issue. Methods: We propose a comprehensive data preparation approach comparing individuals with a stable diagnosis over time to those who progress to mild cognitive impairment (MCI)/dementia. This was compared to the often-used "baseline" analysis. Multivariate logistic regression was used to evaluate both methods. Results: The results obtained from sensitivity analyses were consistent with those from our multi-time-point data preparation approach, demonstrating its robustness. Compared to analysis using only baseline data, the number of significant risk factors identified in progression analyses was substantially lower. Additionally, we found that moderate depression increased healthy-to-MCI/dementia risk, while hypertension reduced MCI-to-dementia risk. Discussion: Overall, multi-time-point-based data preparation approaches may pave the way for a better understanding of dementia risk factors and address some of the reproducibility issues in the field.
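As a rough sketch of the selection logic this abstract describes (not the authors' code), the baseline-only and multi-time-point analyses differ in how participants are drawn from longitudinal records; the table layout, column names, and diagnostic labels below are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical longitudinal table: one row per participant per visit,
# with a diagnosis of "healthy", "MCI", or "dementia" at each visit.
visits = pd.DataFrame({
    "participant_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "visit":          [0, 1, 2, 0, 1, 0, 1, 2],
    "diagnosis":      ["healthy", "healthy", "MCI",   # participant 1 progresses
                       "healthy", "healthy",          # participant 2 stays stable
                       "MCI", "MCI", "MCI"],          # participant 3 stays stable
})

# Baseline-only analysis: keep the first visit per participant, discard follow-up.
baseline = visits.sort_values("visit").groupby("participant_id", as_index=False).first()

# Multi-time-point analysis: label each participant by whether the diagnosis ever
# changes across visits (stable vs. progressor); that label would then serve as the
# outcome of a multivariate logistic regression on baseline risk factors.
n_labels = visits.groupby("participant_id")["diagnosis"].nunique()
outcome = n_labels.map(lambda n: "stable" if n == 1 else "progressor")
print(outcome)
```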
Table 0; documenting the steps to go from clinical database to research dataset
Data-driven decision support tools are increasingly recognized for their potential to transform health care. However, such tools are often developed on predefined research datasets without adequate knowledge of the origin of the data and how it was selected. How a dataset is extracted from a clinical database can profoundly impact the validity, interpretability, and interoperability of the dataset and of downstream analyses, yet this process is rarely reported. Therefore, we present a case study illustrating how a definitive patient list was extracted from a clinical source database and how this can be reported. A single-center observational study was performed at an academic hospital in the Netherlands to illustrate the impact of selecting a definitive patient list for research from a clinical source database, and the importance of documenting this process. All admissions in the critical care database between January 1, 2013, and January 1, 2023, were used. An interdisciplinary team collaborated to identify and address potential sources of data insufficiency and uncertainty. We demonstrate a stepwise data preparation process that reduced the clinical source database of 54,218 admissions to a definitive patient list of 21,553 admissions. Transparent documentation of the data preparation process improves the quality of the definitive patient list before analysis of the corresponding patient data. This study generated seven important recommendations for preparing observational health-care data for research purposes. Documenting data preparation is essential for understanding a research dataset originating from a clinical source database before analyzing health-care data. The findings contribute to establishing data standards and offer insights into the complexities of preparing health-care data for scientific investigation. Meticulous data preparation, and documentation thereof, will improve research validity and advance critical care.
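A minimal sketch of the kind of stepwise, documented reduction the abstract describes, assuming a hypothetical admissions table; the column names, exclusion criteria, and counts are invented for illustration and are not taken from the study.

```python
import pandas as pd

# Hypothetical admissions extract from a clinical source database
# (schema and exclusion rules are illustrative, not from the paper).
admissions = pd.DataFrame({
    "admission_id": [1, 2, 3, 4, 5],
    "admit_time": ["2012-05-01", "2015-03-12", "2018-07-30", "2021-11-02", "2022-06-15"],
    "age": [67, 15, 54, 71, 80],
    "opt_out": [False, False, False, True, False],
})

steps = [("Source database", admissions)]

def apply_step(label, mask):
    """Apply one documented filtering step and record what remains."""
    previous = steps[-1][1]
    steps.append((label, previous[mask.loc[previous.index]]))

# Each exclusion is a named, ordered step, so the reduction from the raw extract
# to the definitive patient list can be reported as a "Table 0".
apply_step("Admitted 2013-2023", admissions["admit_time"].between("2013-01-01", "2023-01-01"))
apply_step("Adult patients only", admissions["age"] >= 18)
apply_step("Consent not withdrawn", ~admissions["opt_out"])

table0 = pd.DataFrame({
    "step": [label for label, _ in steps],
    "admissions_remaining": [len(df) for _, df in steps],
})
print(table0)
```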
Driving Data Projects
Digital transformation and data projects are not new, and yet, for many, they remain a challenge. Driving Data Projects is a compelling guide that empowers data teams and professionals to navigate the complexities of data projects, fostering a more data-informed culture within their organizations. With practical insights and step-by-step methodologies, it provides a clear path for driving data projects effectively in any organization, regardless of its sector or maturity level, whilst also showing how to overcome the overwhelming feeling of not knowing where to start and how not to lose momentum. The book offers the keys to identifying opportunities for data projects and to overcoming the challenges that stand in the way of successful data initiatives. Driving Data Projects is highly practical, providing reflections, worksheets, checklists, activities, and tools that make it accessible to students new to driving data projects and culture change. It is also a must-have guide for data teams and professionals committed to unleashing the transformative power of data in their organizations.
SMOTE-RSB: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory
Imbalanced data is a common problem in classification. The phenomenon is growing in importance since it appears in most real domains, and it has special relevance for highly imbalanced data-sets (when the ratio between classes is high). Many techniques have been developed to tackle the problem of imbalanced training sets in supervised learning. These techniques are commonly divided into two large groups: those at the algorithm level and those at the data level. At the data level, the most prominent approaches try to balance the training set either by reducing the larger class through the elimination of samples (undersampling) or by enlarging the smaller class through the construction of new samples (oversampling). This paper proposes a new hybrid method for preprocessing imbalanced data-sets through the construction of new samples, using the Synthetic Minority Oversampling Technique together with an editing technique based on Rough Set Theory and the lower approximation of a subset. The proposed method has been validated in an experimental study showing good results using C4.5 as the learning algorithm.
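A minimal sketch of the general oversample-then-edit idea, assuming the imbalanced-learn implementation of SMOTE; the rough-set lower-approximation editing of SMOTE-RSB is replaced here by a crude nearest-neighbour consistency filter, so this is only an illustration, not the authors' algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors
from imblearn.over_sampling import SMOTE   # pip install imbalanced-learn

# Toy imbalanced data-set: class 1 is the 5% minority class.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Oversample the minority class with SMOTE.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
n_original = len(X)
X_syn, y_syn = X_res[n_original:], y_res[n_original:]   # synthetic rows are appended after the originals

# Editing step (stand-in for the rough-set lower approximation): keep a synthetic
# minority sample only if most of its nearest original neighbours are minority.
nn = NearestNeighbors(n_neighbors=5).fit(X)
_, idx = nn.kneighbors(X_syn)
keep = (y[idx] == 1).mean(axis=1) >= 0.6

X_final = np.vstack([X, X_syn[keep]])
y_final = np.concatenate([y, y_syn[keep]])
print(f"kept {keep.sum()} of {len(X_syn)} synthetic samples")
```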
PRIME-IPD SERIES Part 1. The PRIME-IPD tool promoted verification and standardization of study datasets retrieved for IPD meta-analysis
We describe a systematic approach to preparing data for the conduct of Individual Participant Data (IPD) analysis. This is a guidance paper proposing methods for preparing individual participant data for meta-analysis from multiple study sources, developed through consultation of relevant guidance and of experts in IPD. We present an example of how these steps were applied in checking data for our own IPD meta-analysis (IPD-MA). We propose five steps of Processing, Replication, Imputation, Merging, and Evaluation to prepare individual participant data for meta-analysis (PRIME-IPD). Using our own IPD-MA as an exemplar, we found that this approach identified missing variables and potential inconsistencies in the data, facilitated the standardization of indicators across studies, confirmed that the correct data were received from investigators, and resulted in a single, verified dataset for the IPD-MA. The PRIME-IPD approach can assist researchers in systematically preparing, managing, and conducting important quality checks on IPD from multiple studies for meta-analyses. Further testing of this framework in IPD-MA would be useful to refine these steps.
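PRIME-IPD is a methodological framework rather than software, but the flavour of its Merging and Evaluation steps can be sketched as harmonising study datasets and running verification checks; the study tables, column mappings, and checks below are invented for illustration.

```python
import pandas as pd

# Two hypothetical study files with inconsistent variable names and codings.
study_a = pd.DataFrame({"id": [1, 2], "sex": ["F", "M"], "age_years": [34, 41]})
study_b = pd.DataFrame({"pid": [7, 8], "gender": ["female", "male"], "age": [29, 55]})

# Processing: map each study's variables onto a common data dictionary.
harmonised = [
    study_a.rename(columns={"id": "participant_id", "age_years": "age"}),
    study_b.rename(columns={"pid": "participant_id", "gender": "sex"})
           .assign(sex=lambda d: d["sex"].str[0].str.upper()),
]

# Merging: stack the standardised study datasets into a single IPD table.
ipd = pd.concat(
    [df.assign(study=name) for df, name in zip(harmonised, ["A", "B"])],
    ignore_index=True,
)

# Evaluation: basic verification checks before analysis.
assert ipd["participant_id"].notna().all(), "missing participant identifiers"
assert ipd.groupby("study")["participant_id"].apply(lambda s: s.is_unique).all()
assert ipd["age"].between(0, 120).all(), "implausible ages"
print(ipd)
```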
Fine-grained semantic type discovery for heterogeneous sources using clustering
We focus on semantic type discovery over a set of heterogeneous sources, an important data preparation task. We consider the challenging setting of multiple Web data sources in a vertical domain, which present sparse data and a high degree of heterogeneity, even internally within each individual source. We assume each source provides a collection of entity specifications, i.e., entity descriptions, each expressed as a set of attribute name-value pairs. Semantic type discovery aims at clustering individual attribute name-value pairs that represent the same semantic concept. We take advantage of the opportunities arising from the redundancy of information across such sources and propose the iterative RaF-STD solution, which consists of three key steps: (i) a Bayesian model analysis of overlapping information across sources to match the most locally homogeneous attributes; (ii) a tagging approach, inspired by NLP techniques, to create (virtual) homogeneous attributes from portions of heterogeneous attribute values; and (iii) a novel use of classical techniques based on matching of attribute names and domains. Empirical evaluation on the DI2KG and WDC benchmarks demonstrates the superiority of RaF-STD over alternative approaches adapted from the literature.
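As a rough illustration of the task itself (not of RaF-STD), attribute name-value pairs from different sources can be clustered so that attributes expressing the same semantic concept group together; the sources, attribute names, and values below are made up, and plain TF-IDF plus agglomerative clustering stands in for the paper's three-step method.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up (source, attribute name, sample of observed values) triples.
attributes = [
    ("source1", "screen size",  "13.3 inch 15.6 inch"),
    ("source2", "display",      "15.6-inch display 13 inch"),
    ("source1", "brand",        "lenovo hp dell"),
    ("source2", "manufacturer", "dell hewlett-packard lenovo"),
    ("source3", "ram",          "8 gb 16 gb ddr4"),
    ("source3", "memory",       "16 gb ram 8gb ddr4"),
]

# Represent each attribute by its name plus a bag of its observed values,
# then cluster: attributes landing in the same cluster are candidate
# instances of the same semantic type.
docs = [f"{name} {values}" for _, name, values in attributes]
X = TfidfVectorizer().fit_transform(docs).toarray()
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

for (source, name, _), label in zip(attributes, labels):
    print(label, source, name)
```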