Catalogue Search | MBRL
Search Results
Explore the vast range of titles available.
3,112 result(s) for "Data cleaning"
How to Address the Data Quality Issues in Regression Models: A Guided Process for Data Cleaning
2018
Today, data availability has gone from scarce to superabundant. Technologies like IoT, trends in social media, and the capabilities of smartphones are producing and digitizing vast amounts of data that were previously unavailable. This massive increase in data creates opportunities for new business models, but it also demands new data quality techniques and methods for knowledge discovery, especially when the data come from different sources (e.g., sensors, social networks, cameras). The conclusions drawn from a dataset depend on the quality of the information it contains, which is increasingly assured with the aid of data cleaning approaches. Guaranteeing high data quality is therefore considered a primary goal of the data scientist. In this paper, we propose a process for data cleaning in regression models (DC-RM). The proposed data cleaning process is evaluated on real datasets from the UCI Repository of Machine Learning Databases: each dataset cleaned by DC-RM is used to train the same regression models proposed by the authors of the UCI datasets. The results achieved by models trained on the data produced by DC-RM are better than or equal to those reported by the datasets' authors.
Journal Article
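The abstract does not spell out the DC-RM steps, but its clean-then-retrain evaluation idea is easy to illustrate. Below is a minimal pandas/scikit-learn sketch, not the authors' method: the `basic_clean` helper, the 3-sigma target filter, and the file name are all invented for illustration.

```python
# Sketch of the clean-then-retrain evaluation idea (not the authors'
# DC-RM process): clean a tabular dataset, fit the same regression
# model on raw vs. cleaned data, and compare held-out R^2.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def basic_clean(df: pd.DataFrame, target: str) -> pd.DataFrame:
    df = df.drop_duplicates()
    df = df.dropna(subset=[target])             # target must be present
    num = df.select_dtypes("number").columns
    df[num] = df[num].fillna(df[num].median())  # impute numeric features
    # drop rows whose target is more than 3 standard deviations from the mean
    z = (df[target] - df[target].mean()) / df[target].std()
    return df[z.abs() <= 3]

def fit_score(df: pd.DataFrame, target: str) -> float:
    X = df.drop(columns=[target]).select_dtypes("number")
    y = df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    return r2_score(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))

# df = pd.read_csv("some_uci_dataset.csv")  # hypothetical file
# print(fit_score(df, "target"), fit_score(basic_clean(df, "target"), "target"))
```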
Best practice recommendations for data screening
by DeSimone, Justin A.; DeSimone, Alice J.; Harms, P. D.
in: Best practice; Computation; Credibility
2015
Survey respondents differ in their levels of attention and effort when responding to items. There are a number of methods researchers may use to identify respondents who fail to exert sufficient effort in order to increase the rigor of analysis and enhance the trustworthiness of study results. Screening techniques are organized into three general categories, which differ in impact on survey design and potential respondent awareness. Assumptions and considerations regarding appropriate use of screening techniques are discussed along with descriptions of each technique. The utility of each screening technique is a function of survey design and administration. Each technique has the potential to identify different types of insufficient effort. An example dataset is provided to illustrate these differences and familiarize readers with the computation and implementation of the screening techniques. Researchers are encouraged to consider data screening when designing a survey, select screening techniques on the basis of theoretical considerations (or empirical considerations when pilot testing is an option), and report the results of an analysis both before and after employing data screening techniques.
Journal Article
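One widely used, unobtrusive screening technique of the kind this article covers is long-string analysis: flag respondents whose longest run of identical consecutive answers suggests insufficient effort. A small sketch using only the standard library; the cutoff of 8 and the example responses are invented for illustration.

```python
# Long-string analysis: flag respondents whose longest run of identical
# consecutive answers meets or exceeds a cutoff (cutoff is illustrative).
from itertools import groupby

def longest_run(responses: list[int]) -> int:
    """Length of the longest run of identical consecutive responses."""
    return max(len(list(group)) for _, group in groupby(responses))

def flag_careless(data: dict[str, list[int]], cutoff: int = 8) -> list[str]:
    return [rid for rid, resp in data.items() if longest_run(resp) >= cutoff]

# Hypothetical 12-item survey on a 1-5 scale:
survey = {
    "r1": [3, 4, 2, 5, 3, 3, 4, 2, 1, 4, 3, 5],
    "r2": [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4],  # straight-lining
}
print(flag_careless(survey))  # ['r2']
```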
Data cleaning and machine learning: a systematic literature review
by Nikanjam, Amin; Khomh, Foutse; Humeniuk, Dmytro
in: Artificial Intelligence; Cleaning; Computer Science
2024
Machine Learning (ML) is integrated into a growing number of systems for various applications. Because the performance of an ML model is highly dependent on the quality of the data it has been trained on, there is a growing interest in approaches to detect and repair data errors (i.e., data cleaning). Researchers are also exploring how ML can be used for data cleaning; hence creating a dual relationship between ML and data cleaning. To the best of our knowledge, there is no study that comprehensively reviews this relationship. This paper’s objectives are twofold. First, it aims to summarize the latest approaches for data cleaning for ML and ML for data cleaning. Second, it provides future work recommendations. We conduct a systematic literature review of the papers published between 2016 and 2022 inclusively. We identify different types of data cleaning activities with and for ML: feature cleaning, label cleaning, entity matching, outlier detection, imputation, and holistic data cleaning. We summarize the content of 101 papers covering various data cleaning activities and provide 24 future work recommendations. Our review highlights many promising data cleaning techniques that can be further extended. We believe that our review of the literature will help the community develop better approaches to clean data.
Journal Article
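Two of the cleaning activities the review identifies, outlier detection and imputation, can be sketched in a few lines of scikit-learn. The review prescribes no particular library, so this pairing is illustrative only, on synthetic data.

```python
# Illustrative outlier detection + imputation on synthetic data.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:5] += 8                      # inject a few gross outliers
X[10:15, 1] = np.nan            # and some missing values

X_imputed = SimpleImputer(strategy="median").fit_transform(X)
mask = IsolationForest(random_state=0).fit_predict(X_imputed) == 1
X_clean = X_imputed[mask]       # keep only the predicted inliers
print(X.shape, "->", X_clean.shape)
```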
Scopus as a curated, high-quality bibliometric data source for academic research in quantitative science studies
by Karimi, Reza; Schotten, Michiel; Baas, Jeroen
in: abstract and citation database; Algorithms; Application programming interface
2020
Scopus is among the largest curated abstract and citation databases, with wide global and regional coverage of scientific journals, conference proceedings, and books, while ensuring that only the highest-quality data are indexed through rigorous content selection and re-evaluation by an independent Content Selection and Advisory Board. Additionally, extensive quality assurance processes continuously monitor and improve all data elements in Scopus. Besides enriched metadata records of scientific articles, Scopus offers comprehensive author and institution profiles, obtained from advanced profiling algorithms and manual curation, ensuring high precision and recall. The trustworthiness of Scopus has led to its use as a bibliometric data source for large-scale analyses in research assessments, research landscape studies, science policy evaluations, and university rankings. Scopus data have been offered free of charge for selected studies by the academic research community, for example through application programming interfaces, leading to many publications that employ Scopus data to investigate topics such as researcher mobility, network visualizations, and spatial bibliometrics. In June 2019, the International Center for the Study of Research was launched, with an advisory board of bibliometricians, aiming to work with the scientometric research community and to offer a virtual laboratory where researchers can utilize Scopus data.
Journal Article
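A minimal sketch of pulling records through the Scopus Search API mentioned above, using plain `requests`. The endpoint, the `X-ELS-APIKey` header, and the response layout follow Elsevier's public documentation as best recalled and may differ under your API agreement; the key is a placeholder.

```python
# Querying the Scopus Search API (requires an institutional API key;
# endpoint and field names per Elsevier's public docs, may have changed).
import requests

API_KEY = "your-elsevier-api-key"  # placeholder
resp = requests.get(
    "https://api.elsevier.com/content/search/scopus",
    headers={"X-ELS-APIKey": API_KEY, "Accept": "application/json"},
    params={"query": 'TITLE-ABS-KEY("data cleaning")', "count": 5},
)
resp.raise_for_status()
for entry in resp.json()["search-results"]["entry"]:
    print(entry.get("dc:title"))
```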
Study on Preprocessing Method of TCM Prescription Data in Data Mining
2021
Traditional Chinese medicine (TCM) prescriptions have been developed over thousands of years. Their data forms are diverse, the content is discrete and often missing, and cultural and regional differences introduce many uncertainties, all of which complicates the mining of TCM prescriptions. Taking 3,108 prescriptions for the treatment of typhoid fever as an example, this study focuses on the data cleaning and data transformation stages of preprocessing. It describes how unqualified prescription records are cleansed, drug names are normalized, doses are unified, and the data are structured, so that the processed data can be mined effectively. This provides strong support for exploring the compatibility laws of prescriptions and for the development of new drugs.
Journal Article
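The drug-name normalization and dose-unification steps described above can be sketched as a small parsing pass. The synonym table, unit conversion factors, and regular expression below are invented for illustration and are not taken from the paper.

```python
# Toy normalization pass: map drug-name variants to one canonical name,
# convert doses to grams, and drop unparseable ("unqualified") records.
import re
import pandas as pd

CANONICAL = {"gui zhi": "cinnamon twig", "guizhi": "cinnamon twig"}
TO_GRAMS = {"g": 1.0, "qian": 3.0, "liang": 30.0}  # assumed conversion factors

def normalize(entry: str) -> tuple[str, float] | None:
    m = re.match(r"\s*([a-z ]+?)\s*([\d.]+)\s*(g|qian|liang)\s*$", entry.lower())
    if not m:                       # unqualified record: drop it
        return None
    name, qty, unit = m.groups()
    return CANONICAL.get(name, name), float(qty) * TO_GRAMS[unit]

rows = [normalize(e) for e in ["Guizhi 3 qian", "gui zhi 9 g", "???"]]
df = pd.DataFrame([r for r in rows if r], columns=["drug", "dose_g"])
print(df)   # two unified rows; the unparseable record is discarded
```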
Increasing the Power of Your Study by Increasing the Effect Size
As in other social sciences, published findings in consumer research tend to overestimate the size of the effect being investigated, due to both file drawer effects and abuse of researcher degrees of freedom, including opportunistic analysis decisions. Given that most effect sizes are substantially smaller than would be apparent from published research, there has been a widespread call to increase power by increasing sample size. We propose that, aside from increasing sample size, researchers can also increase power by boosting the effect size. If done correctly, removing participants, using covariates, and optimizing experimental designs, stimuli, and measures can boost effect size without inflating researcher degrees of freedom. In fact, careful planning of studies and analyses to maximize effect size is essential to be able to study many psychologically interesting phenomena when massive sample sizes are not feasible.
Journal Article
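The trade-off the authors describe is easy to see numerically: for a two-sample t-test, power depends on the effect size times the square root of the per-group sample size, so doubling the effect size buys roughly what quadrupling the sample would. A quick check with statsmodels' power solver:

```python
# Power of a two-sided, two-sample t-test at alpha = .05, n = 100/group,
# for increasing standardized effect sizes (Cohen's d).
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
for d in (0.2, 0.3, 0.4):
    p = solver.power(effect_size=d, nobs1=100, alpha=0.05)
    print(f"d = {d}: power = {p:.2f} with n = 100 per group")
# Doubling d from 0.2 to 0.4 lifts power from ~0.29 to ~0.80 --
# the same gain as quadrupling the sample size at d = 0.2.
```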
SMOTE-WENN: Solving class imbalance and small sample problems by oversampling and distance scaling
2021
Many practical applications suffer from imbalanced data classification, in which the minority class has a degraded recognition rate. The primary causes are the sample scarcity of the minority class and the intrinsically complex distribution characteristics of imbalanced datasets. The imbalanced classification problem is even more serious on small sample datasets. To solve the problems of small samples and class imbalance together, a hybrid resampling method is proposed. The proposed method combines an oversampling approach (synthetic minority oversampling technique, SMOTE) and a novel data cleaning approach (weighted edited nearest neighbor rule, WENN). First, SMOTE generates synthetic minority class examples using linear interpolation. Then, WENN detects and deletes unsafe majority and minority class examples using a weighted distance function and the k-nearest neighbor (kNN) rule. The weighted distance function scales up a commonly used distance by considering local imbalance and spatial sparsity. Extensive experiments over synthetic and real datasets validate the superiority of the proposed SMOTE-WENN compared with three state-of-the-art resampling methods.
Journal Article
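imbalanced-learn ships the classic unweighted version of this combination, SMOTE followed by edited nearest neighbours (`SMOTEENN`); the paper's WENN, with its weighted distance function, has no off-the-shelf implementation. The sketch below therefore shows only the general oversample-then-clean resampling pattern on synthetic data.

```python
# SMOTE + edited-nearest-neighbour cleaning via imbalanced-learn's
# SMOTEENN (the classic, unweighted analogue of the paper's SMOTE-WENN).
from collections import Counter
from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))                       # heavily imbalanced
X_res, y_res = SMOTEENN(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))                   # rebalanced and cleaned
```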
Tools for Educational Data Mining: A Review
by Gasevic, Dragan; Slater, Stefan; Kovanovic, Vitomir
in: Bayesian Statistics; Computer Software; Computer Uses in Education
2017
In recent years, a wide array of tools have emerged for the purposes of conducting educational data mining (EDM) and/or learning analytics (LA) research. In this article, we hope to highlight some of the most widely used, most accessible, and most powerful tools available for the researcher interested in conducting EDM/LA research. We will highlight the utility that these tools have with respect to common data preprocessing and analysis steps in a typical research project as well as more descriptive information such as price point and user-friendliness. We will also highlight niche tools in the field, such as those used for Bayesian knowledge tracing (BKT), data visualization, text analysis, and social network analysis. Finally, we will discuss the importance of familiarizing oneself with multiple tools—a data analysis toolbox—for the practice of EDM/LA research.
Journal Article
PUResNet: prediction of protein-ligand binding sites using deep residual neural network
by Tayara, Hilal; Chong, Kil To; Kandel, Jeevan
in: Artificial neural networks; Binding site prediction; Binding sites
2021
Background
Predicting protein-ligand binding sites is a fundamental step in understanding the functional characteristics of proteins, which plays a vital role in elucidating different biological functions and is a crucial step in drug discovery. A protein exhibits its true nature after binding to its interacting molecule known as a ligand that binds only in the favorable binding site of the protein structure. Different computational methods exploiting the features of proteins have been developed to identify the binding sites in the protein structure, but none seems to provide promising results, and therefore, further investigation is required.
Results
In this study, we present a deep learning model PUResNet and a novel data cleaning process based on structural similarity for predicting protein-ligand binding sites. From the whole scPDB (an annotated database of druggable binding sites extracted from the Protein DataBank) database, 5020 protein structures were selected to address this problem, which were used to train PUResNet. With this, we achieved better and justifiable performance than the existing methods while evaluating two independent sets using distance, volume and proportion metrics.
Journal Article
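The paper defines PUResNet's actual architecture; purely as a reminder of the residual building block such networks stack, here is a minimal PyTorch sketch. The 3D convolutions reflect the volumetric protein-structure inputs, but the channel count and block layout are invented, not the authors'.

```python
# Minimal 3D residual block: two conv-norm layers plus a skip connection.
# Illustrative only; not the PUResNet architecture from the paper.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.body(x) + x)   # the residual skip connection

x = torch.randn(1, 16, 8, 8, 8)             # a batch of 3D feature maps
print(ResidualBlock(16)(x).shape)            # torch.Size([1, 16, 8, 8, 8])
```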
IoT in Healthcare: Achieving Interoperability of High-Quality Data Acquired by IoT Medical Devices
by Pitsios, Stamatios; Mavrogiorgou, Argyro; Perakis, Konstantinos
in: Automation; data cleaning; Data collection
2019
It is an undeniable fact that Internet of Things (IoT) technologies have become a milestone advancement in the digital healthcare domain: the number of IoT medical devices has grown exponentially, and it was anticipated that by 2020 over 161 million of them would be connected worldwide. In this era of continuous growth, IoT healthcare faces various challenges, such as the collection, quality estimation, interpretation, and harmonization of the data deriving from huge numbers of heterogeneous IoT medical devices. Even though various approaches have been developed for solving each of these challenges individually, none proposes a holistic approach for achieving data interoperability between high-quality data deriving from heterogeneous devices. For that reason, this manuscript presents a mechanism that addresses the intersection of these challenges. The mechanism first collects the different devices' datasets and cleans them. The cleaning results are then used to estimate the overall data quality of each dataset, in combination with measurements of the availability and reliability of the device that produced it. Only the high-quality data are kept and translated into a common format, ready for further use. The proposed mechanism is evaluated through a specific scenario, producing reliable results: it achieves data interoperability with 100% accuracy and data quality with more than 90% accuracy.
Journal Article
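The keep-only-high-quality-data step described above can be caricatured in a few lines: score each device's dataset on completeness together with the device's reliability, then keep datasets above a threshold. The weights, threshold, and example readings below are all invented for illustration.

```python
# Toy dataset-quality gate: weight data completeness against device
# reliability and keep only datasets scoring above a cutoff.
import pandas as pd

def quality_score(df: pd.DataFrame, device_reliability: float) -> float:
    completeness = 1.0 - df.isna().mean().mean()   # fraction of non-missing cells
    return 0.7 * completeness + 0.3 * device_reliability

datasets = {
    "pulse_oximeter": (pd.DataFrame({"spo2": [98, 97, 99]}), 0.95),
    "legacy_sensor":  (pd.DataFrame({"hr": [None, None, 72]}), 0.40),
}
kept = {name: df for name, (df, rel) in datasets.items()
        if quality_score(df, rel) >= 0.8}
print(list(kept))   # only the high-quality device's data survives
```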