Catalogue Search | MBRL
52,323 result(s) for "Data science methodologies"
Meta-analysis accelerator: a comprehensive tool for statistical data conversion in systematic reviews with meta-analysis
by
Abbas, Abdallah
,
Hefnawy, Mahmoud Tarek
,
Negida, Ahmed
in
Accuracy
,
Data analysis
,
Data conversion
2024
Background
Systematic review with meta-analysis integrates findings from multiple studies, offering robust conclusions on treatment effects and guiding evidence-based medicine. However, the process is often hampered by challenges such as inconsistent data reporting, complex calculations, and time constraints. Researchers must convert various statistical measures into a common format, which can be error-prone and labor-intensive without the right tools.
Implementation
Meta-Analysis Accelerator was developed to address these challenges. The tool offers 21 different statistical conversions, including median & interquartile range (IQR) to mean & standard deviation (SD), standard error of the mean (SEM) to SD, and confidence interval (CI) to SD for one and two groups, among others. It is designed with an intuitive interface, ensuring that users can navigate the tool easily and perform conversions accurately and efficiently. The website structure includes a home page, conversion page, request a conversion feature, about page, articles page, and privacy policy page. This comprehensive design supports the tool’s primary goal of simplifying the meta-analysis process.
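The conversions named above follow standard identities from the meta-analysis literature. A minimal sketch of three of them (the tool's own implementation is not published in the abstract, so these are illustrative equivalents; the median/IQR approximation is the one popularized by Wan et al. 2014, one of several in use):

```python
import math

def sem_to_sd(sem: float, n: int) -> float:
    """SD from the standard error of the mean: SD = SEM * sqrt(n)."""
    return sem * math.sqrt(n)

def ci95_to_sd(lower: float, upper: float, n: int) -> float:
    """SD from a 95% CI around a mean: SD = sqrt(n) * (upper - lower) / 3.92."""
    return math.sqrt(n) * (upper - lower) / 3.92

def median_iqr_to_mean_sd(q1: float, median: float, q3: float):
    """Approximate mean and SD from median and IQR (Wan et al. 2014):
    mean ~ (q1 + median + q3) / 3, SD ~ (q3 - q1) / 1.35."""
    return (q1 + median + q3) / 3, (q3 - q1) / 1.35

print(sem_to_sd(2.0, 25))  # SEM 2.0 with n = 25 gives SD 10.0
```

The 3.92 divisor is 2 × 1.96, the two-sided 95% normal quantile; refinements for small samples exist, but the simple forms above match common practice (e.g. the Cochrane Handbook).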
Results
Since its initial release in October 2023 as Meta Converter and subsequent renaming to Meta-Analysis Accelerator, the tool has gained widespread use globally. From March 2024 to May 2024, it received 12,236 visits from countries such as Egypt, France, Indonesia, and the USA, indicating its international appeal and utility. Approximately 46% of the visits were direct, reflecting its popularity and trust among users.
Conclusions
Meta-Analysis Accelerator significantly enhances the efficiency and accuracy of meta-analysis of systematic reviews by providing a reliable platform for statistical data conversion. Its comprehensive variety of conversions, user-friendly interface, and continuous improvements make it an indispensable resource for researchers. The tool’s ability to streamline data transformation ensures that researchers can focus more on data interpretation and less on manual calculations, thus advancing the quality and ease of conducting systematic reviews and meta-analyses.
Journal Article
An automated approach to predict diabetic patients using KNN imputation and effective data mining techniques
2024
Diabetes is thought to be the most common illness in underdeveloped nations. Early detection and competent medical care are crucial steps in reducing the effects of diabetes. Examining the signs associated with diabetes is one of the most effective ways to identify the condition. The problem of missing data is not well investigated in existing works. In addition, existing studies on diabetes detection lack accuracy and robustness. The available datasets frequently contain missing information for the automated detection of diabetes, which might negatively impact machine learning model performance. To address this problem, this work proposes an automated diabetes prediction method that achieves high accuracy and effectively manages missing values. The proposed strategy employs a stacked ensemble voting classifier built from three machine learning models, with a KNN imputer to handle missing values. Using the KNN imputer, the proposed model performs exceptionally well, with accuracy, precision, recall, F1 score, and MCC of 98.59%, 99.26%, 99.75%, 99.45%, and 99.24%, respectively. The study thoroughly compared the proposed model with seven other machine learning techniques in two scenarios: one with missing values eliminated and the other with the KNN imputer applied. The outcomes demonstrate the superiority of the proposed model over current state-of-the-art methods and confirm its efficacy. This work demonstrates the capability of the KNN imputer and examines the problem of missing values in diabetes detection. Medical professionals can utilize the results to improve care for diabetes patients and discover problems early.
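The described pipeline, a KNN imputer feeding a soft-voting ensemble, can be sketched with scikit-learn. The abstract does not name its three base learners, so logistic regression, a decision tree, and a random forest are assumed here, and the data are a synthetic stand-in for a diabetes dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-outcome data with ~10% of entries deleted at random.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

clf = make_pipeline(
    KNNImputer(n_neighbors=5),  # fill each gap from the 5 nearest rows
    VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("dt", DecisionTreeClassifier(random_state=0)),
            ("rf", RandomForestClassifier(random_state=0)),
        ],
        voting="soft",  # average the predicted class probabilities
    ),
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf.fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 2))
```

Fitting the imputer inside the pipeline keeps the imputation step inside cross-validation, avoiding leakage from test rows into the imputed training values.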
Journal Article
Effects of missing data imputation methods on univariate blood pressure time series data analysis and forecasting with ARIMA and LSTM
by
Niako, Nicholas
,
Maestre, Gladys E.
,
Melgarejo, Jesus D.
in
Algorithms
,
Ambulatory blood pressure
,
Analysis
2024
Background
Missing observations within a univariate time series are common in real-life data and cause analytical problems in the flow of the analysis. Imputation of missing values is an inevitable step in the analysis of every incomplete univariate time series. Most of the existing studies focus on comparing the distributions of imputed data. There is a gap in knowledge on how different imputation methods for univariate time series affect the forecasting performance of time series models. We evaluated the prediction performance of autoregressive integrated moving average (ARIMA) and long short-term memory (LSTM) network models on time series data imputed using ten different imputation techniques.
Methods
Missing values were generated under missing completely at random (MCAR) mechanism at 10%, 15%, 25%, and 35% rates of missingness using complete data of 24-h ambulatory diastolic blood pressure readings. The performance of the mean, Kalman filtering, linear, spline, and Stineman interpolations, exponentially weighted moving average (EWMA), simple moving average (SMA), k-nearest neighborhood (KNN), and last-observation-carried-forward (LOCF) imputation techniques on the time series structure and the prediction performance of the LSTM and ARIMA models were compared on imputed and original data.
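The MCAR generation step and several of the simpler imputation techniques can be sketched with pandas on a synthetic stand-in for the blood pressure series (the study's actual readings and its Kalman, Stineman, EWMA, and KNN methods are not reproduced here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic stand-in for a complete 24-h diastolic blood pressure series.
series = pd.Series(75 + 5 * np.sin(np.linspace(0, 6, 96)) + rng.normal(0, 2, 96))

def make_mcar(s: pd.Series, rate: float, rng) -> pd.Series:
    """Delete values completely at random (MCAR) at the given rate."""
    out = s.copy()
    out[rng.random(len(s)) < rate] = np.nan
    return out

gappy = make_mcar(series, 0.25, rng)  # 25% missingness scenario
imputed = {
    "mean": gappy.fillna(gappy.mean()),
    "locf": gappy.ffill().bfill(),  # last observation carried forward
    "linear": gappy.interpolate("linear", limit_direction="both"),
    # 5-point centred simple moving average, mean fallback for all-NaN windows
    "sma": gappy.fillna(gappy.rolling(5, min_periods=1, center=True).mean())
                .fillna(gappy.mean()),
}
for name, s in imputed.items():
    print(name, int(s.isna().sum()))  # every method leaves no gaps
```

Each imputed series can then be fed to an ARIMA or LSTM fit and its forecasts compared against the model fitted on the original complete series, which is the study's evaluation design.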
Results
All imputation techniques either increased or decreased the autocorrelation of the data and thereby affected the forecasting performance of the ARIMA and LSTM algorithms. The best-performing imputation technique did not guarantee better predictions on the imputed data. The mean imputation, LOCF, KNN, Stineman, and cubic spline interpolation methods performed better for small rates of missingness. Interpolation with EWMA and Kalman filtering yielded consistent performance across all scenarios of missingness. Regardless of imputation method, the LSTM achieved slightly better predictive accuracy among the best-performing ARIMA and LSTM models; otherwise, the results varied. In our small sample, ARIMA tended to perform better on data with higher autocorrelation.
Conclusions
We recommend that researchers consider Kalman smoothing, interpolation (linear, spline, and Stineman), and moving average (SMA and EWMA) techniques for imputing univariate time series data, as they perform well both on data distribution and on forecasting with ARIMA and LSTM models. The LSTM slightly outperforms ARIMA models; however, for small samples, ARIMA is simpler and faster to execute.
Journal Article
MATLAB for neuroscientists : an introduction to scientific computing in MATLAB
by
Dickey, Adam Seth
,
Benayoun, Marc D
,
Lusignan, Michael E
in
Computer science -- Methodology
,
Data processing
,
MATLAB
2014,2013,2008
This is the first comprehensive teaching resource and textbook for the teaching of MATLAB in the neurosciences and in psychology. MATLAB is unique in that it can be used to learn the entire empirical and experimental process, including stimulus generation, experimental control, data collection, data analysis and modeling. Thus a wide variety of computational problems can be addressed in a single programming environment. The idea is to empower advanced undergraduates and beginning graduate students by allowing them to design and implement their own analytical tools. As students advance in their research careers, they will have achieved the fluency required to understand and adapt more specialized tools as opposed to treating them as "black boxes".
A novel MissForest-based missing values imputation approach with recursive feature elimination in medical applications
2024
Background
Missing values in datasets present significant challenges for data analysis, particularly in the medical field where data accuracy is crucial for patient diagnosis and treatment. Although MissForest (MF) has demonstrated efficacy in imputation research and recursive feature elimination (RFE) has proven effective in feature selection, the potential for enhancing MF through RFE integration remains unexplored.
Methods
This study introduces a novel imputation method, “recursive feature elimination-MissForest” (RFE-MF), designed to enhance imputation quality by reducing the impact of irrelevant features. A comparative analysis is conducted between RFE-MF and four classical imputation methods: mean/mode, k-nearest neighbors (kNN), multiple imputation by chained equations (MICE), and MF. The comparison is carried out across ten medical datasets containing both numerical and mixed data types. Different missing data rates, ranging from 10% to 50%, are evaluated under the missing completely at random (MCAR) mechanism. The performance of each method is assessed using two evaluation metrics: normalized root mean squared error (NRMSE) and predictive fidelity criterion (PFC). Additionally, paired samples t-tests are employed to analyze the statistical significance of differences among the outcomes.
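A minimal sketch of the two evaluation metrics, assuming the usual definitions from the MissForest literature: NRMSE normalized by the standard deviation of the true values (one common normalization choice), and PFC as the proportion of falsely classified categorical entries:

```python
import numpy as np

def nrmse(true: np.ndarray, imputed: np.ndarray, mask: np.ndarray) -> float:
    """Normalized RMSE over the imputed numerical entries (mask = True where
    a value was missing), normalized by the SD of the true values there."""
    diff = true[mask] - imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)) / np.std(true[mask]))

def pfc(true: np.ndarray, imputed: np.ndarray, mask: np.ndarray) -> float:
    """Proportion of falsely classified imputed categorical entries."""
    return float(np.mean(true[mask] != imputed[mask]))
```

Both metrics are computed only over the originally missing positions, so a perfect imputation scores 0 on each.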
Results
The findings indicate that RFE-MF demonstrates superior performance across the majority of datasets when compared to four classical imputation methods (mean/mode, kNN, MICE, and MF). Notably, RFE-MF consistently outperforms the original MF, irrespective of variable type (numerical or categorical). Mean/mode imputation exhibits consistent performance across various scenarios. Conversely, the efficacy of kNN imputation fluctuates in relation to varying missing data rates.
Conclusion
This study demonstrates that RFE-MF holds promise as an effective imputation method for medical datasets, providing a novel approach to addressing missing data challenges in medical applications.
Journal Article
Predicting diabetes in adults: identifying important features in unbalanced data over a 5-year cohort study using machine learning algorithm
by
Jahani, Yones
,
Talebi Moghaddam, Maryam
,
Arefzadeh, Zahra
in
Adult
,
Algorithm-level method
,
Algorithms
2024
Background
Imbalanced datasets pose significant challenges in predictive modeling, leading to biased outcomes and reduced model reliability. This study addresses data imbalance in diabetes prediction using machine learning techniques. Utilizing data from the Fasa Adult Cohort Study (FACS) with a 5-year follow-up of 10,000 participants, we developed predictive models for Type 2 diabetes.
Methods
We employed various data-level and algorithm-level interventions, including SMOTE, ADASYN, SMOTEENN, Random Over Sampling and KMeansSMOTE, paired with Random Forest, Gradient Boosting, Decision Tree and Multi-Layer Perceptron (MLP) classifier. We evaluated model performance using F1 score, AUC, and G-means—metrics chosen to provide a comprehensive assessment of model accuracy, discrimination ability, and overall balance in performance, particularly in the context of imbalanced datasets.
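The G-means metric is less standard than F1 or AUC; a minimal sketch of it as the geometric mean of sensitivity and specificity (the oversamplers themselves, e.g. SMOTE and ADASYN, are available in the imbalanced-learn package and are not re-implemented here):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def g_means(y_true, y_pred) -> float:
    """Geometric mean of sensitivity and specificity for a binary task;
    unlike accuracy, it collapses to 0 if either class is ignored."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return float(np.sqrt(sensitivity * specificity))

print(g_means([0, 0, 1, 1], [0, 1, 1, 1]))  # sens 1.0, spec 0.5 -> ~0.707
```

This balance property is why G-means is favoured for imbalanced datasets: a classifier that predicts only the majority class gets a G-means of 0 regardless of its raw accuracy.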
Results
Our study uncovered key factors influencing diabetes risk and evaluated the performance of various machine learning models. Feature importance analysis revealed that the most influential predictors of diabetes differ between males and females. For females, the most important factors are triglyceride (TG), basal metabolic rate (BMR), and total cholesterol (CHOL), whereas for males, the key predictors are body mass index (BMI), serum glutamate oxaloacetate transaminase (SGOT), and gamma-glutamyl transferase (GGT). Across the entire dataset, BMI remains the most important variable, followed by SGOT, BMR, and energy intake. These insights suggest that gender-specific risk profiles should be considered in diabetes prevention and management strategies. In terms of model performance, our results show that ADASYN with the MLP classifier achieved an F1 score of 82.17 ± 3.38, AUC of 89.61 ± 2.09, and G-means of 89.15 ± 2.31. SMOTE with MLP followed closely with an F1 score of 79.85 ± 3.91, AUC of 89.7 ± 2.54, and G-means of 89.31 ± 2.78. The SMOTEENN with Random Forest combination achieved an F1 score of 78.27 ± 1.54, AUC of 87.18 ± 1.12, and G-means of 86.47 ± 1.28.
Conclusion
These combinations effectively address class imbalance, improving the accuracy and reliability of diabetes predictions. The findings highlight the importance of using appropriate data-balancing techniques in medical data analysis.
Journal Article
Simulating hierarchical data to assess the utility of ecological versus multilevel analyses in obtaining individual-level causal effects
by
Arnold, Kellyn F.
,
Lawniczak, Wiktoria
,
Kakampakou, Lydia
in
Agent-based modelling
,
Big data
,
Causal inference
2025
Understanding causality, over mere association, is vital for researchers wishing to inform policy and decision making – for example, when seeking to improve population health outcomes. Yet, contemporary causal inference methods have not fully tackled the complexity of data hierarchies, such as the clustering of people within households, neighbourhoods, cities, or regions. However, complex data hierarchies are the rule rather than the exception. Gaining an understanding of these hierarchies is important for complex population outcomes, such as non-communicable disease, which is impacted by various social determinants at different levels of the data hierarchy. The alternative of analysing aggregated data could introduce well-known biases, such as the ecological fallacy or the modifiable areal unit problem. We devise a hierarchical causal diagram that encodes the multilevel data generating mechanism anticipated when evaluating non-communicable diseases in a population. The causal diagram informs data simulation. We also provide a flexible tool to generate synthetic population data that captures all multilevel causal structures, including a cross-level effect due to cluster size. For the very first time, we can then quantify the ecological fallacy within a formal causal framework to show that individual-level data are essential to assess causal relationships that affect the individual. This study also illustrates the importance of causally structured synthetic data for use with other methods, such as Agent Based Modelling or Microsimulation Modelling. Many methodological challenges remain for robust causal evaluation of multilevel data, but this study provides a foundation to investigate these.
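The ecological fallacy the authors quantify can be illustrated with a small simulation, assuming an invented two-level data generating process: within every cluster the individual-level effect of an exposure is negative, but cluster intercepts rise with the cluster mean of the exposure, so an ecological (aggregate-level) regression reverses the sign of the effect:

```python
import numpy as np

rng = np.random.default_rng(1)
# 20 clusters ("neighbourhoods"), 50 people each.
n_clusters, n_per = 20, 50
cluster_means = np.linspace(0, 10, n_clusters)
x = np.concatenate([m + rng.normal(0, 1, n_per) for m in cluster_means])
cluster_id = np.repeat(np.arange(n_clusters), n_per)
# Individual effect of x on y is -1; cluster intercept grows as 3 * mean(x).
y = -1.0 * x + 3.0 * cluster_means[cluster_id] + rng.normal(0, 0.5, x.size)

# Ecological analysis: regress cluster means of y on cluster means of x.
xm = np.array([x[cluster_id == c].mean() for c in range(n_clusters)])
ym = np.array([y[cluster_id == c].mean() for c in range(n_clusters)])
eco_slope = np.polyfit(xm, ym, 1)[0]

# Individual-level (within-cluster) analysis on centred data.
xc = x - xm[cluster_id]
yc = y - ym[cluster_id]
ind_slope = np.polyfit(xc, yc, 1)[0]
print(round(eco_slope, 1), round(ind_slope, 1))  # positive vs negative slope
```

Aggregation discards the within-cluster variation that carries the individual-level effect, which is why the abstract argues individual-level data are essential for causal questions about individuals.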
Journal Article
Artificial intelligence methods applied to longitudinal data from electronic health records for prediction of cancer: a scoping review
by
de Kamps, Marc
,
Moglia, Victoria
,
Cook, Gordon
in
Artificial Intelligence
,
Cancer
,
Data mining
2025
Background
Early detection and diagnosis of cancer are vital to improving outcomes for patients. Artificial intelligence (AI) models have shown promise in the early detection and diagnosis of cancer, but there is limited evidence on methods that fully exploit the longitudinal data stored within electronic health records (EHRs). This review aims to summarise methods currently utilised for prediction of cancer from longitudinal data and provides recommendations on how such models should be developed.
Methods
The review was conducted following PRISMA-ScR guidance. Six databases (MEDLINE, EMBASE, Web of Science, IEEE Xplore, PubMed and SCOPUS) were searched for relevant records published before 2/2/2024. Search terms related to the concepts “artificial intelligence”, “prediction”, “health records”, “longitudinal”, and “cancer”. Data were extracted relating to several areas of the articles: (1) publication details, (2) study characteristics, (3) input data, (4) model characteristics, (5) reproducibility, and (6) quality assessment using the PROBAST tool. Models were evaluated against a framework for terminology relating to reporting of cancer detection and risk prediction models.
Results
Of 653 records screened, 33 were included in the review; 10 predicted risk of cancer, 18 performed either cancer detection or early detection, 4 predicted recurrence, and 1 predicted metastasis. The most common cancers predicted in the studies were colorectal (n = 9) and pancreatic cancer (n = 9). Sixteen studies used feature engineering to represent temporal data, with the most common features representing trends. Eighteen used deep learning models which take a direct sequential input, most commonly recurrent neural networks, but also including convolutional neural networks and transformers. Prediction windows and lead times varied greatly between studies, even for models predicting the same cancer. High risk of bias was found in 90% of the studies. This risk was often introduced due to inappropriate study design (n = 26) and sample size (n = 26).
Conclusion
This review highlights the breadth of approaches to cancer prediction from longitudinal data. We identify areas where reporting of methods could be improved, particularly regarding where in a patient's trajectory the model is applied. The review shows opportunities for further work, including comparison of these approaches and their applications in other cancers.
Journal Article
A generative model for evaluating missing data methods in large epidemiological cohorts
by
Smith, Stephen M.
,
Radosavljević, Lav
,
Nichols, Thomas E.
in
Algorithms
,
Biobanks
,
Biological Specimen Banks - statistics & numerical data
2025
Background
The potential value of large scale datasets is constrained by the ubiquitous problem of missing data, arising in either a structured or unstructured fashion. When imputation methods are proposed for large scale data, one limitation is the simplicity of existing evaluation methods. Specifically, most evaluations create synthetic data with only a simple, unstructured missing data mechanism which does not resemble the missing data patterns found in real data. For example, in the UK Biobank missing data tends to appear in blocks, because non-participation in one of the sub-studies leads to missingness for all sub-study variables.
Methods
We propose a tool for generating mixed type missing data mimicking key properties of a given real large scale epidemiological data set with both structured and unstructured missingness while accounting for informative missingness. The process involves identifying sub-studies using hierarchical clustering of missingness patterns and modelling the dependence of inter-variable correlation and co-missingness patterns.
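The sub-study identification step, hierarchical clustering of missingness patterns, can be sketched on toy data: variables whose missingness indicators are similar (high co-missingness) end up in the same cluster. The block structure and the Jaccard distance used here are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Toy missingness mask: variables 0-2 belong to one "sub-study" and go
# missing together for ~half the participants; variables 3-4 form a
# second block with ~30% non-participation.
n = 200
mask = np.zeros((n, 5), dtype=bool)
mask[rng.random(n) < 0.5, 0:3] = True
mask[rng.random(n) < 0.3, 3:5] = True

# Cluster variables by how similar their missingness indicators are.
dist = pdist(mask.T.astype(float), metric="jaccard")
labels = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")
print(labels)  # variables 0-2 share one label, 3-4 share the other
```

Each recovered block can then be treated as a sub-study whose variables are missing jointly, which is the structured missingness the generative model needs to reproduce.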
Results
On the UK Biobank brain imaging cohort, we identify several large blocks of missing data. We demonstrate the use of our tool for evaluating several imputation methods, showing modest accuracy of imputation overall, with iterative imputation having the best performance. We compare our evaluations based on synthetic data to an exemplar study which includes variable selection on a single real imputed dataset, finding only small differences between the imputation methods though with iterative imputation leading to the most informative selection of variables.
Conclusions
We have created a framework for simulating large scale data that captures the complexities of the inter-variable dependence as well as structured and unstructured informative missingness. Evaluations using this framework highlight the immense challenge of data imputation in this setting and the need for improved missing data methods.
Journal Article
The reporting quality and methodological quality of dynamic prediction models for cancer prognosis
2025
Background
To evaluate the reporting quality and methodological quality of dynamic prediction model (DPM) studies on cancer prognosis.
Methods
Extensive search for DPM studies on cancer prognosis was conducted in MEDLINE, EMBASE, and the Cochrane Library databases. The Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) and the Prediction model Risk of Bias Assessment Tool (PROBAST) were used to assess reporting quality and methodological quality, respectively.
Results
A total of 34 DPM studies were identified since the first publication in 2005; the main modeling methods for DPMs included the landmark model and the joint model. Regarding the reporting quality, the median overall TRIPOD adherence score was 75%. The TRIPOD items were poorly reported, especially the title (23.53%), model specification, including presentation (55.88%) and interpretation (50%) of the DPM usage, and implications for clinical use and future research (29.41%). Concerning methodological quality, most studies were of low quality (n = 30) or unclear (n = 3), mainly due to statistical analysis issues.
Conclusions
The landmark model and joint model show potential in DPM. The suboptimal reporting and methodological quality of current DPM studies should be improved to facilitate clinical application.
Journal Article