2,487 result(s) for "Data Drift"
A Hybrid Framework for Real-Time Data Drift and Anomaly Identification Using Hierarchical Temporal Memory and Statistical Tests
Data drift refers to the phenomenon where the generating model behind the data changes over time. Due to data drift, any model built on past training data becomes less relevant and inaccurate over time. Thus, detecting and controlling for data drift is critical in machine learning models. Hierarchical Temporal Memory (HTM) is a machine learning model developed by Jeff Hawkins, inspired by how the human brain processes information. It is a biologically inspired model of memory, similar in structure to the neocortex, whose performance is claimed to be comparable to that of state-of-the-art models in detecting anomalies in time series data. Another unique benefit of HTMs is their independence from training and testing cycles; all the learning takes place online with streaming data, and no separate training and testing cycle is required. In the sequential learning paradigm, the Sequential Probability Ratio Test (SPRT) offers unique benefits for online learning and inference. This paper proposes a novel hybrid framework combining HTM and SPRT for real-time data drift detection and anomaly identification. Unlike existing data drift methods, our approach eliminates frequent retraining and ensures low false positive rates. HTMs currently work with one-dimensional (univariate) data. In a second study, we also propose an application of HTM in a multidimensional supervised scenario for anomaly detection by combining the outputs of multiple HTM columns, one for each data dimension, through a neural network. Experimental evaluations demonstrate that the proposed method outperforms conventional drift detection techniques such as the Kolmogorov-Smirnov (KS) test, Wasserstein distance, and Population Stability Index (PSI) in terms of accuracy, adaptability, and computational efficiency. Our experiments also provide insights into optimizing hyperparameters for real-time deployment in domains such as telecom.
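The abstract does not include the HTM+SPRT framework itself, but the SPRT component it builds on is standard. Below is a minimal, self-contained sketch of Wald's SPRT for deciding between two Gaussian means on a stream — an illustration of sequential drift testing under our own assumptions (function name, parameters, and the Gaussian setting are ours), not the paper's implementation:

```python
import math
import random

def sprt_gaussian(stream, mu0=0.0, mu1=1.0, sigma=1.0, alpha=0.01, beta=0.01):
    """Wald's SPRT for H0: mean = mu0 vs. H1: mean = mu1 on a Gaussian stream.

    Returns (decision, n_samples_used). Thresholds use Wald's approximations
    A = (1 - beta) / alpha and B = beta / (1 - alpha).
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this
    llr, n = 0.0, 0
    for n, x in enumerate(stream, 1):
        # Log-likelihood-ratio increment for one Gaussian observation
        llr += ((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
        if llr >= upper:
            return "H1", n   # the mean has shifted: drift
        if llr <= lower:
            return "H0", n   # no shift detected
    return "undecided", n

random.seed(42)
shifted = (random.gauss(1.0, 1.0) for _ in range(1000))
decision, n = sprt_gaussian(shifted)
print(decision, n)  # "H1" with overwhelming probability, after a handful of samples
```

Because the test is sequential, it stops as soon as the evidence crosses either threshold, which is what makes it attractive for online inference.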
A Novel Method for Drift Detection in Streaming Data Based on Measurement of Changes in Feature Ranks
Hidden changes in the data stream are unknown to learning algorithms and are referred to in the literature as drifts of various types. The accuracy of a classifier may degrade due to the occurrence of drift in non-stationary data streams. In such situations, the classifier must detect significant data changes and adjust its predictions. This article presents a new method of drift detection based on analyzing changes in feature ranks across adjacent chunks of data. The proposed strategy involves determining the rank of the most important feature and tracking its fluctuations across the chunks into which the input data stream is divided. Changes in feature rankings between adjacent chunks serve as symptoms of data drift. The Least Absolute Shrinkage and Selection Operator (LASSO) procedure is proposed as an efficient rank indicator. In comparative studies, we evaluated our approach against well-known and popular drift detection algorithms such as the Drift Detection Method (DDM), Early Drift Detection Method (EDDM), ADaptive WINdowing (ADWIN), and Principal Component Analysis Feature Drift Detection (PCA-FDD). The tests were conducted on artificial data streams of different drift types (sudden, gradual, recurring, and incremental) as well as on real data, using both two-class and multi-class datasets. The experiments confirm that the proposed feature drift detection strategy produces valuable results.
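The core idea — rank features per chunk with LASSO and flag a drift when the top rank changes between adjacent chunks — can be sketched in a few lines. This is our own minimal illustration with scikit-learn, not the authors' code; the function names, chunking scheme, and alpha value are assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

def top_feature(X, y, alpha=0.1):
    """Rank features by |LASSO coefficient| and return the index of the top one."""
    model = Lasso(alpha=alpha).fit(X, y)
    return int(np.argmax(np.abs(model.coef_)))

def detect_feature_rank_drift(chunks, alpha=0.1):
    """Flag a drift at every chunk whose top-ranked feature differs from its predecessor's."""
    drifts, prev = [], None
    for i, (X, y) in enumerate(chunks):
        top = top_feature(X, y, alpha)
        if prev is not None and top != prev:
            drifts.append(i)
        prev = top
    return drifts

rng = np.random.default_rng(1)
def make_chunk(important, n=500, d=5):
    # Synthetic chunk where one feature drives the target
    X = rng.normal(size=(n, d))
    y = 3.0 * X[:, important] + rng.normal(scale=0.1, size=n)
    return X, y

# Feature 0 drives y in the first three chunks, feature 2 afterwards
chunks = [make_chunk(0) for _ in range(3)] + [make_chunk(2) for _ in range(3)]
print(detect_feature_rank_drift(chunks))  # [3]: drift flagged at the fourth chunk
```

A fuller version would track the whole ranking rather than only the top feature, but the detection signal is the same: a rank change between adjacent chunks.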
A generalized three-tier hybrid model for classifying unseen (IoT devices) in smart home environments
Data drift caused by network changes, new device additions, or model degradation alters the patterns learned by ML/DL models, resulting in poor classification performance. This creates the need for a generalized, drift-resilient model that can learn without retraining in dynamic environments. To maintain high accuracy, such a model must classify previously unseen IoT devices effectively. In this study, we propose a three-tier incremental architecture (CNN-PN-RF) combining a Convolutional Neural Network (CNN) for feature extraction, a Prototypical Network (PN) for class embedding, and a Random Forest (RF) for robust classification. The model utilizes six aggregated, diverse IoT datasets. Two similarly structured datasets (Dataset 1 and Dataset 2) were created from them, differing in training-testing splits, with some device CSV files withheld to test classification of unseen devices. Phase 1 employs a stand-alone CNN-based model with L2 regularization, dropout, and early stopping, achieving 70.96% accuracy. Phase 2 integrates the CNN with RF, using SMOTE for class balancing and PCA for dimensionality reduction, attaining 83.79% accuracy. Phase 3 introduces the PN to finalize the CNN-PN-RF model, addressing issues of feature clustering, intra-class separability, and small-class support. The final accuracy, precision, recall, and F1-score were 99.56%, 99.66%, 99.56%, and 99.59% for Dataset 1, and 99.80% for all metrics on Dataset 2. The model was compared with state-of-the-art approaches and validated on unseen IoT subsets of both datasets, showing better generalization capability.
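The prototypical-network stage can be illustrated independently of the full CNN-PN-RF pipeline: class prototypes are the mean embeddings of each class's support examples, and queries are assigned to the nearest prototype, which is what enables few-shot handling of unseen device classes. A minimal sketch with synthetic 2-D embeddings (our own toy example, not the paper's model):

```python
import numpy as np

def prototypes(embeddings, labels):
    """Class prototypes: the mean embedding of each class's support examples."""
    classes = np.unique(labels)
    return classes, np.stack([embeddings[labels == c].mean(axis=0) for c in classes])

def classify(queries, classes, protos):
    """Assign each query embedding to the class of its nearest prototype
    (Euclidean distance, as in standard prototypical networks)."""
    d = np.linalg.norm(queries[:, None, :] - protos[None, :, :], axis=-1)
    return classes[np.argmin(d, axis=1)]

rng = np.random.default_rng(0)
# Three device classes, each represented by only 5 support embeddings
centers = {0: [0, 0], 1: [5, 0], 2: [0, 5]}
emb = np.concatenate([rng.normal(c, 0.3, size=(5, 2)) for c in centers.values()])
lab = np.repeat([0, 1, 2], 5)

classes, protos = prototypes(emb, lab)
queries = rng.normal(centers[2], 0.3, size=(10, 2))  # samples from the few-shot class
print(classify(queries, classes, protos))  # all assigned to class 2
```

Because a prototype is just a mean, adding a new device class only requires a handful of support embeddings, with no retraining of the embedding network.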
Principal graph embedding convolutional recurrent network for traffic flow prediction
As an essential part of traffic management, traffic flow prediction attracts worldwide attention in intelligent traffic systems (ITSs). Complicated spatial dependencies due to well-connected road networks and time-varying traffic dynamics make this problem extremely challenging. Recent works have focused on modeling this complicated spatial-temporal dependence through graph neural networks with a fixed weighted graph or an adaptive adjacency matrix. However, fixed-graph methods cannot address data drift due to changes in the road network structure, and adaptive methods are time-consuming and prone to overfitting because the learning algorithm thoroughly optimizes the adaptive matrix. To address this issue, we propose a principal graph embedding convolutional recurrent network (PGECRN) for accurate traffic flow prediction. First, we propose the adjacency matrix graph embedding (AMGE) generation algorithm to solve the data drift problem. AMGE can model the distribution of spatiotemporal series after data drift by extracting the principal components of the original adjacency matrix and performing an adaptive transformation. At the same time, it has fewer parameters, alleviating overfitting. Second, beyond the essential spatial correlations, traffic flow data are also temporally dynamic. We capture this temporal variation by integrating gated recurrent units (GRUs) with AMGE to form the proposed model. Finally, PGECRN is evaluated on two real-world highway datasets, PeMSD4 and PeMSD8. Compared with existing baselines, the better prediction accuracy of our model shows that it can model traffic flow accurately and efficiently.
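The AMGE algorithm itself is not specified in the abstract, but its underlying idea — extracting principal components of the adjacency matrix as a compact, low-parameter graph representation — can be sketched as follows. This is a rough illustration under our own assumptions (plain PCA of node connectivity rows on a toy graph), not the paper's method:

```python
import numpy as np

def adjacency_pca_embedding(A, k=2):
    """Embed a symmetric weighted adjacency matrix into k dimensions
    by projecting each node's connectivity row onto the top-k principal components."""
    A = np.asarray(A, dtype=float)
    A_centered = A - A.mean(axis=0, keepdims=True)
    # Covariance of node connectivity patterns
    cov = A_centered.T @ A_centered / A.shape[0]
    vals, vecs = np.linalg.eigh(cov)               # ascending eigenvalues
    top = vecs[:, np.argsort(vals)[::-1][:k]]      # top-k principal directions
    return A_centered @ top                        # one k-dim embedding per node

# Toy road network: two well-connected clusters joined by a single edge
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

emb = adjacency_pca_embedding(A, k=2)
print(emb.shape)  # (6, 2)
```

Keeping only the leading components gives each node a small, fixed-size representation, which is where the claimed parameter savings over a fully adaptive adjacency matrix would come from.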
Detecting drifts in data streams using Kullback-Leibler (KL) divergence measure for data engineering applications
The exponential growth of data, coupled with the widespread application of artificial intelligence (AI), presents organizations with challenges in upholding data accuracy, especially within data engineering functions. While the Extraction, Transformation, and Loading (ETL) process addresses error-free data ingestion, validating the content within data streams remains a challenge. Prompt detection and remediation of data issues are crucial, especially in automated analytical environments driven by AI. To address these issues, this study focuses on detecting drifts in data distributions and divergence within data fields processed from different sample populations. Using a hypothetical banking scenario, we illustrate the impact of data drift on automated decision-making processes. We propose a scalable method leveraging the Kullback-Leibler (KL) divergence measure, specifically the Population Stability Index (PSI), to detect and quantify data drift. Through comprehensive simulations, we demonstrate the effectiveness of PSI in identifying and mitigating data drift issues. This study contributes to enhancing data engineering functions in organizations by offering a scalable solution for early drift detection in data ingestion pipelines. We discuss related research, identify gaps, and present the methodology and experimental results, underscoring the importance of robust data governance practices in mitigating risks associated with data drift and improving data observability.
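The PSI is a symmetrized, binned form of KL divergence and is simple to compute directly. The sketch below is our own minimal implementation, not the paper's code (names and the bin count are illustrative): it quantile-bins a baseline sample and compares bin fractions against a new sample. Values near 0 indicate stability, and values above roughly 0.25 are commonly read as significant drift:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a new sample.

    PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are
    the fractions of the expected and actual samples falling in bin i.
    """
    # Bin edges from the baseline distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    # Clip the new sample into the baseline range so outliers land in end bins
    a = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    # Floor the fractions to avoid division by zero and log(0)
    e = np.clip(e, 1e-6, None)
    a = np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
stable = rng.normal(0, 1, 10_000)        # same distribution
drifted = rng.normal(0.5, 1.2, 10_000)   # shifted mean and variance

print(psi(baseline, stable))   # small: no drift
print(psi(baseline, drifted))  # large: significant drift
```

Because it reduces each field to a handful of bin counts, PSI scales naturally to the pipeline setting the abstract describes: bin counts can be accumulated incrementally per batch and compared against a stored baseline.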
Overview of Wind and Photovoltaic Data Stream Classification and Data Drift Issues
Developments in clean energy, particularly wind and photovoltaic power, generate large volumes of data streams, and how to mine valuable information from these data to improve power generation efficiency has become a focus of current research. Traditional classification algorithms cannot cope with dynamically changing data streams, so data stream classification techniques are particularly important. Current data stream classification techniques mainly include decision trees, neural networks, Bayesian networks, and other methods, which have been applied to wind and photovoltaic power data processing in existing research. However, the data drift problem has become increasingly prominent due to dynamic changes in the data, and it significantly impacts the performance of classification algorithms. This paper reviews the latest research on data stream classification technology in wind power and photovoltaic applications. It provides a detailed introduction to the data drift problem in machine learning, which significantly affects algorithm performance. The discussion covers covariate drift, prior probability drift, and concept drift, analyzing their potential impact on the practical deployment of data stream classification methods in the wind and photovoltaic power sectors. Finally, by analyzing examples of addressing data drift in energy-system data stream classification, the article highlights the future prospects of data drift research in this field and suggests areas for improvement. Combined with the systematic overview of data stream classification techniques and data drift handling presented, it offers valuable insights for future research.
Time Series Data Modeling Using Advanced Machine Learning and AutoML
A prominent area of data analytics is time series modeling, in which future values of a variable are forecast from its past data. Numerous use cases, including the economy, the weather, stock prices, and the growth of a corporation, demonstrate its significance. This paper conducts experiments with time series forecasting utilizing machine learning (ML), deep learning (DL), and AutoML. Its primary contribution consists of addressing the forecasting problem by experimenting with additional ML and DL models and AutoML frameworks, thereby expanding the experimental knowledge of AutoML. In addition, it contributes by breaking down barriers found in past experimental studies in this field through the use of more sophisticated methods. The datasets this empirical research utilized were secondary quantitative data on the real prices of the currently most used cryptocurrencies. We found that AutoML for time series is still in the development stage and requires further study to be a viable solution, since it was unable to outperform manually designed ML and DL models. The demonstrated approaches may be utilized as a baseline for forecasting time series data.
Explainability and Interpretability in Concept and Data Drift: A Systematic Literature Review
Explainability and interpretability have emerged as essential considerations in machine learning, particularly as models become more complex and integral to a wide range of applications. In response to increasing concerns over opaque “black-box” solutions, the literature has seen a shift toward two distinct yet often conflated paradigms: explainable AI (XAI), which refers to post hoc techniques that provide external explanations for model predictions, and interpretable AI, which emphasizes models whose internal mechanisms are understandable by design. Meanwhile, the phenomenon of concept and data drift—where models lose relevance due to evolving conditions—demands renewed attention. High-impact events, such as financial crises or natural disasters, have highlighted the need for robust interpretable or explainable models capable of adapting to changing circumstances. Against this backdrop, our systematic review aims to consolidate current research on explainability and interpretability with a focus on concept and data drift. We gather a comprehensive range of proposed models, available datasets, and other technical aspects. By synthesizing these diverse resources into a clear taxonomy, we intend to provide researchers and practitioners with actionable insights and guidance for model selection, implementation, and ongoing evaluation. Ultimately, this work aspires to serve as a practical roadmap for future studies, fostering further advancements in transparent, adaptable machine learning systems that can meet the evolving needs of real-world applications.
A Fine-Grained Defect Prediction Method Based on Drift-Immune Graph Neural Networks
The primary goal of software defect prediction (SDP) is to pinpoint code modules that are likely to contain defects, thereby enabling software quality assurance teams to strategically allocate their resources and manpower. Within-project defect prediction (WPDP) is a widely used method in SDP. Despite various improvements, current methods still face challenges such as coarse-grained prediction and ineffective handling of data drift due to differences in project distributions. To address these issues, we propose a fine-grained SDP method called DIDP (drift-immune defect prediction), based on drift-immune graph neural networks (DI-GNN). DIDP converts source code into graph representations and uses DI-GNN to mitigate data drift at the model level. It also analyses the key statements leading to file defects, enabling a more detailed SDP approach. We evaluated the performance of DIDP in WPDP by examining its file-level and statement-level accuracy against state-of-the-art methods, as well as its cross-project prediction accuracy. The experimental results show that DIDP achieved significant improvements in F1-score and Recall@Top20%LOC compared to existing methods, even across large software version changes. DIDP also performed well in cross-project SDP. Our study demonstrates that DIDP achieves impressive prediction results in WPDP, effectively mitigating data drift and accurately predicting defective files. Additionally, DIDP can rank the risk of statements in defective files, aiding developers and testers in identifying potential code issues.
AML4S: An AutoML Pipeline for Data Streams
The data landscape has changed, as more and more information is produced in the form of continuous data streams instead of stationary datasets. In this context, several online machine learning techniques have been proposed with the aim of automatically adapting to changes in data distributions, known as drifts. Though effective in certain scenarios, contemporary techniques do not generalize well to different types of data and require manual parameter tuning, significantly hindering their applicability. Moreover, current methods do not thoroughly address drifts, as they mostly focus on concept drifts (distribution shifts in the target variable) rather than data drifts (changes in feature distributions). To confront these challenges, in this paper we propose an AutoML Pipeline for Streams (AML4S), which automates the choice of preprocessing techniques, the choice of machine learning models, and the tuning of hyperparameters. Our pipeline further includes a drift detection mechanism that identifies different types of drifts, thereby continuously adapting the underlying models. We assess our pipeline on several real and synthetic data streams, including a data stream we crafted to focus on data drifts. Our results indicate that AML4S produces robust pipelines and outperforms existing online learning and AutoML algorithms.