46 result(s) for "Ravishanker, Nalini"
A fast privacy-preserving patient record linkage of time series data
Recent advances in technology have led to an explosion of data in virtually all domains of our lives. Modern biomedical devices can acquire a large number of physical readings from patients, and these readings are often stored as time series data. Such time series data can form the basis for important research to advance healthcare and well-being. Due to several considerations, including data size and patient privacy, the original, full data may not be available to secondary parties or researchers. Instead, suppose that a subset of the data is made available. A fast and reliable record linkage algorithm enables us to accurately match patient records in the original and subset databases while maintaining privacy. The problem of record linkage when the attributes include time series has not been studied much in the literature. We introduce two main contributions in this paper. First, we propose a novel, very efficient, and scalable record linkage algorithm for time series data. This algorithm is 400× faster than previous work. Second, we introduce a privacy-preserving framework that enables health institutions to safely release their raw time series records to researchers with a bare minimum of identifying information.
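The abstract above describes matching records between an original database and a released subset. As a toy illustration (not the authors' algorithm), the sketch below links records by comparing coarse per-chunk summary features, standing in for the kind of reduced, less-identifying representation a fast linkage scheme might compare; all names and parameters are hypothetical.

```python
import numpy as np

def summary_features(series, n_chunks=4):
    """Coarse per-chunk means: a low-dimensional summary that could be
    shared in place of the raw series (illustrative only)."""
    chunks = np.array_split(np.asarray(series, dtype=float), n_chunks)
    return np.array([c.mean() for c in chunks])

def link_records(original, subset):
    """Match each subset record to its nearest original record in
    summary-feature space."""
    orig_feats = np.stack([summary_features(s) for s in original])
    links = {}
    for j, s in enumerate(subset):
        dists = np.linalg.norm(orig_feats - summary_features(s), axis=1)
        links[j] = int(np.argmin(dists))
    return links

rng = np.random.default_rng(0)
db = [np.cumsum(rng.normal(size=200)) for _ in range(5)]
# The "released" subset: records 3 and 1, slightly perturbed.
released = [db[3] + rng.normal(scale=0.01, size=200),
            db[1] + rng.normal(scale=0.01, size=200)]
print(link_records(db, released))  # {0: 3, 1: 1}
```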
Forecasting Robust Gaussian Process State Space Models for Assessing Intervention Impact in Internet of Things Time Series
This article describes a robust Gaussian process state space modeling (GPSSM) approach to assess the impact of an intervention in a time series. Numerous applications can benefit from this approach. Examples include: (1) the time series could be the stock price of a company, and the intervention could be the acquisition of another company; (2) the time series could be the noise coming out of an engine, and the intervention could be a corrective step taken to reduce the noise; (3) the time series could be the number of visits to a web service, and the intervention could be changes made to the user interface; and so on. The approach we describe in this article applies to any time series and intervention combination. It is well known that Gaussian process (GP) prior models provide flexibility by placing a non-parametric prior on the functional form of the model. While GPSSMs enable us to model a time series in a state space framework by placing a GP prior over the state transition function, probabilistic recurrent state space models (PRSSMs) employ variational approximations to handle the complicated posterior distributions in GPSSMs. The robust PRSSMs (R-PRSSMs) discussed in this article assume a scale mixture of normal distributions instead of the usually proposed normal distribution. This assumption accommodates heavy-tailed behavior and anomalous observations in the time series. Given an exogenous intervention, we use the R-PRSSM for Bayesian fitting and forecasting of an IoT time series, and by comparing forecasts with the subsequently observed values, we can assess the impact of the intervention with a high level of confidence. To illustrate our techniques clearly, we employ a concrete example.
The time series of interest is an Internet of Things (IoT) stream of internal temperatures measured by an insurance firm to address the risk of pipe-freeze hazard in a building. We treat the pipe-freeze hazard alert as an exogenous intervention. A comparison of forecasts with the subsequently observed temperatures is used to assess whether an alerted customer took preventive action to avoid pipe-freeze loss.
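The robust noise assumption mentioned above, a scale mixture of normals in place of a normal, can be illustrated numerically: scaling Gaussian draws by inverse-gamma variates yields Student-t draws with much heavier tails. A minimal sketch of that construction (not the R-PRSSM itself):

```python
import numpy as np

# A Student-t draw arises as a scale mixture of normals:
# w ~ InverseGamma(nu/2, nu/2), then x | w ~ Normal(0, w).
# This is the kind of heavy-tailed noise model a robust PRSSM assumes.
rng = np.random.default_rng(1)
nu = 3.0
n = 100_000
w = 1.0 / rng.gamma(shape=nu / 2, scale=2.0 / nu, size=n)  # inverse-gamma scales
x = rng.normal(size=n) * np.sqrt(w)                        # heavy-tailed (t_nu) draws
g = rng.normal(size=n)                                     # Gaussian baseline

# Heavy tails: far more mass beyond 4 standard units than a normal.
print((np.abs(x) > 4).mean(), (np.abs(g) > 4).mean())
```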
Randomized feature selection based semi-supervised latent Dirichlet allocation for microbiome analysis
Health and disease are fundamentally influenced by microbial communities and their genes (the microbiome). An in-depth analysis of microbiome structure that enables the classification of individuals based on their health can be crucial in enhancing diagnostics and treatment strategies to improve the overall well-being of an individual. In this paper, we present a novel semi-supervised methodology known as Randomized Feature Selection based Latent Dirichlet Allocation (RFSLDA) to study the impact of the gut microbiome on a subject’s health status. Since the data in our study consist of fuzzy, self-reported health labels, traditional supervised learning approaches may not be suitable. As a first step, based on the similarity between documents in text analysis and gut-microbiome data, we employ Latent Dirichlet Allocation (LDA), a topic modeling approach that uses microbiome counts as features to group subjects into relatively homogeneous clusters, without invoking any knowledge of the observed health status (labels) of subjects. We then leverage information from the observed health status of subjects to associate these clusters with the most similar health status, making the approach semi-supervised. Finally, a feature selection technique is incorporated into the model to improve the overall classification performance. The proposed method provides a semi-supervised topic modeling approach that helps handle the high dimensionality of microbiome data in association studies. Our experiments reveal that our semi-supervised classification algorithm is effective and efficient, achieving higher classification accuracy than popular supervised learning approaches such as SVM and the multinomial logistic model.
The RFSLDA framework is attractive because it (i) enhances clustering accuracy by identifying key bacteria types as indicators of health status, (ii) identifies key bacteria types within each group based on estimates of the proportion of bacteria types within the groups, and (iii) computes a measure of within-group similarity to identify highly similar subjects in terms of their health status.
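The semi-supervised step described above, associating unsupervised clusters with the most similar observed health status, can be sketched as a majority vote within each cluster. This is an illustrative fragment only, not the full RFSLDA procedure; the names are hypothetical.

```python
from collections import Counter

def label_clusters(cluster_of, label_of):
    """Map each unsupervised cluster to the most common observed health
    label among its members (the semi-supervised association step)."""
    members = {}
    for subj, c in cluster_of.items():
        members.setdefault(c, []).append(label_of[subj])
    return {c: Counter(labels).most_common(1)[0][0]
            for c, labels in members.items()}

clusters = {"s1": 0, "s2": 0, "s3": 1, "s4": 1, "s5": 1}
labels   = {"s1": "healthy", "s2": "healthy",
            "s3": "ill", "s4": "ill", "s5": "healthy"}
print(label_clusters(clusters, labels))  # {0: 'healthy', 1: 'ill'}
```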
Feature Construction Using Persistence Landscapes for Clustering Noisy IoT Time Series
With the advancement of IoT technologies, there is a large amount of data available from wireless sensor networks (WSN), particularly for studying climate change. Clustering long and noisy time series has become an important research area for analyzing these data. This paper proposes a feature-based clustering approach using topological data analysis, a set of methods for finding topological structure in data. Persistence diagrams and landscapes are popular topological summaries that can be used to cluster time series. This paper presents a framework for selecting an optimal number of persistence landscapes and using them as features in an unsupervised learning algorithm. This approach reduces computational cost while maintaining accuracy. The clustering approach was demonstrated to be accurate on simulated data, using only four, three, and three features in Scenarios 1–3, respectively. On real data, consisting of multiple long temperature streams from various US locations, our optimal feature selection method achieved approximately a 13-fold speed-up in computation.
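Persistence landscapes, the topological summaries used as features above, have a simple definition: the k-th landscape at t is the k-th largest "tent" value min(t − b, d − t)+ over the points (b, d) of a persistence diagram. A direct, unoptimized evaluation for illustration only:

```python
import numpy as np

def landscape(diagram, k, grid):
    """k-th persistence landscape on a grid: lambda_k(t) is the k-th
    largest tent value max(min(t - b, d - t), 0) over (birth, death)
    points of the diagram (standard definition; simplified code)."""
    vals = []
    for t in grid:
        tents = sorted((max(min(t - b, d - t), 0.0) for b, d in diagram),
                       reverse=True)
        vals.append(tents[k - 1] if k <= len(tents) else 0.0)
    return np.array(vals)

diagram = [(0.0, 2.0), (1.0, 3.0)]       # two (birth, death) pairs
grid = np.linspace(0.0, 3.0, 7)
lam1 = landscape(diagram, 1, grid)       # first (largest) landscape
lam2 = landscape(diagram, 2, grid)       # second landscape
print(lam1.max(), lam2.max())            # 1.0 0.5
```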
Supervised Dynamic Correlated Topic Model for Classifying Categorical Time Series
In this paper, we describe the supervised dynamic correlated topic model (sDCTM) for classifying categorical time series. This model extends the correlated topic model used for analyzing textual documents to a supervised framework that features dynamic modeling of latent topics. sDCTM treats each time series as a document and each categorical value in the time series as a word in the document. We assume that the observed time series is generated by an underlying latent stochastic process. We develop a state-space framework to model the dynamic evolution of the latent process, i.e., the hidden thematic structure of the time series. Our model provides a Bayesian supervised learning (classification) framework using a variational Kalman filter EM algorithm, in which the E-step approximates the posterior distribution of the latent variables and the M-step estimates the model parameters. The fitted model is then used to classify new time series and for information retrieval useful to practitioners. We assess our method using simulated data. As an illustration on real data, we apply our method to promoter sequence identification data, classifying E. coli DNA sub-sequences by uncovering hidden patterns or motifs that can serve as markers for promoter presence.
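The document analogy above, in which each categorical value in a series is a "word", amounts to representing a series by its vector of category counts before any topic modeling. A minimal sketch of that representation (illustrative, not sDCTM itself):

```python
from collections import Counter

def series_to_counts(series, vocab):
    """Treat a categorical time series as a 'document': each observed
    category is a 'word', and the series becomes a count vector over
    the vocabulary."""
    counts = Counter(series)
    return [counts.get(w, 0) for w in vocab]

vocab = ["a", "c", "g", "t"]          # e.g. a DNA alphabet
seq = "acgtacgaat"                    # a toy categorical series
print(series_to_counts(seq, vocab))   # [4, 2, 2, 2]
```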
Subsampling Algorithms for Irregularly Spaced Autoregressive Models
With the exponential growth of data across diverse fields, applying conventional statistical methods directly to large-scale datasets has become computationally infeasible. To overcome this challenge, subsampling algorithms are widely used to perform statistical analyses on smaller, more manageable subsets of the data. The effectiveness of these methods depends on their ability to identify and select data points that improve estimation efficiency according to some optimality criterion. While much of the existing research has focused on subsampling techniques for independent data, there is considerable potential for developing methods tailored to dependent data, particularly in time-dependent contexts. In this study, we extend subsampling techniques to irregularly spaced time series data, which we model with irregularly spaced autoregressive models. We present frameworks for various subsampling approaches, including optimal subsampling under A-optimality, information-based optimal subdata selection, and sequential thinning on streaming data. These methods use A-optimality or D-optimality criteria to assess the usefulness of each data point and prioritize the inclusion of the most informative ones. We then assess the performance of these subsampling methods using numerical simulations, providing insights into their suitability and effectiveness for handling irregularly spaced long time series. Numerical results show that our algorithms perform well: their estimation efficiency can be ten times that of the uniform sampling estimator, and they significantly reduce computational time, running up to forty times faster than the full-data estimator.
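The general shape of such subsampling schemes — score each observation, sample with probability proportional to its score, then reweight the subsample estimator — can be sketched on an ordinary linear regression. This is a generic inverse-probability illustration, not the paper's irregularly spaced AR estimator, and the row-norm score below is only a stand-in for a true A- or D-optimality score.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 20_000, 3, 2_000           # full size, dimension, subsample size
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(size=n)

# Score rows (here: by norm, standing in for an optimality-criterion
# score) and sample a subset with probability proportional to the score.
scores = np.linalg.norm(X, axis=1)
probs = scores / scores.sum()
idx = rng.choice(n, size=m, replace=False, p=probs)

# Inverse-probability-weighted least squares on the subsample.
w = 1.0 / probs[idx]
Xw = X[idx] * w[:, None]
beta_hat = np.linalg.solve(X[idx].T @ Xw, Xw.T @ y[idx])
print(beta_hat)                       # close to [1.0, -2.0, 0.5]
```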
Latent Autoregressive Student-t Prior Process Models to Assess Impact of Interventions in Time Series
With the advent of the “Internet of Things” (IoT), insurers are increasingly leveraging remote sensor technology in the development of novel insurance products and risk management programs. For example, Hartford Steam Boiler’s (HSB) IoT freeze loss program uses IoT temperature sensors to monitor indoor temperatures in locations at high risk of water-pipe burst (freeze loss), with the goal of reducing insurance losses via real-time monitoring of the temperature data streams. In the event these monitoring systems detect a potentially risky temperature environment, an alert is sent to the end-insured (business manager, tenant, maintenance staff, etc.), prompting them to take remedial action by raising temperatures. In the event that an alert is sent and freeze loss occurs, the firm is not liable for any damages incurred by the event. For the program to be effective, there must be a reliable method of verifying whether customers took appropriate corrective action after receiving an alert. Due to the program’s scale, direct follow-up via text or phone calls is not possible for every alert event. In addition, direct feedback from customers is not necessarily reliable. In this paper, we propose the use of a non-linear, autoregressive time series model, coupled with the time series intervention analysis method known as causal impact, to evaluate directly from IoT temperature streams whether or not a customer took action. Our method offers several distinct advantages over other methods, as it is (a) readily scalable with continued program growth, (b) entirely automated, and (c) inherently less biased than human labelers or direct customer responses. We demonstrate the efficacy of our method using a sample of actual freeze alert events from the freeze loss program.
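The verification idea above — forecast the "no action" temperature path and check whether observations sit above it — can be caricatured with a simple AR(1) fit in place of the paper's non-linear model and causal-impact analysis. All details below are illustrative assumptions, not the authors' method.

```python
import numpy as np

def ar1_fit(x):
    """Least-squares AR(1) fit: x[t] = c + phi * x[t-1] + noise."""
    X = np.column_stack([np.ones(len(x) - 1), x[:-1]])
    c, phi = np.linalg.lstsq(X, x[1:], rcond=None)[0]
    sigma = (x[1:] - X @ np.array([c, phi])).std()
    return c, phi, sigma

def action_taken(pre, post, k=3.0):
    """Flag remedial action when the mean post-alert temperature sits
    well above the model's 'no action' long-run level (a crude stand-in
    for a causal-impact analysis)."""
    c, phi, sigma = ar1_fit(pre)
    level = c / (1.0 - phi)           # long-run mean under no action
    return float(np.mean(post)) > level + k * sigma

# Toy pre-alert temperatures: AR(1) around 4 degrees C.
rng = np.random.default_rng(3)
pre = [4.0]
for _ in range(300):
    pre.append(0.8 + 0.8 * pre[-1] + rng.normal(scale=0.5))
pre = np.array(pre)

acted = 15.0 + rng.normal(scale=0.5, size=50)    # heat was raised
ignored = 4.0 + rng.normal(scale=0.5, size=50)   # alert ignored
print(action_taken(pre, acted), action_taken(pre, ignored))
```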
Investigating the Joint Probability of High Coastal Sea Level and High Precipitation
The design strategies for flood risk reduction in coastal towns must be informed by the likelihood of flooding resulting from both precipitation and coastal storm surge. This paper discusses various bivariate extreme value methods to investigate the joint probability of the exceedance of thresholds in both precipitation and sea level and estimate their dependence structure. We present the results of the dependence structure obtained using the observational record at Bridgeport, CT, a station with long data records representative of coastal Connecticut. Furthermore, we evaluate the dependence structure after removing the effects of harmonics in the sea level data. Through this comprehensive analysis, our study seeks to contribute to the understanding of the joint occurrence of sea level and precipitation extremes, providing insights that are crucial for effective coastal management.
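The notion of joint exceedance studied above can be illustrated empirically: when sea level and precipitation share a common storm driver, the probability that both exceed their 95th percentiles is much larger than the 0.05 × 0.05 = 0.0025 expected under independence. A toy simulation (not the Bridgeport data):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
storm = rng.normal(size=n)                     # shared storm driver
sea = 0.7 * storm + 0.7 * rng.normal(size=n)   # toy "sea level"
rain = 0.7 * storm + 0.7 * rng.normal(size=n)  # toy "precipitation"

u_sea = np.quantile(sea, 0.95)
u_rain = np.quantile(rain, 0.95)
joint = np.mean((sea > u_sea) & (rain > u_rain))
# Joint exceedance sits far above the 0.0025 expected under independence.
print(joint, 0.05 * 0.05)
```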
Modeling Customer Lifetime Value
As modern economies become predominantly service-based, companies increasingly derive revenue from the creation and sustenance of long-term relationships with their customers. In such an environment, marketing serves the purpose of maximizing customer lifetime value (CLV) and customer equity, which is the sum of the lifetime values of the company’s customers. This article reviews a number of implementable CLV models that are useful for market segmentation and the allocation of marketing resources for acquisition, retention, and cross-selling. The authors review several empirical insights that were obtained from these models and conclude with an agenda of areas that are in need of further research.
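A textbook starting point for the CLV models reviewed above is the survival-weighted, discounted sum of per-period margins. The sketch below shows only this standard formula; the article's models are considerably richer.

```python
def clv(margin, retention, discount, horizon=100):
    """Textbook customer lifetime value:
    CLV = sum_t margin * retention**t / (1 + discount)**t,
    i.e. expected margin each period, survival-weighted and discounted."""
    return sum(margin * retention**t / (1 + discount)**t
               for t in range(1, horizon + 1))

# $50 margin per period, 80% retention, 10% discount rate; the
# infinite-horizon value is 50 * 0.8 / (1.1 - 0.8) = 133.33.
print(round(clv(50, 0.8, 0.1), 2))
```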
Ensemble Hindcasting of Coastal Wave Heights
Long records of wave parameters are central to the estimation of coastal flooding risk and the causes of coastal erosion. This paper leverages the predictive power of wave height history and correlations with wind speed and direction to build statistical models for time series of wave heights, developing a method to fill data gaps and extend the record length of coastal wave observations. A threshold regression model is built in which the threshold parameter, based on lagged wind speed, explains the nonlinear associations, and the lagged predictors in the model are based on a well-established empirical wind-wave relationship. The predictive model is completed by addressing the residual conditional heteroscedasticity using a GARCH model. This comprehensive model is trained on time series data from 2005 to 2013, using wave height and wind data, both observed from a buoy in Long Island Sound. Subsequently, replacing wind data with observations from a nearby coastal station provides a similar level of predictive accuracy. This approach can be used to hindcast wave heights for past decades given only wind information at a coastal station. These hindcasts are used as a representative of the unobserved past to carry out extreme value analysis by fitting a Generalized Pareto (GP) distribution in a peaks-over-threshold (POT) framework. By analyzing longer periods of data, we can obtain reliable return value estimates to help design better coastal protection structures.
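The extreme-value step above — fitting a Generalized Pareto distribution to peaks over a threshold — can be sketched with a simple method-of-moments fit (a POT analysis like the paper's would typically use maximum likelihood; the synthetic "wave heights" below are purely illustrative).

```python
import numpy as np

def gp_mom(exceedances):
    """Method-of-moments Generalized Pareto fit to threshold
    exceedances: with sample mean m and variance v,
    xi = (1 - m^2/v) / 2 and sigma = m * (1 - xi)."""
    m, v = exceedances.mean(), exceedances.var()
    xi = 0.5 * (1.0 - m * m / v)
    sigma = m * (1.0 - xi)
    return xi, sigma

rng = np.random.default_rng(5)
waves = rng.gumbel(loc=2.0, scale=0.5, size=50_000)  # toy wave heights
u = np.quantile(waves, 0.95)              # peaks-over-threshold cut
exc = waves[waves > u] - u                # exceedances over threshold
xi_hat, sigma_hat = gp_mom(exc)
# Gumbel-tailed data should give xi near 0 and sigma near the scale 0.5.
print(round(xi_hat, 3), round(sigma_hat, 3))
```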