Catalogue Search | MBRL
Explore the vast range of titles available.
19,361 result(s) for "data inference"
Virtual Collection for Distributed Photovoltaic Data: Challenges, Methodologies, and Applications
2022
In recent years, with the rapid development of distributed photovoltaic systems (DPVS), the shortage of data monitoring devices and the difficulty of achieving comprehensive coverage with measurement equipment have become more significant, posing great challenges to the efficient management and maintenance of DPVS. Virtual collection is a new, cost-effective, and computationally efficient DPVS data collection scheme that meets the needs of distributed energy management but has so far received little attention and research. To fill this gap, this paper provides a comprehensive and systematic review of DPVS virtual collection. We give a detailed introduction to the process of DPVS virtual collection and identify the challenges faced by virtual collection through problem analogy. In response to these challenges, the paper summarizes the main methods applicable to virtual collection, including similarity analysis, reference station selection, and PV data inference. Finally, the paper thoroughly discusses the diversified application scenarios of virtual collection, aiming to provide helpful information for the development of the DPVS industry.
Journal Article
Applications of Deep Learning to Ocean Data Inference and Subgrid Parameterization
2019
Oceanographic observations are limited by sampling rates, while ocean models are limited by finite resolution and high viscosity and diffusion coefficients. Therefore, both data from observations and ocean models lack information at small and fast scales. Methods are needed to either extract information, extrapolate, or upscale existing oceanographic data sets, to account for or represent unresolved physical processes. Here we use machine learning to leverage observations and model data by predicting unresolved turbulent processes and subsurface flow fields. As a proof of concept, we train convolutional neural networks on degraded data from a high‐resolution quasi‐geostrophic ocean model. We demonstrate that convolutional neural networks successfully replicate the spatiotemporal variability of the subgrid eddy momentum forcing, are capable of generalizing to a range of dynamical behaviors, and can be forced to respect global momentum conservation. The training data of our convolutional neural networks can be subsampled to 10–20% of the original size without a significant decrease in accuracy. We also show that the subsurface flow field can be predicted using only information at the surface (e.g., using only satellite altimetry data). Our results indicate that data‐driven approaches can be exploited to predict both subgrid and large‐scale processes, while respecting physical principles, even when data are limited to a particular region or external forcing. Our in‐depth study presents evidence for the successful design of ocean eddy parameterizations for implementation in coarse‐resolution climate models.
Plain Language Summary: Models of the ocean and ocean observations are imperfect. Due to this imperfection, simulations of the ocean and our observations are not quite the same as the true ocean currents. We, therefore, need ways to make our ocean data more realistic and complete and to make it more similar to the actual ocean. Scientists have traditionally approached this problem in a pen‐and‐paper style, considering physical theories and mechanisms. This study instead uses machine learning, which focuses on data as opposed to equations on a blackboard. We successfully use a particular type of machine learning algorithm, called a convolutional neural network, to make the most of current oceanographic data. This type of neural network works well even if ocean data are limited to a particular area. Future work will involve combining machine learning with physical theories of the ocean.
Key Points: We successfully use convolutional neural networks to predict unresolved turbulent processes and subsurface velocities. The neural networks generalize to different regions, dynamical regimes, and forcing. Global momentum conservation for eddy parameterization can be respected without sacrificing accuracy.
Journal Article
EM-AUC: A Novel Algorithm for Evaluating Anomaly Based Network Intrusion Detection Systems
by Bai, Kevin Z.; Fossaceca, John M.
in Algorithms; Area Under the Precision-Recall Curve; Area Under the ROC Curve
2025
Effective network intrusion detection using anomaly scores from unsupervised machine learning models depends on the performance of the models. Although unsupervised models do not require labels during the training and testing phases, the assessment of their performance metrics during the evaluation phase still requires comparing anomaly scores against labels. In real-world scenarios, the absence of labels in massive network datasets makes it infeasible to calculate performance metrics. Therefore, it is valuable to develop an algorithm that calculates robust performance metrics without using labels. In this paper, we propose a novel algorithm, Expectation Maximization-Area Under the Curve (EM-AUC), to derive the Area Under the ROC Curve (AUC-ROC) and the Area Under the Precision-Recall Curve (AUC-PR) by treating the unavailable labels as missing data and replacing them through their posterior probabilities. This algorithm was applied to two network intrusion datasets, yielding robust results. To the best of our knowledge, this is the first time AUC-ROC and AUC-PR, derived without labels, have been used to evaluate network intrusion detection systems. The EM-AUC algorithm enables model training, testing, and performance evaluation to proceed without comprehensive labels, offering a cost-effective and scalable solution for selecting the most effective models for network intrusion detection.
Journal Article
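The core idea of the abstract above — treat the missing labels as latent variables, estimate their posterior probabilities, and evaluate with those posteriors in place of labels — can be sketched in a few lines. This is a minimal illustration, not the authors' exact EM-AUC algorithm: it assumes a two-component univariate Gaussian mixture over the anomaly scores and a pairwise soft-label AUC-ROC, both of which are the sketch's own modelling choices.

```python
import numpy as np

def em_two_gaussians(scores, n_iter=200):
    """Fit a 2-component 1-D Gaussian mixture to anomaly scores via EM and
    return P(anomaly | score); the higher-mean component is taken as anomalous."""
    s = np.asarray(scores, dtype=float)
    mu = np.percentile(s, [10, 90]).astype(float)   # crude initialisation
    sd = np.array([s.std(), s.std()]) + 1e-9
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each score
        dens = np.exp(-0.5 * ((s[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights, means, standard deviations
        nk = r.sum(axis=0)
        pi = nk / len(s)
        mu = (r * s[:, None]).sum(axis=0) / nk
        sd = np.sqrt((r * (s[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9
    return r[:, int(np.argmax(mu))]

def soft_auc_roc(scores, p_anom):
    """AUC-ROC with posteriors standing in for labels: probability-weighted
    fraction of correctly ordered (anomalous, normal) score pairs."""
    s = np.asarray(scores, dtype=float)
    w = np.asarray(p_anom, dtype=float)
    num = den = 0.0
    for i in range(len(s)):
        for j in range(len(s)):
            pair = w[i] * (1.0 - w[j])          # weight of (anomaly i, normal j)
            den += pair
            if s[i] > s[j]:
                num += pair
            elif s[i] == s[j]:
                num += 0.5 * pair
    return num / den
```

With well-separated score populations, the posterior-weighted AUC closely tracks the AUC one would compute if labels were available, which is the property the paper exploits.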
Data analytics to advance the inference of origin–destination in public transport systems: tracing network vulnerabilities and age-sensitive trip purposes
by Arsenio, Elisabete; Barateiro, José; Henriques, Rui
in Adults; Alighting data inference; Automotive Engineering
2025
Knowing the passengers’ final destinations, underlying motifs, and commuting habits is critical to optimise public transportation systems, guide urban planning and contribute to more sustainable urban mobility. In entry-only Automated Fare Collection systems, the body of literature has focused on the spatial dimension by estimating alighting stops, overlooking the inference of robust alighting times. Moreover, discriminating between transfers and activities is pivotal for determining passengers’ ultimate destinations. However, current methods often struggle to adapt to the stochastic nature of passenger behaviour, further disregarding the multiplicity of routes and stops available to access specific facilities, as well as individual motivations. Further research is required to address effective spatio-temporal and contextual inference for both challenges. With the above concerns in mind, this research uses data analytics to propose an enhanced methodology for the inference of OD matrices, with the final goal of providing a comprehensive view of OD mobility patterns across distinct age-sensitive profiles—youth, adults, and older adults. Our methodological framework integrates the following approaches: (i) alighting stop-and-time inference, (ii) an ensemble model for transfer classification, (iii) indicators retrieved from statistical analysis of network vulnerabilities (e.g., number of transfers, walkability needs), frequent destinations and their underlying putative motifs against city amenities and other points-of-interest. The reliability of alighting data (timestamp and location) inference is improved by integrating OpenStreetMap data and past boarding data from bus and railway systems. Considering Lisbon as the target study case, we apply the methodology over smart card data collected from both metro and bus systems. A comparative analysis with state-of-the-art methods revealed that the enhanced framework for alighting and OD inference led to longer journey times for trips.
Furthermore, throughout the day, the older adult group experiences longer transfer times on average compared to both the children and young adult segment and the adult segment.
Journal Article
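The alighting-stop inference problem described in this abstract is commonly bootstrapped with trip chaining: the alighting stop of one trip is taken as the stop on that trip's route nearest to the card's next boarding stop. The sketch below shows only that classic baseline (the paper's enhanced spatio-temporal method is richer); Euclidean distance on coordinates and the day-wrapping rule for the last trip are simplifying assumptions.

```python
import math

def nearest_stop(route_stops, target, coords):
    """Stop on the route closest (plain Euclidean distance on the coordinate
    pairs, for this sketch) to the target stop."""
    tx, ty = coords[target]
    return min(route_stops,
               key=lambda s: math.hypot(coords[s][0] - tx, coords[s][1] - ty))

def infer_alightings(daily_trips, coords):
    """Trip-chaining baseline. daily_trips: chronological list of
    (boarding_stop, route_stops) for one smart card. The alighting of trip k
    is estimated as the stop on its route nearest to the next boarding; the
    last trip of the day chains back to the first boarding."""
    alightings = []
    n = len(daily_trips)
    for k, (board, route) in enumerate(daily_trips):
        next_board = daily_trips[(k + 1) % n][0]   # wrap last trip -> first boarding
        alightings.append(nearest_stop(route, next_board, coords))
    return alightings
```

For a card that boards at A on a route A–B–C–D and later boards at C, the baseline infers the first alighting at C; real systems then need the transfer-vs-activity classification the abstract discusses.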
Comparison of home detection algorithms using smartphone GPS data
2024
Estimation of people’s home locations using location-based services data from smartphones is a common task in human mobility assessment. However, commonly used home detection algorithms (HDAs) are often arbitrary and unexamined. In this study, we review existing HDAs and examine five HDAs using eight high-quality mobile phone geolocation datasets. These include four commonly used HDAs as well as an HDA proposed in this work. To make quantitative comparisons, we propose three novel metrics to assess the quality of detected home locations and test them on eight datasets across four U.S. cities. We find that all three metrics show a consistent rank of HDAs’ performances, with the proposed HDA outperforming the others. We infer that the temporal and spatial continuity of the geolocation data points matters more than the overall size of the data for accurate home detection. We also find that HDAs with high (and similar) performance metrics tend to create results with better consistency and closer to common expectations. Further, the performance deteriorates with decreasing data quality of the devices, though the patterns of relative performance persist. Finally, we show how the differences in home detection can lead to substantial differences in subsequent inferences using two case studies—(i) hurricane evacuation estimation, and (ii) correlation of mobility patterns with socioeconomic status. Our work contributes to improving the transparency of large-scale human mobility assessment applications.
Journal Article
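One of the "commonly used HDAs" this abstract compares can be stated very compactly: snap each nighttime GPS fix to a coarse grid cell and declare the most frequented cell the home. The sketch below illustrates only that generic baseline, not the authors' proposed HDA or their quality metrics; the night window and the ~0.001-degree cell size are arbitrary choices for illustration.

```python
from collections import Counter

def detect_home(points, night=(22, 6), cell=0.001):
    """Baseline home-detection: count nighttime fixes per grid cell
    (cell=0.001 deg, roughly 100 m at mid-latitudes) and return the
    coordinates of the most frequented cell.
    points: iterable of (lat, lon, hour_of_day)."""
    counts = Counter()
    for lat, lon, hour in points:
        if hour >= night[0] or hour < night[1]:      # nighttime filter
            counts[(round(lat / cell), round(lon / cell))] += 1
    if not counts:
        return None                                  # no nighttime data
    (ci, cj), _ = counts.most_common(1)[0]
    return ci * cell, cj * cell
```

The abstract's finding that temporal and spatial continuity matters more than raw data volume is visible even in this baseline: a handful of consistent nighttime fixes in one cell outweighs scattered daytime data.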
The computing of the Poisson multinomial distribution and applications in ecological inference and machine learning
2023
The Poisson multinomial distribution (PMD) describes the distribution of the sum of n independent but non-identically distributed random vectors, in which each random vector is of length m with 0/1 valued elements and only one of its elements can take value 1 with a certain probability. Those probabilities are different for the m elements across the n random vectors, and form an n×m matrix with row sums equal to 1. We call this n×m matrix the success probability matrix (SPM). Each SPM uniquely defines a PMD. The PMD is useful in many areas such as voting theory, ecological inference, and machine learning. The distribution functions of the PMD, however, are usually difficult to compute, and there is no efficient algorithm available for computing them. In this paper, we develop efficient methods to compute the probability mass function (pmf) for the PMD using multivariate Fourier transform, normal approximation, and simulations. We study the accuracy and efficiency of those methods and give recommendations for which methods to use under various scenarios. We also illustrate the use of the PMD via three applications, namely, in ecological inference, uncertainty quantification in classification, and voting probability calculation. We build an R package that implements the proposed methods, and illustrate the package with examples. This paper has online supplementary materials.
Journal Article
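For small n and m, the PMD pmf defined above can be computed exactly by convolving the n categorical vectors one row of the SPM at a time. This brute-force convolution is not one of the paper's efficient methods (which use Fourier transforms, normal approximation, and simulation); it is a minimal exact reference implementation useful for checking them on toy cases.

```python
def pmd_pmf(spm):
    """Exact pmf of the Poisson multinomial distribution.
    spm: n x m success probability matrix (list of lists), each row summing
    to 1. Returns a dict mapping a count vector (tuple of length m, summing
    to n) to its probability."""
    pmf = {tuple([0] * len(spm[0])): 1.0}   # before any draws: all counts zero
    for row in spm:
        new = {}
        for counts, p in pmf.items():
            for j, pj in enumerate(row):    # draw j with probability row[j]
                if pj == 0.0:
                    continue
                c = list(counts)
                c[j] += 1
                key = tuple(c)
                new[key] = new.get(key, 0.0) + p * pj
        pmf = new
    return pmf
```

With identical rows [0.5, 0.5] the PMD collapses to a binomial on the first coordinate, which makes a convenient sanity check.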
A flexible linear temporal logic-based data inference architecture for industrial process prediction
by Tang, Qianneng; Liu, Qiqi; Zhang, Haoyu
in Accuracy; Artificial neural networks; Belief networks
2026
Artificial neural networks, particularly recurrent architectures like Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), are widely used for time series prediction. However, achieving high accuracy, especially for complex industrial process data, remains challenging, as standard models may not explicitly capture local temporal trends effectively. This paper proposes a novel data inference structure based on Propositional Linear Temporal Logic (PLTL) designed to capture qualitative data trends within a sliding window. This PLTL inference structure is integrated into LSTM and GRU networks, creating two new models: L-LSTM and L-GRU. The PLTL module provides an interpretable, logic-based representation of recent data dynamics, which modulates the standard recurrent computations, enabling the networks to learn temporal dependencies more effectively, as demonstrated by empirical results. The proposed methods are evaluated on the TAIEX financial benchmark dataset and real-world data from a textile manufacturing process. Experimental results, evaluated using Root Mean Square Error (RMSE), indicate that the L-LSTM and L-GRU models demonstrate statistically significant improved prediction accuracy compared to baseline LSTM, GRU, Deep Belief Networks (DBN), and a Fuzzy Time Series-LSTM (FTS-LSTM) hybrid model on the benchmark dataset, and show strong performance on the industrial data.
Journal Article
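The "qualitative data trends within a sliding window" that this abstract feeds into its recurrent models can be pictured as simple temporal-logic predicates over the window. The sketch below is a loose illustration of that idea only: the step predicates and the G/F-style summaries are this sketch's own choices, not the paper's PLTL formulation or its integration into LSTM/GRU cells.

```python
def trend_propositions(window, eps=1e-9):
    """Qualitative trend predicates over a sliding window of values:
    label each consecutive step 'up', 'down' or 'flat', then summarise
    with globally/eventually style propositions."""
    steps = []
    for a, b in zip(window, window[1:]):
        if b - a > eps:
            steps.append('up')
        elif a - b > eps:
            steps.append('down')
        else:
            steps.append('flat')
    return {
        'steps': steps,
        'always_up': all(s == 'up' for s in steps),          # G(up) over the window
        'eventually_down': any(s == 'down' for s in steps),  # F(down) over the window
    }
```

A hybrid model in the spirit of L-LSTM/L-GRU would encode such propositions (e.g., as extra input features) so the recurrent cell sees an explicit, interpretable summary of recent local dynamics alongside the raw values.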
Toward Quantitative Models in Safety Assessment: A Case Study to Show Impact of Dose–Response Inference on hERG Inhibition Models
by Hasselgren, Catrin; Melnikov, Fjodor; Anger, Lennart T.
in Computer Simulation; Ether-A-Go-Go Potassium Channels - genetics; Potassium Channel Blockers - pharmacology
2022
Due to challenges with historical data and the diversity of assay formats, in silico models for safety-related endpoints are often based on discretized data instead of data on a natural continuous scale. Models for discretized endpoints have limitations in usage and interpretation that can impact compound design. Here, we present a consistent data inference approach, exemplified on two data sets of Ether-à-go-go-Related Gene (hERG) K+ inhibition data, for dose–response and screening experiments that is generally applicable to in vitro assays. hERG inhibition has been associated with severe cardiac effects and is one of the more prominent safety targets assessed in drug development, using a wide array of in vitro and in silico screening methods. In this study, the IC50 for hERG inhibition is estimated from diverse historical proprietary data. The IC50 derived from a two-point proprietary screening data set demonstrated high correlation (R = 0.98, MAE = 0.08) with IC50s derived from six-point dose–response curves. Similar IC50 estimation accuracy was obtained on a public thallium flux assay data set (R = 0.90, MAE = 0.2). The IC50 data were used to develop a robust quantitative model. The model’s MAE (0.47) and R2 (0.46) were on par with literature statistics and approached assay reproducibility. Using a continuous model has high value for pharmaceutical projects, as it enables rank ordering of compounds and evaluation of compounds against project-specific inhibition thresholds. This data inference approach can be widely applicable to assays with quantitative readouts and has the potential to impact experimental design and improve model performance, interpretation, and acceptance across many standard safety endpoints.
Journal Article
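The inference of an IC50 from sparse screening points, as in the two-point data set above, is usually anchored in the Hill equation. Assuming a Hill slope of 1 (a simplification this sketch makes; the paper's actual inference procedure is not spelled out in the abstract), a single (concentration, % inhibition) pair gives I/100 = c / (c + IC50), i.e. IC50 = c·(100 − I)/I, and two points can be combined on the log scale:

```python
import math

def ic50_from_inhibition(conc_uM, pct_inh):
    """IC50 from one (concentration, % inhibition) point under an assumed
    Hill slope of 1:  I/100 = c / (c + IC50)  =>  IC50 = c * (100 - I) / I."""
    return conc_uM * (100.0 - pct_inh) / pct_inh

def ic50_from_two_points(points):
    """Combine two screening points by averaging their single-point IC50
    estimates on the log10 scale, where dose-response error is closer to
    symmetric. points: iterable of (concentration_uM, pct_inhibition)."""
    logs = [math.log10(ic50_from_inhibition(c, i)) for c, i in points]
    return 10 ** (sum(logs) / len(logs))
```

For example, 50% inhibition at 1 µM and 90% at 9 µM are both consistent with an IC50 of 1 µM, and the two-point estimate recovers exactly that.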
Causal inference and the data-fusion problem
2016
We review concepts, principles, and tools that unify current approaches to causal analysis and attend to new challenges presented by big data. In particular, we address the problem of data fusion—piecing together multiple datasets collected under heterogeneous conditions (i.e., different populations, regimes, and sampling methods) to obtain valid answers to queries of interest. The availability of multiple heterogeneous datasets presents new opportunities to big data analysts, because the knowledge that can be acquired from combined data would not be possible from any individual source alone. However, the biases that emerge in heterogeneous environments require new analytical tools. Some of these biases, including confounding, sampling selection, and cross-population biases, have been addressed in isolation, largely in restricted parametric models. We here present a general, nonparametric framework for handling these biases and, ultimately, a theoretical solution to the problem of data fusion in causal inference tasks.
Journal Article