Catalogue Search | MBRL
Explore the vast range of titles available.
969 results for "Massive data points"
Toward data-driven, dynamical complex systems approaches to disaster resilience
by Cutter, Susan L.; Rao, P. Suresh C.; Yabe, Takahiro
in Big Data, Complex systems, Disaster studies
2022
With rapid urbanization and increasing climate risks, enhancing the resilience of urban systems has never been more important. Despite the availability of massive datasets of human behavior (e.g., mobile phone data, satellite imagery), studies on disaster resilience have been limited to using static measures as proxies for resilience. However, static metrics have significant drawbacks such as their inability to capture the effects of compounding and accumulating disaster shocks; dynamic interdependencies of social, economic, and infrastructure systems; and critical transitions and regime shifts, which are essential components of the complex disaster resilience process. In this article, we argue that the disaster resilience literature needs to take the opportunities of big data and move toward a different research direction, which is to develop data-driven, dynamical complex systems models of disaster resilience. Data-driven complex systems modeling approaches could overcome the drawbacks of static measures and allow us to quantitatively model the dynamic recovery trajectories and intrinsic resilience characteristics of communities in a generic manner by leveraging large-scale and granular observations. This approach brings a paradigm shift in modeling the disaster resilience process and its linkage with the recovery process, paving the way to answering important questions for policy applications via counterfactual analysis and simulations.
Journal Article
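The dynamical modeling the authors call for can be pictured with a toy recovery-curve fit: given a post-disaster activity series (e.g., aggregated mobility), estimate a recovery-rate parameter rather than a static proxy. The exponential functional form, the parameters, and the synthetic data below are illustrative assumptions, not the authors' model.

```python
# Toy sketch of a data-driven recovery trajectory, in the spirit of the
# dynamical approach argued for above. NOT the authors' model: functional
# form, parameters, and synthetic "mobility" data are assumptions only.
import numpy as np
from scipy.optimize import curve_fit

def recovery(t, r, y0):
    """Exponential return toward the pre-disaster baseline (normalized to 1)."""
    return 1.0 - (1.0 - y0) * np.exp(-r * t)

rng = np.random.default_rng(0)
t = np.arange(0, 60)                          # days since the disaster
y_true = recovery(t, r=0.08, y0=0.35)         # latent recovery trajectory
y_obs = y_true + rng.normal(0, 0.03, t.size)  # noisy activity observations

(r_hat, y0_hat), _ = curve_fit(recovery, t, y_obs, p0=[0.1, 0.5])
print(f"estimated recovery rate r = {r_hat:.3f} per day "
      f"(initial drop to {y0_hat:.0%} of baseline)")
```

Here the fitted rate r is itself the resilience characteristic, something a static proxy cannot express.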
Mitigating the Multicollinearity Problem and Its Machine Learning Approach: A Review
by Cheng, Wai Khuen; Chan, Jireh Yi-Le; Leow, Steven Mun Hong
in Artificial intelligence, Big Data, Data analysis
2022
Technologies have driven big data collection across many fields, such as genomics and business intelligence. This results in a significant increase in variables and data points (observations) collected and stored. Although this presents opportunities to better model the relationship between predictors and the response variables, this also causes serious problems during data analysis, one of which is the multicollinearity problem. The two main approaches used to mitigate multicollinearity are variable selection methods and modified estimator methods. However, variable selection methods may negate efforts to collect more data as new data may eventually be dropped from modeling, while recent studies suggest that optimization approaches via machine learning handle data with multicollinearity better than statistical estimators. Therefore, this study details the chronological developments to mitigate the effects of multicollinearity and up-to-date recommendations to better mitigate multicollinearity.
Journal Article
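A minimal sketch of the problem and of one "modified estimator" remedy discussed in the review: diagnose near-collinear predictors with variance inflation factors (VIF), then compare ordinary least squares with ridge regression on the same synthetic data. The data and settings are assumptions for illustration.

```python
# Diagnose multicollinearity with VIF, then mitigate it with ridge
# (a modified estimator). Synthetic, nearly collinear predictors.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2 * x1 + rng.normal(size=n)

def vif(X, j):
    """VIF_j = 1 / (1 - R^2) from regressing column j on the others."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

print("VIFs:", [round(vif(X, j), 1) for j in range(X.shape[1])])
print("OLS coefs:  ", LinearRegression().fit(X, y).coef_)   # unstable
print("Ridge coefs:", Ridge(alpha=1.0).fit(X, y).coef_)     # shrunk, stable
```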
A Review of Local Outlier Factor Algorithms for Outlier Detection in Big Data Streams
2021
Outlier detection is a statistical procedure that aims to find suspicious events or items that are different from the normal form of a dataset. It has drawn considerable interest in the field of data mining and machine learning. Outlier detection is important in many applications, including fraud detection in credit card transactions and network intrusion detection. There are two general types of outlier detection: global and local. Global outliers fall outside the normal range for an entire dataset, whereas local outliers may fall within the normal range for the entire dataset, but outside the normal range for the surrounding data points. This paper addresses local outlier detection. The best-known technique for local outlier detection is the Local Outlier Factor (LOF), a density-based technique. There are many LOF algorithms for a static data environment; however, these algorithms cannot be applied directly to data streams, which are an important type of big data. In general, local outlier detection algorithms for data streams are still deficient and better algorithms need to be developed that can effectively analyze the high velocity of data streams to detect local outliers. This paper presents a literature review of local outlier detection algorithms in static and stream environments, with an emphasis on LOF algorithms. It collects and categorizes existing local outlier detection algorithms and analyzes their characteristics. Furthermore, the paper discusses the advantages and limitations of those algorithms and proposes several promising directions for developing improved local outlier detection methods for data streams.
Journal Article
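For the static case, scikit-learn ships a standard LOF implementation; a minimal example, planting a point that is globally unremarkable but locally isolated, might look like this. (The streaming variants that are the paper's focus are not covered by this snippet.)

```python
# Batch (static) LOF example: flag a point that lies within the dataset's
# global range but far from its local neighborhood.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
local_outlier = np.array([[0.0, 4.0]])     # globally plausible, locally isolated
X = np.vstack([inliers, local_outlier])

lof = LocalOutlierFactor(n_neighbors=20)   # density relative to 20 neighbors
labels = lof.fit_predict(X)                # -1 = outlier, 1 = inlier
print("flagged as outliers:", np.where(labels == -1)[0])
print("LOF score of the planted point:",
      -lof.negative_outlier_factor_[-1])   # > 1 means lower local density
```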
Concentration of Tempered Posteriors and of Their Variational Approximations
2020
While Bayesian methods are extremely popular in statistics and machine learning, their application to massive data sets is often challenging, when possible at all. The classical MCMC algorithms are prohibitively slow when both the model dimension and the sample size are large. Variational Bayesian methods aim at approximating the posterior by a distribution in a tractable family 𝓕; MCMC is thus replaced by an optimization algorithm that is orders of magnitude faster. VB methods have been applied in such computationally demanding applications as collaborative filtering, image and video processing, and NLP, to name a few. However, despite good results in practice, the theoretical properties of these approximations are not well known. We propose a general oracle inequality that relates the quality of the VB approximation to the prior π and to the structure of 𝓕. We provide a simple condition that allows us to derive rates of convergence from this oracle inequality. We apply our theory to various examples. First, we show that for parametric models with log-Lipschitz likelihood, Gaussian VB leads to efficient algorithms and consistent estimators. We then study a high-dimensional example (matrix completion) and a nonparametric example (density estimation).
Journal Article
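A toy numerical illustration of the objects in this paper, under stated assumptions: a tempered posterior π_α ∝ (likelihood)^α × prior for a one-dimensional Bernoulli model in logit parameterization, approximated by a Gaussian q chosen to minimize KL(q ‖ π_α) on a grid. The grid-based KL and the specific model are illustrative simplifications, not the paper's general setting.

```python
# Tempered posterior on a grid + Gaussian variational approximation.
# Toy 1-D model: Bernoulli data, logit parameter z, prior z ~ N(0, 2^2).
import numpy as np
from scipy.integrate import trapezoid
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import norm

k, n, alpha = 7, 10, 0.8               # 7 successes in 10 trials; temper by alpha
z = np.linspace(-6, 6, 2001)           # grid over the logit parameter
loglik = k * np.log(expit(z)) + (n - k) * np.log(expit(-z))
log_post = alpha * loglik + norm.logpdf(z, 0, 2)    # unnormalized tempered posterior
log_post -= np.log(trapezoid(np.exp(log_post), z))  # normalize on the grid

def kl_q_pi(params):
    """KL(q || pi_alpha) for q = N(mu, sigma^2), evaluated on the grid."""
    mu, log_sigma = params
    logq = norm.logpdf(z, mu, np.exp(log_sigma))
    q = np.exp(logq)
    return trapezoid(q * (logq - log_post), z)

res = minimize(kl_q_pi, x0=[0.0, 0.0])   # variational fit of (mu, log sigma)
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(f"Gaussian VB: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}, KL = {res.fun:.4f}")
```

The attraction the abstract describes is visible even here: the fit is a two-parameter optimization rather than a sampling run.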
Improving big citizen science data: Moving beyond haphazard sampling
by Major, Richard E.; Rowley, Jodi J. L.; Callaghan, Corey T.
in Bias, Biodiversity, Biology and Life Sciences
2019
Citizen science is mainstream: millions of people contribute data to a growing array of citizen science projects annually, forming massive datasets that will drive research for years to come. Many citizen science projects implement a "leaderboard" framework, ranking the contributions based on number of records or species, encouraging further participation. But is every data point equally "valuable"? Citizen scientists collect data with distinct spatial and temporal biases, leading to unfortunate gaps and redundancies, which create statistical and informational problems for downstream analyses. Up to this point, the haphazard structure of the data has been seen as an unfortunate but unchangeable aspect of citizen science data. However, we argue here that this issue can actually be addressed: we provide a very simple, tractable framework that could be adapted by broadscale citizen science projects to allow citizen scientists to optimize the marginal value of their efforts, increasing the overall collective knowledge.
Journal Article
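One hypothetical way to operationalize "marginal value" is to direct the next record toward the least-sampled spatial cell. The gridding and the simulated bias below are assumptions for illustration; this captures the spirit of the framework, not its published implementation.

```python
# Hypothetical marginal-value heuristic: find the grid cell with the fewest
# existing records and suggest sampling there next.
import numpy as np

rng = np.random.default_rng(3)
# Simulated citizen-science records clustered near a city (spatial bias).
lon = np.clip(rng.normal(0.3, 0.15, 1000), 0, 1)
lat = np.clip(rng.normal(0.7, 0.15, 1000), 0, 1)

counts, xedges, yedges = np.histogram2d(lon, lat, bins=10, range=[[0, 1], [0, 1]])
i, j = np.unravel_index(np.argmin(counts), counts.shape)
print(f"most redundant cell holds {counts.max():.0f} records")
print(f"highest marginal value: cell ({i}, {j}) with {counts[i, j]:.0f} records, "
      f"lon {xedges[i]:.1f}-{xedges[i+1]:.1f}, lat {yedges[j]:.1f}-{yedges[j+1]:.1f}")
```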
A Multi-Resolution Approximation for Massive Spatial Datasets
2017
Automated sensing instruments on satellites and aircraft have enabled the collection of massive amounts of high-resolution observations of spatial fields over large spatial regions. If these datasets can be efficiently exploited, they can provide new insights on a wide variety of issues. However, traditional spatial-statistical techniques such as kriging are not computationally feasible for big datasets. We propose a multi-resolution approximation (M-RA) of Gaussian processes observed at irregular locations in space. The M-RA process is specified as a linear combination of basis functions at multiple levels of spatial resolution, which can capture spatial structure from very fine to very large scales. The basis functions are automatically chosen to approximate a given covariance function, which can be nonstationary. All computations involving the M-RA, including parameter inference and prediction, are highly scalable for massive datasets. Crucially, the inference algorithms can also be parallelized to take full advantage of large distributed-memory computing environments. In comparisons using simulated data and a large satellite dataset, the M-RA outperforms a related state-of-the-art method. Supplementary materials for this article are available online.
Journal Article
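The M-RA itself stacks basis functions across multiple resolutions; as a single-resolution caricature, a low-rank ("predictive process" style) approximation already shows how basis functions anchored at m knots cut GP regression from O(n³) toward O(nm²). Everything below (kernel, knots, noise level) is an illustrative assumption, not the paper's algorithm.

```python
# One-level basis-function GP approximation: represent the process through
# m knots, so the linear algebra is m x m instead of n x n.
import numpy as np

def k(a, b, ell=0.2):
    """Squared-exponential covariance on 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 500))          # observation locations
y = np.sin(6 * x) + rng.normal(0, 0.1, x.size)
u = np.linspace(0, 1, 15)                    # knots defining the basis

Kuu = k(u, u) + 1e-8 * np.eye(u.size)
B = k(x, u) @ np.linalg.inv(Kuu)             # basis functions evaluated at x
# Low-rank GP regression: y ~ N(B w, tau^2 I), w ~ N(0, Kuu)
tau2 = 0.1**2
A = B.T @ B / tau2 + np.linalg.inv(Kuu)      # m x m posterior precision
w = np.linalg.solve(A, B.T @ y / tau2)       # posterior mean of basis weights

x_new = np.array([0.25, 0.5, 0.75])
print("predictions:", k(x_new, u) @ np.linalg.inv(Kuu) @ w)
```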
A literature review on one-class classification and its potential applications in big data
by Abdollah Zadeh, Azadeh; Seliya, Naeem; Khoshgoftaar, Taghi M.
in Application, Big Data, Class imbalance
2021
In severely imbalanced datasets, using traditional binary or multi-class classification typically leads to bias towards the class(es) with the much larger number of instances. Under such conditions, modeling and detecting instances of the minority class is very difficult. One-class classification (OCC) is an approach that detects abnormal data points relative to the instances of the known class, and it can serve to address issues related to severely imbalanced datasets, which are especially common in big data. We present a detailed survey of OCC-related work published over approximately the last decade. We group the different works into three categories: outlier detection, novelty detection, and deep learning and OCC. We closely examine and evaluate selected works on OCC so that a good cross section of approaches, methods, and application domains is represented in the survey. Commonly used OCC techniques for outlier detection and for novelty detection, respectively, are discussed. We observe that one area largely omitted in the OCC literature is its application context for big data and the problems inherently associated with it, such as severe class imbalance, class rarity, noisy data, feature selection, and data reduction. We feel the survey will be appreciated by researchers working in these areas of big data.
Journal Article
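A minimal OCC sketch using scikit-learn's OneClassSVM: the model is trained on the known class only and then flags deviating points. The synthetic data and the ν setting are assumptions for illustration.

```python
# One-class classification: learn the boundary of the known class, then
# label new points as fitting it (+1) or abnormal (-1).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(5)
normal_train = rng.normal(0, 1, (500, 2))     # only the known class is seen
test = np.vstack([rng.normal(0, 1, (5, 2)),   # more normal points
                  rng.normal(6, 1, (5, 2))])  # rare, abnormal points

occ = OneClassSVM(kernel="rbf", nu=0.05).fit(normal_train)
print(occ.predict(test))   # +1 = fits the known class, -1 = abnormal
```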
cyCombine allows for robust integration of single-cell cytometry datasets within and across technologies
by Pedersen, Christina Bligaard; Nguyen, Jennifer; Leipold, Michael D.
in 631/114/2401, 631/114/2784, 631/114/794
2022
Combining single-cell cytometry datasets increases the analytical flexibility and the statistical power of data analyses. However, in many cases the full potential of co-analyses is not reached due to technical variance between data from different experimental batches. Here, we present cyCombine, a method to robustly integrate cytometry data from different batches, experiments, or even different experimental techniques, such as CITE-seq, flow cytometry, and mass cytometry. We demonstrate that cyCombine maintains the biological variance and the structure of the data, while minimizing the technical variance between datasets. cyCombine does not require technical replicates across datasets, and computation time scales linearly with the number of cells, allowing for integration of massive datasets. Robust, accurate, and scalable integration of cytometry data enables integration of multiple datasets for primary data analyses and the validation of results using public datasets.
Combining single-cell cytometry datasets increases the analytical flexibility and the statistical power of data analyses. Here, the authors present a method to robustly integrate cytometry data from different batches, experiments, or even different experimental techniques.
Journal Article
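cyCombine is an R package, so the following is not its API; as a rough Python caricature of per-cluster batch alignment, one can rescale each batch's marker values within each cell cluster to a shared location and spread. cyCombine's actual correction is more sophisticated than this sketch.

```python
# Caricature of batch correction: z-score each marker within each
# (cluster, batch) group so batch-level shifts are removed.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
df = pd.DataFrame({
    "batch":   rng.choice(["A", "B"], 1000),
    "cluster": rng.choice([0, 1, 2], 1000),
    "marker":  rng.normal(0, 1, 1000),
})
df.loc[df.batch == "B", "marker"] += 0.8     # simulated batch effect

def align(g):
    return (g - g.mean()) / g.std()          # z-score within (cluster, batch)

df["corrected"] = df.groupby(["cluster", "batch"])["marker"].transform(align)
print(df.groupby("batch")["corrected"].mean().round(3))   # batch means now ~0
```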
A Survey of Data Mining and Deep Learning in Bioinformatics
2018
The fields of medical science and health informatics have made great progress recently, and the generation, collection, and accumulation of massive data have created demand for in-depth analytics. Meanwhile, we are entering a new period in which novel technologies are starting to analyze and extract knowledge from tremendous amounts of data, bringing limitless potential for information growth. One fact that cannot be ignored is that machine learning and deep learning techniques play an increasingly significant role in the success of bioinformatics exploration of biological data, and a linkage has been established bridging these two data analytics techniques and bioinformatics in both industry and academia. This survey reviews recent research using data mining and deep learning approaches to analyze domain knowledge in bioinformatics. The authors give a brief but pithy summary of numerous data mining algorithms used for preprocessing, classification, and clustering, as well as various optimized neural network architectures in deep learning methods; their advantages and disadvantages in practical applications are also discussed and compared in terms of industrial usage. It is believed that this review provides valuable insights for those who wish to start using data analytics methods in bioinformatics.
Journal Article
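As a generic illustration of the clustering step such surveys cover, here is a k-means grouping of synthetic samples standing in for a preprocessed gene-expression matrix; the data and the cluster count are assumptions, not anything from the survey itself.

```python
# Group samples by expression profile with k-means after standardization.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
group1 = rng.normal(0, 1, (30, 50))          # 30 samples x 50 genes
group2 = rng.normal(1.5, 1, (30, 50))        # shifted expression profile
X = StandardScaler().fit_transform(np.vstack([group1, group2]))

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(labels))
```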
Uncertainty in big data analytics: survey, opportunities, and challenges
by Hariri, Reihaneh H.; Bowers, Kate M.; Fredericks, Erik M.
in Agriculture, Analytics, Artificial intelligence
2019
Big data analytics has gained wide attention from both academia and industry as the demand for understanding trends in massive datasets increases. Recent developments in sensor networks, cyber-physical systems, and the ubiquity of the Internet of Things (IoT) have increased the collection of data (including health care, social media, smart cities, agriculture, finance, education, and more) to an enormous scale. However, the data collected from sensors, social media, financial records, etc. is inherently uncertain due to noise, incompleteness, and inconsistency. The analysis of such massive amounts of data requires advanced analytical techniques for efficiently reviewing and/or predicting future courses of action with high precision and advanced decision-making strategies. As the amount, variety, and speed of data increase, so too does the uncertainty inherent within, leading to a lack of confidence in the resulting analytics process and the decisions made from it. In comparison to traditional data techniques and platforms, artificial intelligence techniques (including machine learning, natural language processing, and computational intelligence) provide more accurate, faster, and scalable results in big data analytics. Previous research and surveys on big data analytics tend to focus on one or two techniques or on specific application domains. However, little work has examined uncertainty as it applies to big data analytics, or the artificial intelligence techniques applied to such datasets. This article reviews previous work in big data analytics and presents a discussion of open challenges and future directions for recognizing and mitigating uncertainty in this domain.
Journal Article
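One simple, widely known way to surface the uncertainty this survey discusses is to report the spread of an ensemble's predictions rather than a bare point estimate; a sketch with a random forest on synthetic data (all assumptions, not from the survey) follows.

```python
# Expose predictive uncertainty via the spread across ensemble members.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, (300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 300)   # noisy observations

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
x_query = np.array([[0.0], [2.9]])              # interior vs edge of the data
per_tree = np.stack([t.predict(x_query) for t in forest.estimators_])
print("mean prediction: ", per_tree.mean(axis=0).round(3))
print("ensemble std dev:", per_tree.std(axis=0).round(3))
```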