Catalogue Search | MBRL

Missing Data Imputation in Balanced Construction for Incomplete Block Designs

by Rios, Nicholas , Yu, Haiyan , Chen, Jianbin in Algorithms , Analysis , Data analysis

2024

Observational data with massive sample sizes are often distributed across many local machines. From an experimental design perspective, investigators often want to identify the effect of new treatments (even ML algorithms) on many blocks of experimental data. Under time or budget constraints, assigning all treatments to each block is not always feasible. This creates incomplete responses with respect to a randomized complete block design (RCBD). These incomplete responses are missing by design. However, whether they can be estimated with missing-data imputation methods is not well understood. Thus, it is challenging to correctly identify the treatment effects with missing data. To this end, this paper provides a method for imputation and analysis of the responses with missing data. The proposed method consists of three steps: Reconstruction, Imputation, and ‘Complete’-data Analysis (RICA). The incomplete responses are imputed with the expectation-maximization (EM) algorithm. The RCBD model is then fitted to the resulting dataset. The identifiability result suggests that the missingness may be nonignorable within each block, but that the data of an incomplete design as a whole are missing by design when the design is balanced. Theoretical results on relative efficiency, in which balanced variance plays a role, also indicate when missing responses in incomplete designs should be imputed. Applications to real-world data verify the efficacy of this method.

Journal Article
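The impute-then-fit idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' RICA implementation: it assumes a plain additive model y_ij = mu + tau_i + beta_j on a treatments-by-blocks table, and it uses the classical iterative fill-in for missing cells, which matches the mean-parameter updates of EM under normal errors. The function name `impute_additive` and its parameters are hypothetical.

```python
import numpy as np

def impute_additive(y, tol=1e-10, max_iter=10_000):
    """Fill NaN cells of a treatments x blocks table.

    Assumes the additive model y_ij = mu + tau_i + beta_j (an assumption
    of this sketch, not a detail taken from the paper).
    """
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    # Start from the grand mean of the observed cells.
    filled = np.where(missing, np.nanmean(y), y)
    for _ in range(max_iter):
        # "M-step": least-squares fit of the additive model to the completed table.
        grand = filled.mean()
        row_eff = filled.mean(axis=1, keepdims=True) - grand
        col_eff = filled.mean(axis=0, keepdims=True) - grand
        fitted = grand + row_eff + col_eff
        # "E-step": replace only the missing cells by their fitted values.
        new = np.where(missing, fitted, y)
        if np.max(np.abs(new - filled)) < tol:
            return new
        filled = new
    return filled
```

Once the table is completed, the usual RCBD row and column means can be computed from it. Standard errors from such a completed table are too optimistic unless the degrees of freedom are reduced by the number of imputed cells.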

EMPIRICAL LIKELIHOOD METHODS FOR COMPLEX SURVEYS WITH DATA MISSING-BY-DESIGN

by Wu, Changbao , Chen, Min , Thompson, Mary E.

2018

We consider nonrandomized pretest-posttest designs with complex survey data for observational studies. We show that two-sample pseudo empirical likelihood methods provide efficient inferences on the treatment effect, with a missing-by-design feature used for forming the two samples and the baseline information incorporated through suitable constraints. The proposed maximum pseudo empirical likelihood estimators of the treatment effect are consistent and pseudo empirical likelihood ratio confidence intervals are constructed through bootstrap calibration methods. The proposed methods require estimation of propensity scores which depend on the underlying missing-by-design mechanism. A simulation study was conducted to examine finite sample performances of the proposed methods under different scenarios of nonignorable and ignorable missing patterns. An application to the International Tobacco Control Policy Evaluation Project Four Country Surveys is also presented to demonstrate the use of the proposed methods for examining the mode effect in survey data collection.

Journal Article

Data Fusion for Joining Income and Consumption Information using Different Donor-Recipient Distance Metrics

by Meinfelder, Florian , Schaller, Jannik in Data integration , Data sources , Daten

2022

Data fusion describes the method of combining data from (at least) two initially independent data sources to allow for joint analysis of variables which are not jointly observed. The fundamental idea is to base inference on identifying assumptions, and on common variables which provide information that is jointly observed in all the data sources. A popular class of methods dealing with this particular missing-data problem in practice is based on covariate-based nearest neighbour matching, whereas more flexible semi- or even fully parametric approaches seem underrepresented in applied data fusion. In this article we compare two different approaches of nearest neighbour hot deck matching: One, Random Hot Deck, is a variant of the covariate-based matching methods which was proposed by Eurostat, and can be considered as a ‘classical’ statistical matching method, whereas the alternative approach is based on Predictive Mean Matching. We discuss results from a simulation study where we deviate from previous analyses of marginal distributions and consider joint distributions of fusion variables instead, and our findings suggest that Predictive Mean Matching tends to outperform Random Hot Deck.

Journal Article
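For readers unfamiliar with Predictive Mean Matching, the following is a minimal sketch of the general technique, not the procedure or distance metrics evaluated in the article. It assumes a linear regression of the fusion variable on the common variables, fitted on the donor file only, and it omits the parameter draws used in multiple-imputation variants. The function name `pmm_impute` and its arguments are hypothetical.

```python
import numpy as np

def pmm_impute(x_donor, y_donor, x_recipient, k=1, rng=None):
    """Impute y for recipients by borrowing observed donor values.

    Simplified sketch: regression coefficients are point estimates,
    not draws from a posterior (an assumption of this example).
    """
    rng = np.random.default_rng(rng)
    y_donor = np.asarray(y_donor, dtype=float)
    # Design matrices with an intercept column.
    X_d = np.column_stack([np.ones(len(x_donor)), x_donor])
    X_r = np.column_stack([np.ones(len(x_recipient)), x_recipient])
    # Fit the linear model on donors, where y is observed.
    coef, *_ = np.linalg.lstsq(X_d, y_donor, rcond=None)
    pred_d = X_d @ coef
    pred_r = X_r @ coef
    imputed = np.empty(len(pred_r))
    for i, p in enumerate(pred_r):
        # The k donors whose predicted means are closest to the recipient's.
        nearest = np.argsort(np.abs(pred_d - p))[:k]
        # Borrow the observed value of one of them, chosen at random.
        imputed[i] = y_donor[rng.choice(nearest)]
    return imputed
```

Because every imputed value is an observed donor value, the imputations stay within the range of the real data, which is one reason the method is popular for fusing survey files.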

The Missing Indicator Approach for Accelerated Failure Time Model with Covariates Subject to Limits of Detection

by Chiou, Sy Han , Alyabs, Norah in Bias , complete-case analysis , Data analysis

2022

The limit of detection (LOD) is commonly encountered in observational studies when one or more covariate values fall outside the measuring ranges. Although the complete-case (CC) approach is widely employed in the presence of missing values, it could result in biased estimations or even become inapplicable in small sample studies. On the other hand, approaches such as the missing indicator (MDI) approach are attractive alternatives as they preserve sample sizes. This paper compares the effectiveness of different alternatives to the CC approach under different LOD settings with a survival outcome. These alternatives include substitution methods, multiple imputation (MI) methods, MDI approaches, and MDI-embedded MI approaches. We found that the MDI approach outperformed its competitors regarding bias and mean squared error in small sample sizes through extensive simulation.

Journal Article

Estimation within the new integrated system of household surveys in Germany

by Meinfelder, Florian , Navvabpour, Hamidreza , Münnich, Ralf in Data integration , Households , Labor force

2020

In 2015, the European Commission drafted a framework regulation for integrated European social statistics. This integration covers the Labour Force Survey, the Statistics on Income and Living Conditions, and others. In order to avoid an inappropriate response burden, administrative and other sources shall be considered to achieve accurate survey estimates. Combining information from different data sources has become a field of growing research interest among statistical offices and other institutions. In the statistical literature this problem is known as data fusion or statistical matching, and is widely considered a particular missing-data pattern. Assuming that budgets are limited, and that only some additional information can be obtained to improve the quality of the data fusion, we investigate different scenarios of using these limited resources within an integrated system of household surveys. Our main objective is to develop a framework that fosters on the one hand the estimation of statistical models using several surveys, and on the other hand classical totals for different sub-classes and areas which are of special interest for official statistics.

Journal Article

An imputation based empirical likelihood approach to pretest-posttest studies

by Wu, Changbao , Chen, Min , Thompson, Mary E. in 62G20 , Availability , Baseline information

2015

Pretest–posttest studies are an important and popular method for assessing treatment effects or the effectiveness of an intervention in many areas of scientific research. There are two distinct features for this type of study: availability of baseline information for all subjects in the study and missingness by design of measures of the responses. Important recent research advances on this topic include Leon et al. (2003) on efficient estimation of the treatment effect, and Huang et al. (2008) on a semi-parametric estimation procedure based on empirical likelihood (EL) where the mean responses for the treatment group and the control group are handled separately. EL ratio confidence intervals or tests for the treatment effect, however, cannot be constructed under the approach used by Huang et al. (2008). In this paper, we use an alternative EL formulation, which directly involves the parameter of interest, i.e., the treatment effect, and incorporates baseline information through an imputation approach. Our focus is to derive the EL ratio confidence intervals and tests for the treatment effect under the proposed imputation-based framework. Theoretical results are developed, and finite sample performances of the proposed methods with comparison to existing approaches are investigated through simulation studies. An application to a real data set is also presented.

Journal Article

Resource Allocation Among Simulation Time Steps

by Glasserman, Paul , Staum, Jeremy in Analysis of variance , Budget allocation , Cash flow

2003

Motivated by the problem of efficient estimation of expected cumulative rewards or cashflows, this paper proposes and analyzes a variance reduction technique for estimating the expectation of the sum of sequentially simulated random variables. In some applications, simulation effort is of greater value when applied to early time steps rather than shared equally among all time steps; this occurs, for example, when discounting renders immediate rewards or cashflows more important than those in the future. This suggests that deliberately stopping some paths early may improve efficiency. We formulate and solve the problem of optimal allocation of resources to time horizons with the objective of minimizing variance subject to a cost constraint. The solution has a simple characterization in terms of the convex hull of points defined by the covariance matrix of the cashflows. We also develop two ways to enhance variance reduction through early stopping. One takes advantage of the statistical theory of missing data. The other redistributes the cumulative sum to make optimal use of early stopping.

Journal Article

Designing and integrating composite networks for monitoring multivariate gaussian pollution fields

by Le, N. D. , Zidek, J. V. , Sun, W. in Air pollution , Bayesian method , Cost estimates

2000

Networks of ambient monitoring stations are used to monitor environmental pollution fields such as those for acid rain and air pollution. Such stations provide regular measurements of pollutant concentrations. The networks are established for a variety of purposes at various times, so several stations measuring different subsets of pollutant concentrations can often be found in compact geographical regions. The problem of statistically combining these disparate information sources into a single 'network' then arises. Capitalizing on the efficiencies so achieved can then lead to the secondary problem of extending this network. The subject of this paper is a set of 31 air pollution monitoring stations in southern Ontario. Each of these regularly measures a particular subset of ionic sulphate, sulphite, nitrite and ozone. However, this subset varies from station to station. For example, only two stations measure all four. Some measure just one. We describe a Bayesian framework for integrating the measurements of these stations to yield a spatial predictive distribution for unmonitored sites and unmeasured concentrations at existing stations. Furthermore, we show how this network can be extended by using an entropy maximization criterion. The methods assume that the multivariate response field being measured has a joint Gaussian distribution conditional on its mean and covariance function. A conjugate prior is used for these parameters, some of its hyperparameters being fitted empirically.

Journal Article

Evaluating the Effect of Planned Missing Designs in Structural Equation Model Fit Measures

by Vicente, Paula C R in Estimates , Missing data , Monte Carlo simulation

2023

In a planned missing design, the nonresponses occur according to the researcher’s will, with the goal of increasing data quality and avoiding overly extensive questionnaires. When fitting a structural equation model to the data, there are different criteria to evaluate how the theoretical model fits the observed data, with the root mean square error of approximation (RMSEA), standardized root mean square residual (SRMR), comparative fit index (CFI) and Tucker–Lewis index (TLI) being the most common. Here, I explore the effect of the nonresponses due to a specific planned missing design (the three-form design) on these fit indices when fitting a structural equation model. A simulation study was conducted with one correctly specified model and one model with a misspecified correlation between factors. The CFI, TLI and SRMR indices are affected by the nonresponses, particularly with small samples, low factor loadings and numerous observed variables. With misspecified models, the nonresponses cause unacceptable values for all four fit indices under analysis, particularly when a strong correlation between factors is considered. The simulations were performed with the simsem package in R, and the full information maximum-likelihood method was used for handling missing data during model fitting.

Journal Article

Complex long-term dynamics of pollinator abundance in undisturbed Mediterranean montane habitats over two decades

by Herrera, Carlos M. in Abundance , annual variation , Anthropogenic factors

2019

Current notions of "pollinator decline" and "pollination crisis" mainly arose from studies on pollinators of economic value in anthropogenic ecosystems of mid-latitude temperate regions. Comprehensive long-term pollinator data from biologically diverse, undisturbed communities are needed to evaluate the actual extent of the so-called "global pollination crisis." This paper analyzes the long-term dynamics of pollinator abundance in undisturbed Mediterranean montane habitats using pollinator visitation data for 65 plant species collected over two decades. Objectives are (1) to elucidate patterns of long-term changes in pollinator abundance from the perspectives of individual plant species, major pollinator groups, and the whole plant community and (2) to propose a novel methodological implementation based on combining a planned missing data design with the analytical strength of mixed effects models, which allows one to draw community-wide inferences on long-term pollinator trends in species-rich natural habitats. Probabilistic measurements ("patch visitation probability" and "flower visitation probability" per time unit) were used to assess pollinator functional abundance for each plant species on two separate, randomly chosen years. A total of 13,054 pollinator censuses accounting for a total watching effort of 2,877,039 flower-min were carried out on 299 different dates. Supra-annual instability in pollinator functional abundance was the rule, with visitation probability to flowering patches and/or individual flowers exhibiting significant heterogeneity between years in the majority of plant species (83%). At the plant community level, there was a significant linear increase in pollinator functional abundance over the study period. Probability of pollinator visitation to flowering patches and individual flowers increased due to increasing visitation by small solitary bees and, to a lesser extent, small beetles. Visitation to different plant species exhibited contrasting changes, and insect orders and genera differed widely in sign and magnitude of linear abundance trends, thus exemplifying the complex dynamics of community-wide changes in pollinator functional abundance. Results of this investigation indicate that pollinator declines are not universal beyond anthropogenic ecosystems; stress the need for considering broader ecological scenarios and comprehensive samples of plants and pollinators; and illustrate the crucial importance of combining ambitious sampling designs with powerful analytical schemes to draw reliable inferences on pollinator trends at the plant community level.

Journal Article
