The anchoring method: Estimation of interviewer effects in the absence of interpenetrated sample assignment
Methodological studies of the effects that human interviewers have on the quality of survey data have long been limited by a critical assumption: that interviewers in a given survey are assigned random subsets of the larger overall sample (also known as interpenetrated assignment). Absent this type of study design, estimates of interviewer effects on survey measures of interest may reflect differences between interviewers in the characteristics of their assigned sample members, rather than recruitment or measurement effects specifically introduced by the interviewers. Previous attempts to approximate interpenetrated assignment have typically used regression models to condition on factors that might be related to interviewer assignment. We introduce a new approach for overcoming this lack of interpenetrated assignment when estimating interviewer effects. This approach, which we refer to as the "anchoring" method, leverages correlations between observed variables that are unlikely to be affected by interviewers ("anchors") and variables that may be prone to interviewer effects to remove components of within-interviewer correlations that a lack of interpenetrated assignment may introduce. We consider both frequentist and Bayesian approaches, where the latter can make use of information about interviewer effect variances in previous waves of a study, if available. We evaluate this new methodology empirically using a simulation study, and then illustrate its application using real survey data from the Behavioral Risk Factor Surveillance System (BRFSS), where interviewer IDs are provided on public-use data files. While our proposed method shares some of the limitations of the traditional approach - namely the need for variables associated with the outcome of interest that are also free of measurement error - it avoids the need for conditional inference and thus has improved inferential qualities when the focus is on marginal estimates, and it shows evidence of further reducing overestimation of larger interviewer effects relative to the traditional approach.
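To convey the intuition behind anchoring (this is an illustrative method-of-moments caricature, not the estimator developed in the paper; all data and names below are simulated assumptions), the following Python sketch contrasts a naive between-interviewer variance for an outcome with a version that subtracts the component predictable from an anchor variable that interviewers cannot influence:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated data: 50 interviewers with 40 respondents each. The anchor
# (e.g., respondent age) cannot be affected by interviewers but is
# unevenly distributed across them, mimicking nonrandom assignment.
n_int, n_per = 50, 40
iid = np.repeat(np.arange(n_int), n_per)
assign = rng.normal(size=n_int)[iid]             # assignment imbalance
anchor = assign + rng.normal(size=n_int * n_per)
y = (0.6 * anchor + 0.3 * rng.normal(size=n_int)[iid]
     + rng.normal(size=n_int * n_per))           # true interviewer SD = 0.3

df = pd.DataFrame({"iid": iid, "anchor": anchor, "y": y})

# Between-interviewer variances of interviewer-level means.
var_b_y = df.groupby("iid")["y"].mean().var(ddof=1)
var_b_a = df.groupby("iid")["anchor"].mean().var(ddof=1)

# Respondent-level slope of the outcome on the anchor.
beta = np.polyfit(df["anchor"], df["y"], 1)[0]

# The naive estimate attributes all between-interviewer variance to
# interviewers; the anchored version removes the share explained by
# anchor imbalance across assignments.
print("naive:", var_b_y)
print("anchored:", var_b_y - beta**2 * var_b_a)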
A note on multiply robust predictive mean matching imputation with complex survey data
Predictive mean matching is a commonly used imputation procedure for addressing the problem of item nonresponse in surveys. The customary approach relies upon the specification of a single outcome regression model. In this note, we propose a novel predictive mean matching procedure that allows the user to specify multiple outcome regression models. The resulting estimator is multiply robust in the sense that it remains consistent if at least one of the specified outcome regression models is correctly specified. The results from a simulation study suggest that the proposed method performs well in terms of bias and efficiency.
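As one plausible rendering of such a procedure (the matching metric and the working models below are our own illustrative assumptions, not the paper's specification), predicted means from each candidate model can be stacked and standardized, with each nonrespondent matched to the nearest respondent donor in that predicted-mean space:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Simulated data: covariates X observed for all, outcome y missing for some.
n = 500
X = rng.normal(size=(n, 3))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=n)
resp = rng.random(n) > 0.3                     # True = respondent

# Two candidate outcome regression models; only one needs to be right.
designs = [X, np.column_stack([X, X[:, 1] ** 2])]
preds = [LinearRegression().fit(D[resp], y[resp]).predict(D) for D in designs]
P = StandardScaler().fit_transform(np.column_stack(preds))

# Match each nonrespondent to its nearest respondent donor and impute
# the donor's observed outcome.
_, idx = NearestNeighbors(n_neighbors=1).fit(P[resp]).kneighbors(P[~resp])
y_imputed = y.copy()
y_imputed[~resp] = y[resp][idx.ravel()]
print(y_imputed.mean())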
Optimum allocation for a dual-frame telephone survey
Careful design of a dual-frame random digit dial (RDD) telephone survey requires selecting from among many options that have varying impacts on cost, precision, and coverage in order to obtain the best possible implementation of the study goals. One such consideration is whether to screen cell-phone households in order to interview cell-phone only (CPO) households and exclude dual-user households, or to take all interviews obtained via the cell-phone sample. We present a framework in which to consider the tradeoffs between these two options and a method to select the optimal design. We derive and discuss the optimum allocation of sample size between the two sampling frames and explore the choice of the optimum mixing parameter for the dual-user domain. We illustrate our methods using a dual-frame telephone survey sponsored by the Centers for Disease Control and Prevention.
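As a simplified illustration of the allocation step (ignoring screening and the composite mixing of the dual-user domain, with all numbers invented), minimizing variance under a linear cost model yields the familiar square-root-cost rule n_h proportional to N_h * S_h / sqrt(c_h):

import numpy as np

# Invented inputs: frame sizes, unit standard deviations, and per-interview
# costs for the landline (index 0) and cell-phone (index 1) frames.
N = np.array([60e6, 40e6])        # frame sizes
S = np.array([0.48, 0.50])        # within-frame SDs of the survey variable
c = np.array([30.0, 55.0])        # cost per completed interview
budget = 500_000.0

# Optimum allocation under cost C = sum(c_h * n_h): n_h ~ N_h S_h / sqrt(c_h),
# scaled so that total interviewing cost exhausts the budget.
K = budget / np.sum(N * S * np.sqrt(c))
n = K * N * S / np.sqrt(c)
print(np.round(n))                # interviews per frame
print(np.sum(c * n))              # check: equals the budget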
Combining information from multiple complex surveys
This manuscript describes the use of multiple imputation to combine information from multiple surveys of the same underlying population. We use a newly developed method to generate synthetic populations nonparametrically using a finite population Bayesian bootstrap that automatically accounts for complex sample designs. We then analyze each synthetic population with standard complete-data software for simple random samples and obtain valid inference by combining the point and variance estimates using extensions of existing combining rules for synthetic data. We illustrate the approach by combining data from the 2006 National Health Interview Survey (NHIS) and the 2006 Medical Expenditure Panel Survey (MEPS).
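The combining step builds on Rubin's standard multiple imputation rules, which the Python sketch below implements; the paper's extensions for synthetic data modify the variance formula and are not reproduced here.

import numpy as np
from scipy import stats

def rubin_combine(q, u, alpha=0.05):
    """Combine m point estimates q and within variances u by Rubin's rules."""
    q, u = np.asarray(q, float), np.asarray(u, float)
    m = len(q)
    qbar = q.mean()                       # combined point estimate
    w = u.mean()                          # average within-imputation variance
    b = q.var(ddof=1)                     # between-imputation variance
    t = w + (1 + 1 / m) * b               # total variance
    df = (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2
    half = stats.t.ppf(1 - alpha / 2, df) * np.sqrt(t)
    return qbar, t, (qbar - half, qbar + half)

# Hypothetical estimates from m = 5 synthetic populations.
print(rubin_combine([2.1, 2.3, 1.9, 2.2, 2.0],
                    [0.04, 0.05, 0.04, 0.05, 0.04]))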
A nonparametric method to generate synthetic populations to adjust for complex sampling design features
Outside of the survey sampling literature, samples are often assumed to be generated by a simple random sampling process that produces independent and identically distributed (IID) samples. Many statistical methods are developed largely for this IID setting. Applying these methods to data from complex sample surveys without making allowance for the survey design features can lead to erroneous inferences. Hence, much time and effort have been devoted to developing statistical methods that analyze complex survey data while accounting for the sample design. This issue is particularly important when generating synthetic populations using finite population Bayesian inference, as is often done in missing data or disclosure risk settings, or when combining data from multiple surveys. Extending previous work in the finite population Bayesian bootstrap literature, we propose a method to generate synthetic populations from a posterior predictive distribution in a fashion that inverts the complex sampling design features and generates simple random samples from a superpopulation point of view, adjusting the complex data so that they can be analyzed as simple random samples. We consider a simulation study with a stratified, clustered, unequal-probability sample design, and use the proposed nonparametric method to generate synthetic populations for the 2006 National Health Interview Survey (NHIS) and the Medical Expenditure Panel Survey (MEPS), both of which use stratified, clustered, unequal-probability sample designs.
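One published form of the weighted finite population Bayesian bootstrap fills in the N - n non-sampled units by a weighted Pólya urn; the minimal Python rendering below assumes the design weights sum to roughly the population size and is meant only to show the mechanics, not to reproduce this paper's full procedure.

import numpy as np

def weighted_fpbb(y, w, rng):
    """One synthetic population via a weighted Polya urn.

    y: sampled values; w: design weights, assumed to sum to roughly the
    population size N. Each of the N - n non-sampled slots is filled by
    drawing a sampled unit with probability proportional to
    w_i - 1 + l_i * (N - n) / n, where l_i counts prior redraws of unit i.
    """
    n = len(y)
    N = int(round(w.sum()))
    l = np.zeros(n)
    draws = np.empty(N - n, dtype=int)
    for k in range(N - n):
        p = w - 1 + l * (N - n) / n
        draws[k] = rng.choice(n, p=p / p.sum())
        l[draws[k]] += 1
    return np.concatenate([y, y[draws]])

rng = np.random.default_rng(2)
y = rng.normal(size=100)
w = np.full(100, 10.0)              # equal weights: N = 1,000
pop = weighted_fpbb(y, w, rng)
print(len(pop), round(pop.mean(), 3))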
Bayesian inference for finite population quantiles from unequal probability samples
This paper develops two Bayesian methods for inference about finite population quantiles of continuous survey variables from unequal probability sampling. The first method estimates cumulative distribution functions of the continuous survey variable by fitting a number of probit penalized spline regression models on the inclusion probabilities. The finite population quantiles are then obtained by inverting the estimated distribution function. This method is quite computationally demanding. The second method predicts non-sampled values by assuming a smoothly-varying relationship between the continuous survey variable and the probability of inclusion, modeling both the mean function and the variance function using splines. The two Bayesian spline-model-based estimators yield a desirable balance between robustness and efficiency. Simulation studies show that both methods yield smaller root mean squared errors than the sample-weighted estimator and the ratio and difference estimators described by Rao, Kovar, and Mantel (RKM 1990), and are more robust to model misspecification than the regression-through-the-origin model-based estimator described in Chambers and Dunstan (1986). When the sample size is small, the 95% credible intervals of the two new methods have coverage closer to the nominal level than confidence intervals based on the sample-weighted estimator.
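Schematically (notation ours, chosen for exposition), the first method estimates the finite population distribution function and then inverts it: with s the sample, N the population size, and \hat{P} the posterior predictive probability from the probit spline fit at inclusion probability \pi_j,

\[
\hat{F}_N(t) = \frac{1}{N}\Big( \sum_{i \in s} I(y_i \le t) + \sum_{j \notin s} \hat{P}(y_j \le t \mid \pi_j) \Big),
\qquad
\hat{Q}(p) = \inf\{\, t : \hat{F}_N(t) \ge p \,\}.
\]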
Bayesian penalized spline model-based inference for finite population proportion in unequal probability sampling
We propose a Bayesian Penalized Spline Predictive (BPSP) estimator for a finite population proportion in an unequal probability sampling setting. This new method allows the probabilities of inclusion to be directly incorporated into the estimation of a population proportion, using a probit regression of the binary outcome on the penalized spline of the inclusion probabilities. The posterior predictive distribution of the population proportion is obtained using Gibbs sampling. The advantages of the BPSP estimator over the Hájek (HK), Generalized Regression (GR), and parametric model-based prediction estimators are demonstrated by simulation studies and a real example in tax auditing. Simulation studies show that the BPSP estimator is more efficient, and its 95% credible interval provides better confidence coverage with shorter average width than the HK and GR estimators, especially when the population proportion is close to zero or one or when the sample is small. Compared to linear model-based predictive estimators, the BPSP estimator is robust to model misspecification and influential observations in the sample.
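A quick frequentist analogue of the prediction step (the paper's fit is Bayesian, via Gibbs sampling; everything below, including the data, is an assumption for illustration) uses a probit GLM on a B-spline basis of the inclusion probabilities:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Simulated population: binary y whose prevalence varies with the
# inclusion probability pi, sampled with unequal probabilities.
N = 20_000
pi = rng.uniform(0.01, 0.2, size=N)
y = (rng.random(N) < 1 / (1 + np.exp(2 - 12 * pi))).astype(int)
s = rng.random(N) < pi

df = pd.DataFrame({"y": y[s], "pi": pi[s]})
fit = smf.glm(
    "y ~ bs(pi, df=5, lower_bound=0.01, upper_bound=0.2)",
    data=df,
    family=sm.families.Binomial(link=sm.families.links.Probit()),
).fit()

# Predict the non-sampled units from their inclusion probabilities and
# combine observed and predicted values into a population proportion.
pred = fit.predict(pd.DataFrame({"pi": pi[~s]}))
print((y[s].sum() + pred.sum()) / N, y.mean())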
Optimal sample allocation for design-consistent regression in a cancer services survey when design variables are known for aggregates
We consider optimal sampling rates in element-sampling designs when the anticipated analysis is survey-weighted linear regression and the estimands of interest are linear combinations of regression coefficients from one or more models. Methods are first developed assuming that exact design information is available in the sampling frame and then generalized to situations in which some design variables are available only as aggregates for groups of potential subjects, or from inaccurate or old data. We also consider design for estimation of combinations of coefficients from more than one model. A further generalization allows for flexible combinations of coefficients chosen to improve estimation of one effect while controlling for another. Potential applications include estimation of means for several sets of overlapping domains, or improving estimates for subpopulations such as minority races by disproportionate sampling of geographic areas. In the motivating problem of designing a survey on care received by cancer patients (the CanCORS study), potential design information included block-level census data on race/ethnicity and poverty as well as individual-level data. In one study site, an unequal-probability sampling design using the subjects' residential addresses and census data would have reduced the variance of the estimator of an income effect by 25%, or by 38% if the subjects' races were also known. With flexible weighting of the income contrasts by race, the variance of the estimator would be reduced by 26% using residential addresses alone and by 52% using addresses and races. Our methods would be useful in studies in which geographic oversampling by race/ethnicity or socioeconomic characteristics is considered, or in any study in which characteristics available in sampling frames are measured with error.
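A toy version of the allocation problem (equal costs, a single contrast of stratum means, invented numbers, and not the paper's regression-based machinery): the variance sum_h a_h^2 S_h^2 / n_h is minimized at n_h proportional to |a_h| * S_h, which the Python sketch checks against a numeric optimizer.

import numpy as np
from scipy.optimize import minimize

# Invented three-stratum example: a_h are contrast coefficients (e.g.,
# an income effect built from stratum means) and S_h within-stratum SDs.
a = np.array([1.0, -0.5, -0.5])
S = np.array([0.9, 1.1, 1.4])
n_total = 600.0

def var_contrast(n):
    return np.sum(a**2 * S**2 / n)

# Numeric solution with the constraint sum(n_h) = n_total enforced by a
# softmax parameterization of the allocation shares.
to_n = lambda x: np.exp(x) / np.exp(x).sum() * n_total
res = minimize(lambda x: var_contrast(to_n(x)), x0=np.zeros(3))

# Closed form for comparison: n_h proportional to |a_h| * S_h.
n_closed = np.abs(a) * S / np.sum(np.abs(a) * S) * n_total
print(np.round(to_n(res.x), 1), np.round(n_closed, 1))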
A variation of the housing unit method for estimating the population of small, rural areas: a case study of the local expert procedure
A demographic approach to the evaluation of the 1986 census and the estimates of Canada's population
"A significant increase in coverage error in the 1986 [Canadian] Census is revealed by both the Reverse Record Check and the demographic method presented in this paper. Considerable attention is paid to an evaluation of the various components of population growth, especially interprovincial migration. The paper concludes with an overview of two alternative methods for generating postcensal estimates: the currently-in-use, census-based model, and a flexible model using all relevant data in combination with the census."
Modeling matching error and its effect on estimates of census coverage error
"In this paper, we propose a model for investigating the effect of matching error on the estimators of census undercount and illustrate its use for the 1990 [U.S.] census undercount evaluation program. The mean square error [MSE] of the dual system estimator is derived under the proposed model and the components of MSE arising from matching error are defined and explained. Under the assumed model, the effect of matching error on the MSE of the estimator of census undercount is investigated. Finally, a methodology for employing the model for the optimal design of matching error evaluation studies will be illustrated and the form of the estimators will be given."
The 1986 Test of Adjustment Related Operations in Central Los Angeles County
The author presents the methodology and results of a 1986 test census conducted in Central Los Angeles County, California, to examine the feasibility of adjusting the census for the estimated undercount using a post-enumeration survey. "The results of the dual-system estimates are presented for the test site by the three major race/ethnic groups (Hispanic, Asian, Other) by tenure, by age and by sex. Summaries of the small area adjustments of the census enumeration, by block, are presented and discussed."
Handling missing data in coverage estimation, with application to the 1986 Test of Adjustment Related Operations
"This paper discusses methods used to handle missing data in post-enumeration surveys for estimating census coverage error, as illustrated for the 1986 Test of Adjustment Related Operations (Diffendal 1988). The methods include imputation schemes based on hot-deck and logistic regression models as well as weighting adjustments. The sensivity of undercount estimates from the 1986 test to variations in the imputation models is also explored." The test was carried out in Central Los Angeles County, California.
Measuring accuracy in a post-enumeration survey
"The U.S. Bureau of the Census will use a post-enumeration survey to measure the coverage of the 1990 Decennial Census. The Census Bureau has developed and tested new procedures aimed at increasing the accuracy of the survey. This paper describes the new methods. It discusses the categories of error that occur in a post-enumeration survey and means of evaluation to determine that the results are accurate. The new methods and the evaluation of the methods are discussed in the context of a recent test post-enumeration survey."
Fully Synthetic Data for Complex Surveys
When seeking to release public use files for confidential data, statistical agencies can generate fully synthetic data. We propose an approach for making fully synthetic data from surveys collected with complex sampling designs. Our approach adheres to the general strategy proposed by Rubin (1993). Specifically, we generate pseudo-populations by applying the weighted finite population Bayesian bootstrap to account for survey weights, take simple random samples from those pseudo-populations, estimate synthesis models using these simple random samples, and release simulated data drawn from the models as public use files. To facilitate variance estimation, we use the framework of multiple imputation with two data generation strategies. In the first, we generate multiple data sets from each simple random sample. In the second, we generate a single synthetic data set from each simple random sample. We present multiple imputation combining rules for each setting. We illustrate the repeated sampling properties of the combining rules via simulation studies, including comparisons with synthetic data generation based on pseudo-likelihood methods. We apply the proposed methods to a subset of data from the American Community Survey.
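End to end, the pipeline can be caricatured in a few lines of Python; the sketch below substitutes a crude weighted resample for the weighted finite population Bayesian bootstrap and a lognormal fit for whatever synthesis model an agency would actually use, so it conveys the flow of the four steps rather than the method's details.

import numpy as np

rng = np.random.default_rng(4)

# Hypothetical weighted sample (y, w); weights sum to roughly N. The crude
# weighted resample below stands in for the weighted finite population
# Bayesian bootstrap, and the lognormal fit for a real synthesis model.
n = 300
y = rng.gamma(2.0, 10.0, size=n)
w = rng.uniform(5.0, 15.0, size=n)
N = int(round(w.sum()))

m = 5                                  # synthetic data sets to release
synthetic = []
for _ in range(m):
    # Step 1: build a pseudo-population that undoes the weighting.
    pop = rng.choice(y, size=N, p=w / w.sum())
    # Step 2: draw a simple random sample from the pseudo-population.
    srs = rng.choice(pop, size=n, replace=False)
    # Step 3: estimate a synthesis model from the simple random sample.
    mu, sd = np.log(srs).mean(), np.log(srs).std(ddof=1)
    # Step 4: release simulated data drawn from the fitted model.
    synthetic.append(np.exp(rng.normal(mu, sd, size=n)))

print([round(x.mean(), 1) for x in synthetic])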
