INTERNATIONAL STATISTICAL REVIEW

A Spatial Variance-Smoothing Area Level Model for Small Area Estimation of Demographic Rates
Gao PA and Wakefield J
Accurate estimates of subnational health and demographic indicators are critical for informing policy. Many countries collect relevant data using complex household surveys, but when data are limited, direct weighted estimates of small area proportions may be unreliable. Area level models treating these direct estimates as response data can improve precision but often require known sampling variances of the direct estimators for all areas. In practice, the sampling variances are estimated, so standard approaches do not account for a key source of uncertainty. To account for variability in the estimated sampling variances, we propose a hierarchical Bayesian spatial area level model for small area proportions that smooths both the estimated proportions and sampling variances to produce point and interval estimates of rates of interest. We demonstrate the performance of our approach via simulation and application to vaccination coverage and HIV prevalence data from the Demographic and Health Surveys.
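A minimal sketch of an area-level model with a second stage for the estimated sampling variances (the notation and the scaled chi-squared variance model are illustrative choices, not necessarily the authors' exact specification):

\[
\operatorname{logit}(\hat{p}_i) \mid \theta_i, V_i \sim N(\theta_i, V_i), \qquad
\theta_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + u_i + s_i,
\]
\[
\hat{V}_i \mid V_i \sim \frac{V_i}{d_i}\,\chi^2_{d_i}, \qquad
\log V_i \ \text{given a spatial smoothing prior},
\]

where \(\hat{p}_i\) and \(\hat{V}_i\) are the direct weighted estimate and its estimated sampling variance for area \(i\) (on the logit scale), \(u_i\) and \(s_i\) are unstructured and spatially structured random effects, and \(d_i\) is an effective degrees of freedom for the design-based variance estimator.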
Survival Modelling For Data From Combined Cohorts: Opening the Door to Meta Survival Analyses and Survival Analysis using Electronic Health Records
McVittie JH, Best AF, Wolfson DB, Stephens DA, Wolfson J, Buckeridge DL and Gadalla SM
Non-parametric estimation of the survival function using observed failure time data depends on the underlying data generating mechanism, including the ways in which the data may be censored and/or truncated. For data arising from a single source or collected from a single cohort, a wide range of estimators have been proposed and compared in the literature. Often, however, it may be possible, and indeed advantageous, to combine and then analyze survival data that have been collected under different study designs. We review non-parametric survival analysis for data obtained by combining the most common types of cohort. We have two main goals: (i) to clarify the differences in the model assumptions, and (ii) to provide a single lens through which some of the proposed estimators may be viewed. Our discussion is relevant to the meta-analysis of survival data obtained from different types of study, and to the modern era of electronic health records.
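As one concrete instance of the design issues raised above, a cohort with staggered entry yields left-truncated, right-censored failure times, and the counting-process form of the product-limit estimator accounts for the delayed entry. A minimal R sketch on simulated data (the simulation design and variable names are illustrative only):

library(survival)

set.seed(1)
n <- 500
entry   <- runif(n, 0, 5)                  # age at study entry (left-truncation time)
failure <- rexp(n, rate = 0.2)             # latent failure time measured from the origin
keep    <- failure > entry                 # only subjects still at risk at entry are observed
entry   <- entry[keep]; failure <- failure[keep]
cens    <- entry + rexp(length(entry), rate = 0.1)   # independent right censoring after entry
time    <- pmin(failure, cens)
status  <- as.numeric(failure <= cens)

# Product-limit estimator allowing for delayed entry (left truncation + right censoring)
fit <- survfit(Surv(entry, time, status) ~ 1)
summary(fit, times = c(5, 10, 20))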
Elaboration Models with Symmetric Information Divergence
Asadi M, Devarajan K, Ebrahimi N, Soofi ES and Spirko-Burns L
Various statistical methodologies embed a probability distribution in a more flexible family of distributions. The latter is called a model elaboration, which is constructed by choice or a formal procedure and evaluated by asymmetric measures such as the likelihood ratio and Kullback-Leibler information. The use of asymmetric measures can be problematic for this purpose. This paper introduces two formal procedures, referred to as link functions, that embed any baseline distribution with a continuous density on the real line into model elaborations. Conditions are given for the link functions to render symmetric Kullback-Leibler divergence, Rényi divergence, and the phi-divergence family. The first link function elaborates quantiles of the baseline probability distribution. This approach produces continuous counterparts of the binary probability models. Examples include the Cauchy, probit, logit, Laplace, and Student-t links. The second link function elaborates the baseline survival function. Examples include the proportional odds and change point links. The logistic distribution is characterized as the one that satisfies the conditions for both links. An application demonstrates advantages of symmetric divergence measures for assessing the efficacy of covariates.
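To fix ideas, one standard survival-function elaboration of the kind referred to above is the proportional odds link, which embeds a baseline survival function \(S_0\) in a one-parameter family (shown here only as an illustration of the general construction, not as the paper's full development):

\[
\frac{S_\theta(t)}{1-S_\theta(t)} = \theta\,\frac{S_0(t)}{1-S_0(t)}
\quad\Longleftrightarrow\quad
S_\theta(t) = \frac{\theta\, S_0(t)}{1-S_0(t)+\theta\, S_0(t)}, \qquad \theta>0,
\]

which recovers the baseline model at \(\theta = 1\).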
Global seasonal and pandemic patterns in influenza: An application of longitudinal study designs
Naumova EN, Simpson RB, Zhou B and Hartwick MA
The confluence of growing analytic capacities and global surveillance systems for seasonal infections has created new opportunities to further develop statistical methodology and advance the understanding of global disease dynamics. We developed a framework to characterise the seasonality of infectious diseases for publicly available global health surveillance data. Specifically, we aimed to estimate the seasonal characteristics and their uncertainty using mixed effects models with harmonic components and the δ-method, and to develop multi-panel visualisations to present the complex interplay of seasonal peaks across geographic locations. We compiled a set of 2 422 weekly time series of 14 reported outcomes for 173 Member States from the World Health Organization's (WHO) international influenza virological surveillance system, FluNet, from 02 January 1995 through 20 June 2021. We produced an analecta of data visualisations to describe global travelling waves of influenza while addressing issues of data completeness and credibility. Our results offer directions for further improvements in data collection, reporting, analysis and development of statistical methodology and predictive approaches.
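The amplitude of a seasonal signal and its δ-method standard error can be read off a harmonic regression; the sketch below uses simulated weekly counts rather than FluNet data, a single fixed-effects harmonic rather than the mixed effects models described above, and the same device extends to peak timing.

set.seed(42)
week <- 1:520                                   # ten years of weekly observations
s  <- sin(2 * pi * week / 52)
c2 <- cos(2 * pi * week / 52)
y  <- rpois(length(week), exp(2 + 0.8 * s + 0.6 * c2))   # simulated weekly counts

fit <- glm(y ~ s + c2, family = poisson)
b <- coef(fit)[c("s", "c2")]
V <- vcov(fit)[c("s", "c2"), c("s", "c2")]

# Seasonal amplitude A = sqrt(b_s^2 + b_c^2) and its delta-method standard error
A    <- sqrt(sum(b^2))
grad <- b / A
seA  <- sqrt(drop(t(grad) %*% V %*% grad))
c(amplitude = A, se = seA)

# Peak timing (in weeks) from the phase angle; its uncertainty follows by the same device
peak <- unname(((52 / (2 * pi)) * atan2(b["s"], b["c2"])) %% 52)
peak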
Temporal Models for Demographic and Global Health Outcomes in Multiple Populations: Introducing a New Framework to Review and Standardise Documentation of Model Assumptions and Facilitate Model Comparison
Susmann H, Alexander M and Alkema L
There is growing interest in producing estimates of demographic and global health indicators in populations with limited data. Statistical models are needed to combine data from multiple data sources into estimates and projections with uncertainty. Diverse modelling approaches have been applied to this problem, making comparisons between models difficult. We propose a model class, Temporal Models for Multiple Populations (TMMPs), to facilitate both documentation of model assumptions in a standardised way and comparison across models. The class makes a distinction between the process model, which describes latent trends in the indicator of interest, and the data model, which describes the data generating process of the observed data. We provide a general notation for the process model that encompasses many popular temporal modelling techniques, and we show how existing models for a variety of indicators can be written using this notation. We end with a discussion of outstanding questions and future directions.
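In schematic form, and purely as an illustration of the distinction drawn above (the notation here is generic rather than the paper's own), such a model separates

\[
\text{data model: } y_{i} \mid \eta_{c[i],t[i]} \sim p\!\left(y \mid \eta_{c[i],t[i]},\ \text{source-specific error}\right),
\qquad
\text{process model: } \eta_{c,t} = f\!\left(\mathbf{x}_{c,t};\boldsymbol{\beta}\right) + \delta_{c,t},
\]

where \(\eta_{c,t}\) is the latent level of the indicator in population \(c\) at time \(t\) and \(\delta_{c,t}\) follows a temporal smoothing process (for example, a random walk or spline).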
Synergy of Biostatistics and Epidemiology in Air Pollution Health Effects Studies
Dockery DW
The extraordinary advances in quantifying the health effects of ambient air pollution over the last five decades have led to dramatic improvement in air quality in the United States. This work has been possible through innovative epidemiologic study designs coupled with advanced statistical analytic methods. This paper presents a historical perspective on the coordinated developments of epidemiologic designs and statistical methods for air pollution health effects studies at the Harvard School of Public Health.
A Legacy of EM Algorithms
Lange K and Zhou H
Nan Laird has an enormous and growing impact on computational statistics. Her paper with Dempster and Rubin on the expectation-maximisation (EM) algorithm is the second most cited paper in statistics. Her papers and book on longitudinal modelling are nearly as impressive. In this brief survey, we revisit the derivation of some of her most useful algorithms from the perspective of the minorisation-maximisation (MM) principle. The MM principle generalises the EM principle and frees it from the shackles of missing data and conditional expectations. Instead, the focus shifts to the construction of surrogate functions via standard mathematical inequalities. The MM principle can deliver a classical EM algorithm with less fuss or an entirely new algorithm with a faster rate of convergence. In any case, the MM principle enriches our understanding of the EM principle and suggests new algorithms of considerable potential in high-dimensional settings where standard algorithms such as Newton's method and Fisher scoring falter.
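For maximisation, the MM principle replaces the E-step's conditional expectation with any surrogate \(g(\theta \mid \theta_n)\) that minorises the objective \(f(\theta)\) at the current iterate; the ascent property then follows immediately:

\[
g(\theta \mid \theta_n) \le f(\theta) \ \ \forall\,\theta, \qquad g(\theta_n \mid \theta_n) = f(\theta_n),
\qquad \theta_{n+1} = \arg\max_{\theta}\, g(\theta \mid \theta_n),
\]
\[
\Longrightarrow\quad f(\theta_{n+1}) \ge g(\theta_{n+1} \mid \theta_n) \ge g(\theta_n \mid \theta_n) = f(\theta_n).
\]

In the EM special case, \(g\) is the expected complete-data log-likelihood (up to a constant), and the minorisation follows from Jensen's inequality.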
Path algorithms for fused lasso signal approximator with application to COVID-19 spread in Korea
Son W, Lim J and Yu D
The fused lasso signal approximator (FLSA) is a smoothing procedure for noisy observations that uses a fused lasso penalty on unobserved mean levels to find sparse signal blocks. Several path algorithms have been developed to obtain the whole solution path of the FLSA. However, it is known that the FLSA suffers from model selection inconsistency when the underlying signals contain a stair-case block, where three consecutive signal blocks are either strictly increasing or strictly decreasing. Modified path algorithms for the FLSA have been proposed to guarantee model selection consistency regardless of the presence of stair-case blocks. In this paper, we provide a comprehensive review of the path algorithms for the FLSA and prove properties of the recently modified path algorithms' hitting times. Specifically, we reinterpret the modified path algorithm as a path algorithm for local FLSA problems and reveal the condition under which the hitting time for fusion in the modified path algorithm is not monotone in the tuning parameter. To recover the monotonicity of the solution path, we propose a pathwise adaptive FLSA that retains monotonicity while performing similarly to the modified path algorithm. Finally, we apply the proposed method to daily confirmed COVID-19 cases in Korea to identify change points in its spread.
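For reference, the FLSA estimates the mean levels \(\beta_1,\dots,\beta_n\) of a noisy sequence \(y_1,\dots,y_n\) by solving a convex problem whose second penalty fuses neighbouring coefficients into blocks:

\[
\hat{\boldsymbol{\beta}}(\lambda_1,\lambda_2)
= \arg\min_{\boldsymbol{\beta}}\;
\frac{1}{2}\sum_{i=1}^{n}\left(y_i-\beta_i\right)^2
+ \lambda_1\sum_{i=1}^{n}\left|\beta_i\right|
+ \lambda_2\sum_{i=2}^{n}\left|\beta_i-\beta_{i-1}\right|,
\]

and a path algorithm tracks the solution as the fusion parameter \(\lambda_2\) varies.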
On testing for homogeneity with zero-inflated models through the lens of model misspecification
Hsu WW, Mawella NR and Todem D
In many applications of two-component mixture models such as the popular zero-inflated model for discrete-valued data, it is customary for the data analyst to evaluate the inherent heterogeneity in view of observed data. To this end, the score test, acclaimed for its simplicity, is routinely performed. It has long been recognized that this test may behave erratically under model misspecification, but the implications of this behavior remain poorly understood for popular two-component mixture models. For the special case of zero-inflated count models, we use data simulations and theoretical arguments to evaluate this behavior and discuss its implications in settings where the working model is restrictive with regard to the true data generating mechanism. We enrich this discussion with an analysis of count data in HIV research, where a one-component model is shown to fit the data reasonably well despite apparent extra zeros. These results suggest that a rejection of homogeneity does not imply that the underlying mixture model is appropriate. Rather, such a rejection simply implies that the mixture model should be carefully interpreted in the light of potential model misspecifications, and further evaluated against other competing models.
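The point that apparent excess zeros need not imply a two-component mixture can be seen in a few lines of simulation: a one-component negative binomial fit can reproduce a zero frequency that looks inflated relative to a Poisson fit. The sketch below uses MASS::glm.nb on simulated data, not the HIV data analysed in the paper.

library(MASS)

set.seed(7)
y <- rnbinom(1000, size = 0.5, mu = 2)     # overdispersed counts, no zero-inflation component

mean(y == 0)                               # observed proportion of zeros

# Zeros implied by a one-component Poisson fit (too few, so the zeros look "extra")
dpois(0, lambda = exp(coef(glm(y ~ 1, family = poisson))))

# Zeros implied by a one-component negative binomial fit (close to the observed proportion)
nb <- glm.nb(y ~ 1)
dnbinom(0, size = nb$theta, mu = exp(coef(nb)))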
A Review of Spatial Causal Inference Methods for Environmental and Epidemiological Applications
Reich BJ, Yang S, Guan Y, Giffin AB, Miller MJ and Rappold A
The scientific rigor and computational methods of causal inference have had great impacts on many disciplines but have only recently begun to take hold in spatial applications. Spatial causal inference poses analytic challenges due to complex correlation structures and interference between the treatment at one location and the outcomes at others. In this paper, we review the current literature on spatial causal inference and identify areas of future work. We first discuss methods that exploit spatial structure to account for unmeasured confounding variables. We then discuss causal analysis in the presence of spatial interference including several common assumptions used to reduce the complexity of the interference patterns under consideration. These methods are extended to the spatiotemporal case where we compare and contrast the potential outcomes framework with Granger causality and to geostatistical analyses involving spatial random fields of treatments and responses. The methods are introduced in the context of observational environmental and epidemiological studies and are compared using both a simulation study and an analysis of the effect of ambient air pollution on the COVID-19 mortality rate. Code to implement many of the methods using the popular Bayesian software OpenBUGS is provided.
Double Empirical Bayes Testing
Tansey W, Wang Y, Rabadan R and Blei DM
Analyzing data from large-scale, multi-experiment studies requires scientists to both analyze each experiment and to assess the results as a whole. In this article, we develop double empirical Bayes testing (DEBT), an empirical Bayes method for analyzing multi-experiment studies when many covariates are gathered per experiment. DEBT is a two-stage method: in the first stage, it reports which experiments yielded significant outcomes; in the second stage, it hypothesizes which covariates drive the experimental significance. In both of its stages, DEBT builds on Efron (2008), which lays out an elegant empirical Bayes approach to testing. DEBT enhances this framework by learning a series of black box predictive models to boost power and control the false discovery rate (FDR). In Stage 1, it uses a deep-neural-network prior to report which experiments yielded significant outcomes. In Stage 2, it uses an empirical Bayes version of the knockoff filter (Candès et al., 2018) to select covariates that have significant predictive power for Stage-1 significance. In both simulated and real data, DEBT increases the proportion of discovered significant outcomes and selects more features when signals are weak. In a real study of cancer cell lines, DEBT selects a robust set of biologically-plausible genomic drivers of drug sensitivity and resistance in cancer.
Reluctant Generalised Additive Modelling
Tay JK and Tibshirani R
Sparse generalised additive models (GAMs) are an extension of sparse generalised linear models that allow a model's prediction to vary non-linearly with an input variable. This enables the data analyst to build more accurate models, especially when the linearity assumption is known to be a poor approximation of reality. Motivated by reluctant interaction modelling, we propose a multi-stage algorithm, called reluctant generalised additive modelling (RGAM), that can fit sparse GAMs at scale. It is guided by the principle that, if all else is equal, one should prefer a linear feature over a non-linear feature. Unlike existing methods for sparse GAMs, RGAM can be extended easily to binary, count and survival data. We demonstrate the method's effectiveness on real and simulated examples.
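A rough schematic of the reluctant principle (not the authors' RGAM algorithm itself, and with illustrative data and variable names): fit the linear terms first, then offer non-linear features only for whatever signal the linear fit leaves behind.

library(glmnet)

set.seed(3)
n <- 300; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + sin(2 * x[, 2]) + rnorm(n)

# Stage 1: sparse fit on linear features only
fit1 <- cv.glmnet(x, y)
r    <- as.numeric(y - predict(fit1, x, s = "lambda.min"))

# Stage 2: build non-linear candidate features from the residuals (smoothing splines per column)
f <- sapply(seq_len(p), function(j) predict(smooth.spline(x[, j], r), x[, j])$y)

# Stage 3: sparse fit on the linear and non-linear features together
fit2 <- cv.glmnet(cbind(x, f), y)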
Small Area Estimation for Disease Prevalence Mapping
Wakefield J, Okonek T and Pedersen J
Small area estimation (SAE) entails estimating characteristics of interest for domains, often geographical areas, in which there may be few or no samples available. SAE has a long history and a wide variety of methods have been suggested, from a bewildering range of philosophical standpoints. We describe design-based and model-based approaches and models that are specified at the area-level and at the unit-level, focusing on health applications and fully Bayesian spatial models. The use of auxiliary information is a key ingredient for successful inference when response data are sparse and we discuss a number of approaches that allow the inclusion of covariate data. SAE for HIV prevalence, using data collected from a Demographic Health Survey in Malawi in 2015-2016, is used to illustrate a number of techniques. The potential use of SAE techniques for outcomes related to COVID-19 is discussed.
Statistical Implementations of Agent-Based Demographic Models
Hooten M, Wikle C and Schwob M
A variety of demographic statistical models exist for studying population dynamics when individuals can be tracked over time. In cases where data are missing due to imperfect detection of individuals, the associated measurement error can be accommodated under certain study designs (e.g. those that involve multiple surveys or replication). However, the interaction of the measurement error and the underlying dynamic process can complicate the implementation of statistical agent-based models (ABMs) for population demography. In a Bayesian setting, traditional computational algorithms for fitting hierarchical demographic models can be prohibitively cumbersome to construct. Thus, we discuss a variety of approaches for fitting statistical ABMs to data and demonstrate how to use multi-stage recursive Bayesian computing and statistical emulators to fit models in such a way that alleviates the need to have analytical knowledge of the ABM likelihood. Using two examples, a demographic model for survival and a compartment model for COVID-19, we illustrate statistical procedures for implementing ABMs. The approaches we describe are intuitive and accessible for practitioners and can be parallelised easily for additional computational efficiency.
A Review of Multi-Compartment Infectious Disease Models
Tang L, Zhou Y, Wang L, Purkayastha S, Zhang L, He J, Wang F and Song PX
Multi-compartment models have played a central role in modelling infectious disease dynamics since the early 20th century. They are a class of mathematical models widely used to describe the mechanism of an evolving epidemic. Integrated with certain sampling schemes, such mechanistic models can be applied to analyse public health surveillance data, for example to assess the effectiveness of preventive measures (e.g. social distancing and quarantine) and to forecast disease spread patterns. This review begins with a nationwide macromechanistic model and related statistical analyses, including model specification, estimation, inference and prediction. Then, it presents a community-level micromodel that enables high-resolution analyses of regional surveillance data to provide current and future risk information useful for local governments and residents making decisions on reopening local businesses and on personal travel. R software and scripts are provided whenever appropriate to illustrate the numerical details of the algorithms and calculations. The coronavirus disease 2019 pandemic surveillance data from the state of Michigan are used for illustration throughout this paper.
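The building block of such multi-compartment models is a small system of ordinary differential equations; for example, the basic SIR model can be integrated in R with the deSolve package (the parameter values below are arbitrary and chosen only for illustration).

library(deSolve)

sir <- function(t, state, parms) {
  with(as.list(c(state, parms)), {
    dS <- -beta * S * I / N          # susceptibles infected at rate beta * S * I / N
    dI <-  beta * S * I / N - gamma * I
    dR <-  gamma * I                 # infecteds recover at rate gamma
    list(c(dS, dI, dR))
  })
}

parms <- c(beta = 0.3, gamma = 0.1, N = 1e6)     # transmission and recovery rates
state <- c(S = 1e6 - 10, I = 10, R = 0)          # initial compartment sizes
out   <- ode(y = state, times = seq(0, 200, by = 1), func = sir, parms = parms)
head(out)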
Review and Comparison of Computational Approaches for Joint Longitudinal and Time-to-Event Models
Furgal AKC, Sen A and Taylor JMG
Joint models for longitudinal and time-to-event data are useful in situations where an association exists between a longitudinal marker and an event time. These models are typically complicated due to the presence of shared random effects and multiple submodels. As a consequence, software implementation is warranted that is not prohibitively time consuming. While methodological research in this area continues, several statistical software procedures exist to assist in the fitting of some joint models. We review the available implementation for frequentist and Bayesian models in the statistical programming languages R, SAS, and Stata. A description of each procedure is given including estimation techniques, input and data requirements, available options for customization, and some available extensions, such as competing risks models. The software implementations are compared and contrasted through extensive simulation, highlighting their strengths and weaknesses. Data from an ongoing trial on adrenal cancer patients are used to study different nuances of software fitting in a practical example.
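The models being compared share a common structure: a longitudinal submodel and a hazard submodel linked through subject-level random effects and an association parameter. In generic notation (one common specification among those implemented in the reviewed software):

\[
y_i(t) = m_i(t) + \varepsilon_i(t), \qquad
m_i(t) = \mathbf{x}_i(t)^{\top}\boldsymbol{\beta} + \mathbf{z}_i(t)^{\top}\mathbf{b}_i, \qquad
\mathbf{b}_i \sim N(\mathbf{0},\mathbf{D}),
\]
\[
h_i(t) = h_0(t)\exp\!\left\{\mathbf{w}_i^{\top}\boldsymbol{\gamma} + \alpha\, m_i(t)\right\},
\]

where \(\alpha\) measures the association between the current value of the longitudinal marker and the hazard.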
Semiparametric Regression Analysis of Panel Count Data: A Practical Review
Chiou SH, Huang CY, Xu G and Yan J
Panel count data arise in many applications when the event history of a recurrent event process is only examined at a sequence of discrete time points. In spite of the recent methodological developments, the availability of their software implementations has been rather limited. Focusing on a practical setting where the effects of some time-independent covariates on the recurrent events are of primary interest, we review semiparametric regression modelling approaches for panel count data that have been implemented in R package spef. The methods are grouped into two categories depending on whether the examination times are associated with the recurrent event process after conditioning on covariates. The reviewed methods are illustrated with a subset of the data from a skin cancer clinical trial.
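A representative target in this setting is the proportional means model for the cumulative count of recurrent events, which is a common estimand of the semiparametric approaches reviewed:

\[
E\!\left[N_i(t)\mid \mathbf{X}_i\right] = \Lambda_0(t)\,\exp\!\left(\mathbf{X}_i^{\top}\boldsymbol{\beta}\right),
\]

where \(N_i(t)\) is the number of events by time \(t\), \(\Lambda_0(\cdot)\) is an unspecified baseline mean function and \(\mathbf{X}_i\) are time-independent covariates; the methods differ chiefly in how they treat the examination-time process.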
Confidence Intervals for the Area Under the Receiver Operating Characteristic Curve in the Presence of Ignorable Missing Data
Cho H, Matthews GJ and Harel O
Receiver operating characteristic curves are widely used as a measure of accuracy of diagnostic tests and can be summarised using the area under the receiver operating characteristic curve (AUC). Often, it is useful to construct a confidence interval for the AUC; however, because there are a number of different proposed methods for estimating the variance of the AUC, there are many different resulting methods for constructing these intervals. In this article, we compare different methods of constructing Wald-type confidence intervals in the presence of missing data where the missingness mechanism is ignorable. We find that constructing confidence intervals using multiple imputation based on logistic regression gives the most robust coverage probability, and that the choice of confidence interval method is then less important. However, when the missingness rate is less severe (e.g. less than 70%), we recommend using Newcombe's Wald method for constructing confidence intervals, along with multiple imputation using predictive mean matching.
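A minimal sketch of the multiple-imputation workflow discussed above: impute with predictive mean matching, compute the AUC and a within-imputation variance in each completed data set (here via the Hanley-McNeil formula, used only for illustration rather than as any particular method from the paper), and pool with Rubin's rules. The simulated data and missingness pattern are illustrative.

library(mice)
library(pROC)

set.seed(11)
n <- 400
d <- data.frame(marker = rnorm(n))
d$disease <- rbinom(n, 1, plogis(1.2 * d$marker))
d$marker[rbinom(n, 1, 0.3) == 1] <- NA            # make 30% of marker values missing

m   <- 10
imp <- mice(d, m = m, method = "pmm", printFlag = FALSE)

auc_one <- function(dat) {
  a  <- as.numeric(auc(roc(dat$disease, dat$marker, quiet = TRUE)))
  n1 <- sum(dat$disease == 1); n0 <- sum(dat$disease == 0)
  q1 <- a / (2 - a); q2 <- 2 * a^2 / (1 + a)
  v  <- (a * (1 - a) + (n1 - 1) * (q1 - a^2) + (n0 - 1) * (q2 - a^2)) / (n1 * n0)
  c(est = a, var = v)                              # Hanley-McNeil variance of the AUC
}

res <- sapply(seq_len(m), function(i) auc_one(complete(imp, i)))

# Rubin's rules: pooled estimate, within- and between-imputation variance
qbar  <- mean(res["est", ]); ubar <- mean(res["var", ]); b <- var(res["est", ])
total <- ubar + (1 + 1 / m) * b
c(pooled_auc = qbar, se = sqrt(total))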
Geostatistical Methods for Disease Mapping and Visualisation Using Data from Spatio-temporally Referenced Prevalence Surveys
Giorgi E, Diggle PJ, Snow RW and Noor AM
In this paper, we set out general principles and develop geostatistical methods for the analysis of data from spatio-temporally referenced prevalence surveys. Our objective is to provide a tutorial guide that can be used in order to identify parsimonious geostatistical models for prevalence mapping. A general variogram-based Monte Carlo procedure is proposed to check the validity of the modelling assumptions. We describe and contrast likelihood-based and Bayesian methods of inference, showing how to account for parameter uncertainty under each of the two paradigms. We also describe extensions of the standard model for disease prevalence that can be used when stationarity of the spatio-temporal covariance function is not supported by the data. We discuss how to define predictive targets and argue that exceedance probabilities provide one of the most effective ways to convey uncertainty in prevalence estimates. We describe statistical software for the visualisation of spatio-temporal predictive summaries of prevalence through interactive animations. Finally, we illustrate an application to historical malaria prevalence data from 1 334 surveys conducted in Senegal between 1905 and 2014.
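The core model underlying this class of analyses is the binomial geostatistical model, in which prevalence varies over space and time through a latent Gaussian process:

\[
Y_i \mid S(\cdot),\,u_i \sim \mathrm{Binomial}\!\left(n_i,\, p(x_i, t_i)\right), \qquad
\log\frac{p(x,t)}{1-p(x,t)} = \mathbf{d}(x,t)^{\top}\boldsymbol{\beta} + S(x,t) + u_i,
\]

where \(Y_i\) is the number of positives out of \(n_i\) individuals sampled at location \(x_i\) and time \(t_i\), \(\mathbf{d}(x,t)\) are covariates, \(S(x,t)\) is a stationary spatio-temporal Gaussian process and the \(u_i\) are unstructured site-level effects.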
Towards a Routine External Evaluation Protocol for Small Area Estimation
Dorfman AH
Statistical criteria are needed by which to evaluate the potential success or failure of applications of small area estimation. A necessary step to achieve this is a protocol, a series of steps, by which to assess whether an instance of small area estimation has given satisfactory results or not. Most customary attempts at evaluation of small area techniques have deficiencies. Often, evaluation is not attempted. Every small area study requires an external evaluation. With proper planning, this can be routinely achieved, although at some cost, amounting to some sacrifice of efficiency of global estimates. We propose a Routine External Evaluation Protocol to allow us to judge whether, in a given survey, small area estimation has led to accurate results and sound inference.
Optimal Adaptive Designs with Inverse Ordinary Differential Equations
Demidenko E
Many industrial and engineering applications are built on the basis of differential equations. In some cases, parameters of these equations are not known and are estimated from measurements, leading to an inverse problem. Unlike many other papers, we suggest constructing new designs in an adaptive fashion, 'on the go', using the A-optimality criterion. This approach is demonstrated by determining optimal locations of measurements and temperature sensors in several engineering applications: (1) determination of the optimal location at which to measure the height of a hanging wire in order to estimate the sagging parameter with minimum variance (a toy example), (2) adaptive determination of optimal locations of temperature sensors in a one-dimensional inverse heat transfer problem and (3) adaptive design in the framework of a one-dimensional diffusion problem when the solution is found numerically using the finite difference approach. In all these problems, statistical criteria for parameter identification and optimal design of experiments are applied. Statistical simulations confirm that estimates derived from the adaptive optimal design converge to the true parameter values with minimum sum of variances as the number of measurements increases. We deliberately chose technically uncomplicated industrial problems in order to introduce the principal ideas of statistical adaptive design transparently.
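In this setting, the A-optimality criterion chooses the next measurement location to minimise the trace of the (approximate) covariance of the parameter estimators, evaluated at the current estimate. Schematically, with \(M(\xi,\theta)\) the Fisher information of design \(\xi\) and \(\hat{\theta}_n\) the estimate after \(n\) measurements,

\[
x_{n+1} = \arg\min_{x}\; \operatorname{tr}\!\left\{ M\!\left(\xi_n \cup \{x\},\, \hat{\theta}_n\right)^{-1} \right\},
\]

so that each new sensor or measurement point is placed where it most reduces the summed variances of the parameter estimators.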