Robust causal inference for point exposures with missing confounders
Large observational databases are often subject to missing data. As such, methods for causal inference must simultaneously handle confounding and missingness; surprisingly little work has been done at this intersection. Motivated by this, we propose an efficient and robust estimator of the causal average treatment effect from cohort studies when confounders are missing at random. The approach is based on a novel factorization of the likelihood that, unlike alternative methods, facilitates flexible modelling of nuisance functions (e.g., with state-of-the-art machine learning methods) while maintaining nominal convergence rates of the final estimators. Simulated data, derived from an electronic health record-based study of the long-term effects of bariatric surgery on weight outcomes, verify the robustness properties of the proposed estimators in finite samples. Our approach may serve as a theoretical benchmark against which ad hoc methods may be assessed.
Variable selection in modelling clustered data via within-cluster resampling
In many biomedical applications, there is a need to build risk-adjustment models based on clustered data. However, methods for variable selection that are applicable to clustered discrete data settings with a large number of candidate variables and potentially large cluster sizes are lacking. We develop a new variable selection approach that combines within-cluster resampling techniques with penalized likelihood methods to select variables for high-dimensional clustered data. We derive an upper bound on the expected number of falsely selected variables, demonstrate the oracle properties of the proposed method, and evaluate the finite sample performance of the method through extensive simulations. We illustrate the proposed approach using a colon surgical site infection data set consisting of 39,468 individuals from 149 hospitals to build risk-adjustment models that account for both the main effects of various risk factors and their two-way interactions.
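The core mechanics of within-cluster resampling can be sketched as follows: repeatedly draw one observation per cluster (so each resampled data set is independent across rows), run a penalized fit, and aggregate across resamples. The sketch below is illustrative only; the aggregation by selection frequency, the coordinate-descent lasso solver, and all parameter values are our assumptions, not the paper's exact procedure.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso for linear regression via cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual for coordinate j
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return beta

rng = np.random.default_rng(0)
n_clusters, m, p = 100, 20, 30
cluster_effect = rng.normal(0.0, 1.0, n_clusters)        # induces within-cluster correlation
X = rng.normal(size=(n_clusters, m, p))
beta_true = np.zeros(p)
beta_true[:3] = 1.5
y = X @ beta_true + cluster_effect[:, None] + rng.normal(size=(n_clusters, m))

# Within-cluster resampling: one unit per cluster per replicate, so each
# resampled data set has independent rows; fit the penalized model, repeat.
B = 50
freq = np.zeros(p)
rows = np.arange(n_clusters)
for _ in range(B):
    pick = rng.integers(m, size=n_clusters)
    freq += np.abs(lasso_cd(X[rows, pick], y[rows, pick], lam=0.3)) > 1e-8
selected = np.flatnonzero(freq / B >= 0.5)   # keep variables selected in most resamples
```

With this configuration the three signal variables are selected in essentially every resample, while noise variables rarely cross the frequency threshold.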
Debiased lasso after sample splitting for estimation and inference in high-dimensional generalized linear models
We consider random sample splitting for estimation and inference in high dimensional generalized linear models, where we first apply the lasso to select a submodel using one subsample and then apply the debiased lasso to fit the selected model using the remaining subsample. We show that a sample splitting procedure based on the debiased lasso yields asymptotically normal estimates under mild conditions and that multiple splitting can address the loss of efficiency. Our simulation results indicate that using the debiased lasso instead of the standard maximum likelihood method in the estimation stage can vastly reduce the bias and variance of the resulting estimates. Furthermore, our multiple splitting debiased lasso method has better numerical performance than some existing methods for high dimensional generalized linear models proposed in the recent literature. We illustrate the proposed multiple splitting method with an analysis of the smoking data of the Mid-South Tobacco Case-Control Study.
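The two-stage splitting workflow can be sketched in a linear model for brevity. The ordinary-least-squares refit below stands in for the debiased-lasso second stage described above, and the coordinate-descent lasso solver and all tuning values are illustrative assumptions.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso for linear regression via cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual for coordinate j
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return beta

rng = np.random.default_rng(1)
n, p = 400, 60
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = 2.0
y = X @ beta_true + rng.normal(size=n)

idx = rng.permutation(n)
half_a, half_b = idx[: n // 2], idx[n // 2 :]

# Stage 1: lasso on subsample A selects a submodel.
sel = np.flatnonzero(np.abs(lasso_cd(X[half_a], y[half_a], lam=0.3)) > 1e-8)

# Stage 2: low-dimensional refit on the held-out subsample B
# (OLS here stands in for the debiased lasso of the paper).
beta_refit, *_ = np.linalg.lstsq(X[half_b][:, sel], y[half_b], rcond=None)
```

Because selection and refitting use disjoint subsamples, the second-stage estimates avoid the post-selection bias of refitting on the full data.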
Robust Estimation of Loss-Based Measures of Model Performance under Covariate Shift
We present methods for estimating loss-based measures of the performance of a prediction model in a target population that differs from the source population in which the model was developed, in settings where outcome and covariate data are available from the source population but only covariate data are available on a simple random sample from the target population. Prior work adjusting for differences between the two populations has used various weighting estimators with inverse odds or density ratio weights. Here, we develop more robust estimators for the target population risk (expected loss) that can be used with data-adaptive (e.g., machine learning-based) estimation of nuisance parameters. We examine the large-sample properties of the estimators and evaluate finite sample performance in simulations. Last, we apply the methods to data from lung cancer screening using nationally representative data from the National Health and Nutrition Examination Survey (NHANES) and extend our methods to account for the complex survey design of the NHANES.
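The weighting idea these estimators build on can be illustrated with the target/source density ratio known in closed form. The Gaussian source and target populations and the fixed prediction model m(x) = 0.8x below are hypothetical; in practice the ratio would be estimated from the target covariate sample (e.g., via inverse odds weights), as in the prior work referenced above.

```python
import numpy as np

rng = np.random.default_rng(2)
n_source = 5000
x_s = rng.normal(0.0, 1.0, n_source)        # source covariates: N(0, 1)
y_s = x_s + rng.normal(0.0, 0.5, n_source)  # outcomes available in the source only

def density_ratio(x):
    """Target/source density ratio: N(1,1) over N(0,1) = exp(x - 1/2).
    Known in closed form here for illustration; normally estimated."""
    return np.exp(x - 0.5)

loss = (y_s - 0.8 * x_s) ** 2               # squared-error loss of the fixed model m(x) = 0.8x
w = density_ratio(x_s)
target_risk = np.sum(w * loss) / np.sum(w)  # Hajek-type weighted estimator of target risk
source_risk = loss.mean()                   # naive source-population risk
```

Under these populations the true source risk is 0.29 while the true target risk is 0.33, so the unweighted average understates the model's loss in the target population and the weighted estimator corrects it.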
High-dimensional variable selection accounting for heterogeneity in regression coefficients across multiple data sources
When analyzing data combined from multiple sources (e.g., hospitals, studies), the heterogeneity across different sources must be accounted for. In this paper, we consider high-dimensional linear regression models for integrative data analysis. We propose a new adaptive clustering penalty (ACP) method to simultaneously select variables and cluster source-specific regression coefficients with sub-homogeneity. We show that the estimator based on the ACP method enjoys a strong oracle property under certain regularity conditions. We also develop an efficient algorithm based on the alternating direction method of multipliers (ADMM) for parameter estimation. We conduct simulation studies to compare the performance of the proposed method to three existing methods (a fused LASSO with adjacent fusion, a pairwise fused LASSO, and a multi-directional shrinkage penalty method). Finally, we apply the proposed method to the multi-center Childhood Adenotonsillectomy Trial to identify sub-homogeneity in the treatment effects across different study sites.
Smoothed model-assisted small area estimation of proportions
In countries where population census data are limited, generating accurate subnational estimates of health and demographic indicators is challenging. Existing model-based geostatistical methods leverage covariate information and spatial smoothing to reduce the variability of estimates but often ignore the survey design, while traditional small area estimation approaches may not incorporate both unit-level covariate information and spatial smoothing in a design consistent way. We propose a smoothed model-assisted estimator that accounts for survey design and leverages both unit-level covariates and spatial smoothing. Under certain regularity assumptions, this estimator is both design consistent and model consistent. We compare it with existing design-based and model-based estimators using real and simulated data.
Optimal multiwave validation of secondary use data with outcome and exposure misclassification
Observational databases provide unprecedented opportunities for secondary use in biomedical research. However, these data can be error-prone and must be validated before use. It is usually unrealistic to validate the whole database because of resource constraints. A cost-effective alternative is a two-phase design that validates a subset of records enriched for information about a particular research question. We consider odds ratio estimation under differential outcome and exposure misclassification and propose optimal designs that minimize the variance of the maximum likelihood estimator. Our adaptive grid search algorithm can locate the optimal design in a computationally feasible manner. Because the optimal design relies on unknown parameters, we introduce a multiwave strategy to approximate the optimal design. We demonstrate the proposed design's efficiency gains through simulations and two large observational studies.
Oscillating neural circuits: Phase, amplitude, and the complex normal distribution
Multiple oscillating time series are typically analyzed in the frequency domain, where coherence is usually said to represent the magnitude of the correlation between two signals at a particular frequency. The correlation being referenced is complex-valued and is similar to the real-valued Pearson correlation in some ways but not others. We discuss the dependence among oscillating series in the context of the multivariate complex normal distribution, which plays a role for vectors of complex random variables analogous to the usual multivariate normal distribution for vectors of real-valued random variables. We emphasize special cases that are valuable for the neural data we are interested in and provide new variations on existing results. We then introduce a complex latent variable model for narrowly band-pass-filtered signals at some frequency, and show that the resulting maximum likelihood estimate produces a latent coherence that is equivalent to the magnitude of the complex canonical correlation at the given frequency. We also derive an equivalence between partial coherence and the magnitude of complex partial correlation, at a given frequency. Our theoretical framework leads to interpretable results for an interesting multivariate dataset from the Allen Institute for Brain Science.
Efficient multiple change point detection for high-dimensional generalized linear models
Change point detection for high-dimensional data is an important yet challenging problem for many applications. In this paper, we consider multiple change point detection in the context of high-dimensional generalized linear models, allowing the covariate dimension to grow exponentially with the sample size. The model considered is general and flexible in the sense that it covers various specific models as special cases. It can automatically account for the underlying data generation mechanism without specifying any prior knowledge about the number of change points. Based on dynamic programming and binary segmentation techniques, two algorithms are proposed to detect multiple change points, allowing the number of change points to grow with the sample size. To further improve the computational efficiency, a more efficient algorithm designed for the case of a single change point is proposed. We present theoretical properties of our proposed algorithms, including estimation consistency for the number and locations of change points as well as consistency and asymptotic distributions for the underlying regression coefficients. Finally, extensive simulation studies and application to the Alzheimer's Disease Neuroimaging Initiative data further demonstrate the competitive performance of our proposed methods.
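As a point of reference, the classical scalar building block behind binary segmentation is the CUSUM statistic for a single change in mean. The sketch below is this textbook version on simulated data, not the paper's high-dimensional GLM algorithm.

```python
import numpy as np

def cusum_changepoint(y):
    """Single change-in-mean candidate: the split maximizing the CUSUM statistic."""
    n = len(y)
    stats = np.empty(n - 1)
    for t in range(1, n):
        # Scaled difference of segment means for a split after t observations
        stats[t - 1] = np.sqrt(t * (n - t) / n) * abs(y[:t].mean() - y[t:].mean())
    return int(np.argmax(stats)) + 1   # size of the first segment

rng = np.random.default_rng(6)
# Mean shifts from 0 to 2 after observation 100
y = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(2.0, 1.0, 80)])
tau_hat = cusum_changepoint(y)
```

Binary segmentation applies this statistic recursively to the sub-segments on either side of each detected split until no split exceeds a threshold.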
Integrating Information from Existing Risk Prediction Models with No Model Details
Consider the setting where (i) individual-level data are collected to build a regression model for the association between an event of interest and certain covariates, and (ii) some risk calculators predicting the risk of the event using less detailed covariates are available, possibly as algorithmic black boxes with little information available about how they were built. We propose a general empirical-likelihood-based framework to integrate the rich auxiliary information contained in the calculators into fitting the regression model, to make the estimation of regression parameters more efficient. Two methods are developed, one using working models to extract the calculator information and one making a direct use of calculator predictions without working models. Theoretical and numerical investigations show that the calculator information can substantially reduce the variance of regression parameter estimation. As an application, we study the dependence of the risk of high grade prostate cancer on both conventional risk factors and newly identified molecular biomarkers by integrating information from the Prostate Biopsy Collaborative Group (PBCG) risk calculator, which was built based on conventional risk factors alone.
Estimation of conditional cumulative incidence functions under generalized semiparametric regression models with missing covariates, with application to analysis of biomarker correlates in vaccine trials
This article studies generalized semiparametric regression models for conditional cumulative incidence functions with competing risks data when covariates are missing by sampling design or happenstance. A doubly-robust augmented inverse probability weighted complete-case (AIPW) approach to estimation and inference is investigated. This approach modifies IPW complete-case estimating equations by exploiting the key features in the relationship between the missing covariates and the phase-one data to improve efficiency. An iterative numerical procedure is derived to solve the nonlinear estimating equations. The asymptotic properties of the proposed estimators are established. A simulation study examining the finite-sample performances of the proposed estimators shows that the AIPW estimators are more efficient than the IPW estimators. The developed method is applied to the RV144 HIV-1 vaccine efficacy trial to investigate vaccine-induced IgG binding antibodies to HIV-1 as correlates of acquisition of HIV-1 infection while taking account of whether the HIV-1 sequences are near or far from the HIV-1 sequences represented in the vaccine construct.
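The augmentation idea behind AIPW estimation can be sketched in the simplest setting of a population mean with covariate-dependent missingness and a known observation probability. The linear outcome model and all parameter values below are illustrative; the paper's estimating equations for conditional cumulative incidence functions are substantially more involved.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
x = rng.normal(size=n)
y = x + rng.normal(0.0, 1.0, n)       # E[Y] = 0
pi = 1.0 / (1.0 + np.exp(-x))         # known observation probability, depends on x
r = rng.binomial(1, pi)               # observation indicator (1 = Y observed)

# Outcome model fitted on complete cases (correctly specified as linear here)
cc = r == 1
slope, intercept = np.polyfit(x[cc], y[cc], 1)
m_hat = intercept + slope * x

# Complete-case mean is biased because missingness depends on x;
# AIPW augments the inverse-probability-weighted term with the outcome model.
cc_mean = y[cc].mean()
aipw = np.mean(r * y / pi - (r - pi) / pi * m_hat)
```

The complete-case mean here converges to roughly 0.42 rather than 0, while the AIPW estimator is consistent for 0 if either the observation probability or the outcome model is correct (here both are).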
Extended Bayesian endemic-epidemic models to incorporate mobility data into COVID-19 forecasting
Forecasting the number of daily COVID-19 cases is critical in the short-term planning of hospital and other public resources. One potentially important piece of information for forecasting COVID-19 cases is mobile device location data that measure the amount of time an individual spends at home. Endemic-epidemic (EE) time series models are recently proposed autoregressive models where the current mean case count is modelled as a weighted average of past case counts multiplied by an autoregressive rate, plus an endemic component. We extend EE models to include a distributed-lag model in order to investigate the association between mobility and the number of reported COVID-19 cases; we additionally include a weekly first-order random walk to capture additional temporal variation. Further, we introduce a shifted negative binomial weighting scheme for the past counts that is more flexible than previously proposed weighting schemes. We perform inference under a Bayesian framework to incorporate parameter uncertainty into model forecasts. We illustrate our methods using data from four US counties.
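A minimal sketch of the conditional-mean structure, under an illustrative parameterization (the names `lam` and `nu` and all numeric values are ours, not the paper's): the shifted negative binomial weighting places a pmf value at each positive lag, and the EE mean adds an endemic term to a weighted autoregression on past counts.

```python
import numpy as np
from math import comb

def shifted_nb_weights(r, p, max_lag):
    """Normalized lag weights from a negative binomial pmf shifted by one,
    so lag 1 receives the pmf at 0 (no weight falls on lag 0)."""
    pmf = np.array([comb(d + r - 1, d) * p**r * (1.0 - p) ** d
                    for d in range(max_lag)])
    return pmf / pmf.sum()

def ee_mean(past_counts, lam, nu, weights):
    """Endemic-epidemic conditional mean: endemic part nu plus the
    autoregressive rate lam times a weighted sum of past counts
    (most recent count first)."""
    return nu + lam * float(np.dot(weights, past_counts[: len(weights)]))

w = shifted_nb_weights(r=2, p=0.4, max_lag=7)
mu = ee_mean(np.array([120.0, 100.0, 90.0, 80.0, 85.0, 70.0, 60.0]),
             lam=0.9, nu=5.0, weights=w)
```

Varying `r` and `p` changes how quickly the influence of older counts decays, which is the flexibility the shifted scheme offers over a fixed geometric weighting.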
Estimation of SARS-CoV-2 antibody prevalence through serological uncertainty and daily incidence
Serology tests for SARS-CoV-2 provide a means of estimating the number of individuals who have had an infection in the past (including cases that are not detected by routine testing, which has varied over the course of the pandemic and between jurisdictions). Such estimation is challenging when only limited serological data are available and when the uncertainty of the serology test is not taken into account. In this work, we provide a joint Bayesian model that improves the estimation of the sero-prevalence (the proportion of the population with SARS-CoV-2 antibodies) by integrating multiple sources of data, priors on the sensitivity and specificity of the serological test, and an epidemiological dynamics model. We apply our model to the Greater Vancouver area, British Columbia, Canada, with data acquired during the pandemic from the end of January to May 2020. Our estimated sero-prevalence is consistent with previous literature but has a tighter credible interval.
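The frequentist point correction underlying such models is the classical Rogan-Gladen adjustment, which inverts the relationship between apparent and true prevalence given assay sensitivity and specificity. The joint Bayesian model described above additionally propagates uncertainty in these quantities; the numbers below are illustrative.

```python
def adjusted_prevalence(apparent, sensitivity, specificity):
    """Rogan-Gladen correction: invert
    apparent = sensitivity * pi + (1 - specificity) * (1 - pi)
    for the true prevalence pi."""
    return (apparent + specificity - 1.0) / (sensitivity + specificity - 1.0)

# e.g., 2% of samples test positive on a 90%-sensitive, 99%-specific assay
pi_hat = adjusted_prevalence(0.02, 0.90, 0.99)
```

Note that when true prevalence is low, even a 1% false-positive rate accounts for half of the apparent positives, which is why test uncertainty matters so much in this setting.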
Estimating design operating characteristics in Bayesian adaptive clinical trials
Bayesian adaptive designs have gained popularity in all phases of clinical trials with numerous new developments in the past few decades. During the COVID-19 pandemic, the need to establish evidence for the effectiveness of vaccines, therapeutic treatments, and policies that could resolve or control the crisis emphasized the advantages offered by efficient and flexible clinical trial designs. In many COVID-19 clinical trials, because of the high level of uncertainty, Bayesian adaptive designs were considered advantageous. Designing Bayesian adaptive trials, however, requires extensive simulation studies that are generally considered challenging, particularly in time-sensitive settings such as a pandemic. In this article, we propose a set of methods for efficient estimation and uncertainty quantification for design operating characteristics of Bayesian adaptive trials. Specifically, we model the sampling distribution of Bayesian probability statements that are commonly used as the basis of decision making. To showcase the implementation and performance of the proposed approach, we use a clinical trial design with an ordinal disease-progression scale endpoint that was popular among COVID-19 trials. However, the proposed methodology may be applied generally in the clinical trial context where design operating characteristics cannot be obtained analytically.
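The brute-force baseline that such methods aim to improve on can be sketched directly: simulate many trials under a fixed truth, compute the Bayesian decision quantity for each, and report the rejection fraction. The beta-binomial design, thresholds, and sample sizes below are illustrative assumptions, not a design from the article.

```python
import numpy as np

rng = np.random.default_rng(7)

def posterior_prob_superior(successes, n, threshold=0.5, draws=4000):
    """P(response rate > threshold | data) under a Beta(1, 1) prior,
    approximated by Monte Carlo draws from the Beta posterior."""
    samples = rng.beta(1 + successes, 1 + n - successes, size=draws)
    return float(np.mean(samples > threshold))

# Operating characteristic: type I error of the rule
# "declare success if posterior probability exceeds 0.95",
# estimated by simulating trials under the null response rate 0.5.
n_sims, n_pat = 2000, 100
rejections = 0
for _ in range(n_sims):
    s = rng.binomial(n_pat, 0.5)
    rejections += posterior_prob_superior(s, n_pat) > 0.95
type1_error = rejections / n_sims
```

Each operating characteristic requires its own large simulation like this one, which is exactly the computational burden that motivates modelling the sampling distribution of the posterior probability statements instead.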
Characterizing the COVID-19 dynamics with a new epidemic model: Susceptible-exposed-asymptomatic-symptomatic-active-removed
The coronavirus disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has spread stealthily and presented a tremendous threat to the public. It is important to investigate the transmission dynamics of COVID-19 to help understand the impact of the disease on public health and the economy. In this article, we develop a new epidemic model that utilizes a set of ordinary differential equations with unknown parameters to delineate the transmission process of COVID-19. The model accounts for asymptomatic infections as well as the lag between symptom onset and the confirmation date of infection. To reflect the transmission potential of an infected case, we derive the basic reproduction number from the proposed model. Using the daily reported number of confirmed cases, we describe an estimation procedure for the model parameters, which involves adapting the iterated filter-ensemble adjustment Kalman filter (IF-EAKF) algorithm. To illustrate the use of the proposed model, we examine the COVID-19 data from Quebec for the period from 2 April 2020 to 10 May 2020 and carry out sensitivity studies under a variety of assumptions. Simulation studies are used to evaluate the performance of the proposed model under a variety of settings.
Extreme quantile estimation for partial functional linear regression models with heavy-tailed distributions
In this article, we propose a novel estimator of extreme conditional quantiles in partial functional linear regression models with heavy-tailed distributions. The conventional quantile regression estimators are often unstable at the extreme tails due to data sparsity, especially for heavy-tailed distributions. We first estimate the slope function and the partially linear coefficient using a functional quantile regression based on functional principal component analysis, which is a robust alternative to the ordinary least squares regression. The extreme conditional quantiles are then estimated by using a new extrapolation technique from extreme value theory. We establish the asymptotic normality of the proposed estimator and illustrate its finite sample performance by simulation studies and an empirical analysis of diffusion tensor imaging data from a cognitive disorder study.
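The extrapolation step can be sketched with the standard Hill estimator of the tail index combined with Weissman-type extrapolation on simulated Pareto data. The functional covariate handled by the paper's estimator is omitted here, and all tuning values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 5000, 200                  # sample size; number of upper order statistics
x = rng.pareto(2.0, n) + 1.0      # classical Pareto(alpha = 2): true tail index 0.5

xs = np.sort(x)
u = xs[-k - 1]                    # threshold: the (k+1)-th largest observation
# Hill estimator of the extreme-value (tail) index
gamma_hat = float(np.mean(np.log(xs[-k:]) - np.log(u)))

# Weissman extrapolation from the intermediate level k/n down to
# an extreme tail probability p well beyond the range of the data
p = 0.001
q_hat = u * (k / (n * p)) ** gamma_hat
q_true = p ** -0.5                # true 0.999 quantile of Pareto(2)
```

The key point is that the extreme quantile is anchored at a reliably estimable intermediate quantile and extended by the fitted tail index, rather than read off the sparse extreme tail directly.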
A nonlinear sparse neural ordinary differential equation model for multiple functional processes
In this article, we propose a new sparse neural ordinary differential equation (ODE) model to characterize flexible relations among multiple functional processes. We characterize the latent states of the functions via a set of ordinary differential equations. We then model the dynamic changes of the latent states using a deep neural network (DNN) with a specially designed architecture and a sparsity-inducing regularization. The new model is able to capture both nonlinear and sparse dependent relations among multivariate functions. We develop an efficient optimization algorithm to estimate the unknown weights for the DNN under the sparsity constraint. We establish both the algorithmic convergence and selection consistency, which constitute the theoretical guarantees of the proposed method. We illustrate the efficacy of the method through simulations and a gene regulatory network example.
Estimated reproduction ratios in the SIR model
The aim of this article is to understand the extreme variability in estimates of the reproduction ratio observed in practice. For expository purposes, we consider a discrete-time, stochastic version of the susceptible-infected-recovered model and introduce different approximate maximum likelihood estimators of the reproduction ratio. We carefully discuss the properties of these estimators and illustrate, by a Monte Carlo study, the widths of confidence intervals for the reproduction ratio.
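A minimal sketch, assuming a chain-binomial discretization and moment-type estimators (simpler than the approximate maximum likelihood estimators in the article): simulate the discrete-time stochastic SIR model, then match total event counts to their conditional expectations.

```python
import numpy as np

rng = np.random.default_rng(9)
N = 100_000
beta_true, gamma_true = 0.4, 0.2          # true reproduction ratio = 0.4 / 0.2 = 2
S, I = N - 10, 10
S_hist, I_hist, new_inf, recov = [], [], [], []
for _ in range(300):
    if I == 0:
        break
    p_inf = 1.0 - np.exp(-beta_true * I / N)       # chain-binomial infection probability
    dI = rng.binomial(S, p_inf)                    # new infections this step
    dR = rng.binomial(I, 1.0 - np.exp(-gamma_true))  # recoveries this step
    S_hist.append(S); I_hist.append(I)
    new_inf.append(dI); recov.append(dR)
    S, I = S - dI, I + dI - dR

S_hist = np.array(S_hist, float)
I_hist = np.array(I_hist, float)
# Moment-type estimators: match total recoveries and infections
# to their conditional expectations (small-hazard approximation for beta).
gamma_hat = -np.log(1.0 - np.sum(recov) / np.sum(I_hist))
beta_hat = np.sum(new_inf) / np.sum(S_hist * I_hist / N)
R0_hat = beta_hat / gamma_hat
```

Even with the full trajectory observed, the estimate fluctuates across runs; with partially observed or short trajectories, as in practice, this variability widens considerably, which is the phenomenon the article examines.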
Under-reporting of COVID-19 in the Northern Health Authority region of British Columbia
Asymptomatic and pauci-symptomatic presentations of COVID-19 along with restrictive testing protocols result in undetected COVID-19 cases. Estimating undetected cases is crucial to understanding the true severity of the outbreak. We introduce a new hierarchical disease dynamics model based on the N-mixtures hidden population framework. The new models make use of three sets of disease count data per region: reported cases, recoveries and deaths. Treating the first two as under-counted through binomial thinning, we model the true population state at each time point by partitioning the diseased population into the active, recovered and died categories. Both domestic spread and imported cases are considered. These models are applied to estimate the level of under-reporting of COVID-19 in the Northern Health Authority region of British Columbia, Canada, during 30 weeks of the provincial recovery plan. Parameter covariates are easily implemented and used to improve model estimates. We compare two distinct methods of model-fitting for this case study: (1) maximum likelihood estimation, and (2) Bayesian Markov chain Monte Carlo. The two methods agreed exactly in their estimates of the under-reporting rate. When accounting for changes in weekly testing volumes, we found under-reporting rates varying from 60.2% to 84.2%.
Connectivity-informed adaptive regularization for generalized outcomes
One of the challenging problems in neuroimaging is the principled incorporation of information from different imaging modalities. Data from each modality are frequently analyzed separately using, for instance, dimensionality reduction techniques, which result in a loss of mutual information. We propose a novel regularization method, generalized ridgified Partially Empirical Eigenvectors for Regression (griPEER), to estimate associations between the brain structure features and a scalar outcome within the generalized linear regression framework. griPEER improves the regression coefficient estimation by providing a principled approach to use external information from the structural brain connectivity. Specifically, we incorporate a penalty term, derived from the structural connectivity Laplacian matrix, in the penalized generalized linear regression. In this work, we address both theoretical and computational issues and demonstrate the robustness of our method despite incomplete information about the structural brain connectivity. In addition, we provide a significance testing procedure for performing inference on the estimated coefficients. Finally, griPEER is evaluated both in extensive simulation studies and using clinical data to classify HIV+ and HIV- individuals.
