JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY

Covariate-assisted bounds on causal effects with instrumental variables
Levis AW, Bonvini M, Zeng Z, Keele L and Kennedy EH
When an exposure of interest is confounded by unmeasured factors, an instrumental variable (IV) can be used to identify and estimate certain causal contrasts. Identification of the marginal average treatment effect (ATE) from IVs relies on strong untestable structural assumptions. When one is unwilling to assert such structure, IVs can nonetheless be used to construct bounds on the ATE. Famously, Alexander Balke and Judea Pearl proved tight bounds on the ATE for a binary outcome, in a randomized trial with noncompliance and no covariate information. We demonstrate how these bounds remain useful in observational settings with baseline confounders of the IV, as well as randomized trials with measured baseline covariates. The resulting bounds on the ATE are nonsmooth functionals, and thus standard nonparametric efficiency theory is not immediately applicable. To remedy this, we propose (1) under a novel margin condition, influence function-based estimators of the bounds that can attain parametric convergence rates when the nuisance functions are modelled flexibly, and (2) estimators of smooth approximations of these bounds. We propose extensions to continuous outcomes, explore finite sample properties in simulations, and illustrate the proposed estimators in an observational study targeting the effect of higher education on wages.
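As a concrete point of reference, the classical Balke–Pearl bounds (with no covariates) can be computed by linear programming over the 16 principal strata generated by a binary instrument Z, treatment A, and outcome Y. The sketch below illustrates only that classical construction, not the covariate-assisted estimators proposed in the paper; the input probabilities are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

# Compliance types: treatment received A as a function of the instrument Z.
compliance = {
    "never":    lambda z: 0,
    "complier": lambda z: z,
    "defier":   lambda z: 1 - z,
    "always":   lambda z: 1,
}
# Outcome types: Y as a function of received treatment A (exclusion restriction).
outcome = {
    "doomed": lambda a: 0,      # Y(0) = Y(1) = 0
    "helped": lambda a: a,      # Y(a) = a
    "hurt":   lambda a: 1 - a,  # Y(a) = 1 - a
    "immune": lambda a: 1,      # Y(0) = Y(1) = 1
}
strata = [(c, r) for c in compliance for r in outcome]  # 16 principal strata

def balke_pearl_bounds(p):
    """p[z][(a, y)] = P(A=a, Y=y | Z=z) for z in {0, 1}; returns ATE bounds."""
    effect = {"doomed": 0.0, "helped": 1.0, "hurt": -1.0, "immune": 0.0}
    obj = np.array([effect[r] for _, r in strata])  # Y(1) - Y(0) per stratum
    A_eq, b_eq = [], []
    for z in (0, 1):
        for a in (0, 1):
            for y in (0, 1):
                row = [1.0 if (compliance[c](z) == a and outcome[r](a) == y) else 0.0
                       for c, r in strata]
                A_eq.append(row)
                b_eq.append(p[z][(a, y)])
    A_eq.append([1.0] * len(strata))  # stratum probabilities sum to one
    b_eq.append(1.0)
    lower = linprog(obj, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    upper = linprog(-obj, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    return lower.fun, -upper.fun

# Hypothetical observed law P(A, Y | Z); each z-slice sums to one.
p = {0: {(0, 0): 0.45, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.25},
     1: {(0, 0): 0.15, (0, 1): 0.10, (1, 0): 0.20, (1, 1): 0.55}}
print(balke_pearl_bounds(p))  # (lower bound, upper bound) on the ATE
```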
Causal mediation analysis: selection with asymptotically valid inference
Jones J, Ertefaie A and Strawderman RL
Researchers are often interested in learning not only the effect of treatments on outcomes, but also the mechanisms that transmit these effects. A mediator is a variable that is affected by treatment and subsequently affects outcome. Existing methods for penalized mediation analyses may lead to ignoring important mediators and either assume that finite-dimensional linear models are sufficient to remove confounding bias, or perform no confounding control at all. In practice, these assumptions may not hold. We propose a method that considers the confounding functions as nuisance parameters to be estimated using data-adaptive methods. We then use a novel regularization method applied to this objective function to identify a set of important mediators. We consider natural direct and indirect effects as our target parameters. We then proceed to derive the asymptotic properties of our estimators and establish the oracle property under specific assumptions. Asymptotic results are also presented in a local setting, which contrast the proposal with the standard adaptive lasso. We also propose a perturbation bootstrap technique to provide asymptotically valid postselection inference for the mediated effects of interest. The performance of these methods will be discussed and demonstrated through simulation studies.
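For readers unfamiliar with the target parameters: writing M(a) for the counterfactual mediator and Y(a, m) for the counterfactual outcome under exposure a and mediator value m, the natural direct and indirect effects are conventionally defined as

```latex
\mathrm{NDE} = E\{Y(1, M(0))\} - E\{Y(0, M(0))\}, \qquad
\mathrm{NIE} = E\{Y(1, M(1))\} - E\{Y(1, M(0))\},
```

which sum to the total effect E{Y(1, M(1))} - E{Y(0, M(0))}.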
Robust evaluation of longitudinal surrogate markers with censored data
Agniel D and Parast L
The development of statistical methods to evaluate surrogate markers is an active area of research. In many clinical settings, the surrogate marker is not simply a single measurement but is instead a longitudinal trajectory of measurements over time, e.g. fasting plasma glucose measured every 6 months for 3 years. In general, available methods developed for the single-surrogate setting cannot accommodate a longitudinal surrogate marker. Furthermore, many of the methods have not been developed for use with primary outcomes that are time-to-event outcomes and/or subject to censoring. In this paper, we propose robust methods to evaluate a longitudinal surrogate marker in a censored time-to-event outcome setting. Specifically, we propose a method to define and estimate the proportion of the treatment effect on a censored primary outcome that is explained by the treatment effect on a longitudinal surrogate marker measured up to a given landmark time. We accommodate both potential censoring of the primary outcome and of the surrogate marker. A simulation study demonstrates good finite-sample performance of our proposed methods. We illustrate our procedures by examining repeated measures of fasting plasma glucose, a surrogate marker for diabetes diagnosis, using data from the diabetes prevention programme.
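In the single-surrogate literature, one common formalization of this quantity (which the paper extends to a longitudinal, possibly censored surrogate) is the proportion of treatment effect explained: writing Delta(t) for the treatment effect on the primary outcome at time t and Delta_S(t, t0) for the residual treatment effect after accounting for surrogate information accrued up to a landmark time t0,

```latex
\mathrm{PTE}(t, t_0) \;=\; 1 - \frac{\Delta_S(t, t_0)}{\Delta(t)}
\;=\; \frac{\Delta(t) - \Delta_S(t, t_0)}{\Delta(t)} .
```

The paper's exact definition and estimators may differ in detail; this is only the standard template.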
Robust angle-based transfer learning in high dimensions
Gu T, Han Y and Duan R
Transfer learning improves target model performance by leveraging data from related source populations, especially when target data are scarce. This study addresses the challenge of training high-dimensional regression models with limited target data in the presence of heterogeneous source populations. We focus on a practical setting where only parameter estimates of pretrained source models are available, rather than individual-level source data. For a single source model, we propose a novel angle-based transfer learning (angleTL) method that leverages concordance between source and target model parameters. AngleTL adapts to the signal strength of the target model, unifies several benchmark methods, and mitigates negative transfer when between-population heterogeneity is large. We extend angleTL to incorporate multiple source models, accounting for varying levels of relevance among them. Our high-dimensional asymptotic analysis provides insights into when a source model benefits the target model and demonstrates the superiority of angleTL over other methods. Extensive simulations validate these findings and highlight the feasibility of applying angleTL to transfer genetic risk prediction models across multiple biobanks.
A statistical view of column subset selection
Sood A and Hastie T
We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as column subset selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of principal variables. This paper shows that these two approaches are equivalent, and moreover, both can be viewed as maximum-likelihood estimation within a certain semi-parametric model. Within this model, we establish suitable conditions under which the CSS estimate is consistent in high dimensions, specifically in the proportional asymptotic regime where the ratio of the number of variables to the sample size converges to a constant. Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
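To make the summary-statistics point concrete, the following is a minimal greedy column subset selection sketch that operates on a covariance (or Gram) matrix alone, adding at each step the column that most increases the variance explained in the remaining columns. This greedy heuristic is shown for illustration and is not necessarily the estimator analysed in the paper; the simulated data are hypothetical.

```python
import numpy as np

def greedy_css(S, k):
    """Greedy column subset selection from a p x p covariance/Gram matrix S.

    Selects k columns J aiming to minimize the unexplained variance
    tr(S) - tr(S[:, J] S[J, J]^{-1} S[J, :]), using summary statistics only.
    """
    p = S.shape[0]
    selected = []
    for _ in range(k):
        best_j, best_score = None, -np.inf
        for j in range(p):
            if j in selected:
                continue
            J = selected + [j]
            SJJ = S[np.ix_(J, J)]
            SJ = S[:, J]
            explained = np.trace(SJ @ np.linalg.solve(SJJ, SJ.T))
            if explained > best_score:
                best_j, best_score = j, explained
        selected.append(best_j)
    return selected

# Hypothetical example: only the covariance matrix is passed to greedy_css.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
X[:, 3] = X[:, 0] + 0.1 * rng.standard_normal(200)  # a nearly redundant column
S = np.cov(X, rowvar=False)
print(greedy_css(S, k=3))  # indices of the selected representative columns
```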
Probabilistic Richardson extrapolation
Oates CJ, Karvonen T, Teckentrup AL, Strocchi M and Niederer SA
For over a century, extrapolation methods have provided a powerful tool to improve the convergence order of a numerical method. However, these tools are not well-suited to modern computer codes, where multiple continua are discretized and convergence orders are not easily analysed. To address this challenge, we present a probabilistic perspective on Richardson extrapolation, a point of view that unifies classical extrapolation methods with modern multi-fidelity modelling, and handles uncertain convergence orders by allowing these to be statistically estimated. The approach is developed using Gaussian processes, leading to Gauss–Richardson extrapolation (GRE). Conditions are established under which extrapolation using the conditional mean achieves a polynomial (or even an exponential) speed-up compared to the original numerical method. Further, the probabilistic formulation unlocks the possibility of experimental design, casting the selection of fidelities as a continuous optimization problem, which can then be (approximately) solved. A case study involving a computational cardiac model demonstrates that practical gains in accuracy can be achieved using the GRE method.
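For background on the classical method being generalized: Richardson extrapolation combines evaluations at mesh sizes h and h/2 to cancel the leading O(h^p) error term. The sketch below applies it to a first-order finite-difference derivative; it illustrates classical extrapolation only, not the probabilistic GRE formulation.

```python
import numpy as np

def richardson(f_h, f_h2, p):
    """Combine approximations at mesh sizes h and h/2, assuming error ~ C * h^p."""
    return (2**p * f_h2 - f_h) / (2**p - 1)

f = np.sin
x, h, p = 1.0, 0.1, 1                          # forward differences: first-order error
fd = lambda step: (f(x + step) - f(x)) / step
approx_h, approx_h2 = fd(h), fd(h / 2)
extrapolated = richardson(approx_h, approx_h2, p)

truth = np.cos(x)
print(abs(approx_h - truth), abs(approx_h2 - truth), abs(extrapolated - truth))
# The extrapolated error is roughly O(h^2) rather than O(h).
```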
Product centred Dirichlet processes for Bayesian multiview clustering
Dombowsky A and Dunson DB
While there is an immense literature on Bayesian methods for clustering, the multiview case has received little attention. This problem focuses on obtaining distinct but statistically dependent clusterings in a common set of entities for different data types. An example is clustering patients into subgroups, with subgroup membership varying according to the domain of the patient variables. A challenge is how to model the across-view dependence between the partitions of patients into subgroups. The complexities of the partition space make standard methods to model dependence, such as correlation, infeasible. In this article, we propose CLustering with Independence Centring (CLIC), a clustering prior that uses a single parameter to explicitly model dependence between clusterings across views. CLIC is induced by the product centred Dirichlet process (PCDP), a novel hierarchical prior that bridges between independent and equivalent partitions. We show appealing theoretical properties, provide a finite approximation and prove its accuracy, present a marginal Gibbs sampler for posterior computation, and derive closed form expressions for the marginal and joint partition distributions for the CLIC model. On synthetic data and in an application to epidemiology, CLIC accurately characterizes view-specific partitions while providing inference on the dependence level.
Nonparametric estimation via partial derivatives
Dai X
Traditional nonparametric estimation methods often lead to a slow convergence rate in large dimensions and require unrealistically large dataset sizes for reliable conclusions. We develop an approach based on partial derivatives, either observed or estimated, to effectively estimate the function at near-parametric convergence rates. This novel approach and computational algorithm could lead to methods useful to practitioners in many areas of science and engineering. Our theoretical results reveal behaviour universal to this class of nonparametric estimation problems. We explore a general setting involving tensor product spaces and build upon the smoothing spline analysis of variance framework. For d-dimensional models under full interaction, the optimal rates with gradient information on p covariates are identical to those for the (d − p)-interaction models without gradients and, therefore, the models are immune to the curse of interaction. For additive models, the optimal rates using gradient information are root-n, thus achieving the parametric rate. We demonstrate aspects of the theoretical results through synthetic and real data applications.
A focusing framework for testing bi-directional causal effects in Mendelian randomization
Li S and Ye T
Mendelian randomization (MR) is a powerful method that uses genetic variants as instrumental variables to infer the causal effect of a modifiable exposure on an outcome. We study inference for bi-directional causal relationships and causal directions with possibly pleiotropic genetic variants. We show that assumptions for common MR methods are often impossible to satisfy or too stringent given the potential bi-directional relationships. We propose a new focusing framework for testing bi-directional causal effects, which can be coupled with many state-of-the-art MR methods. We provide theoretical guarantees for our proposal and demonstrate its performance using several simulated and real datasets.
Extended fiducial inference: toward an automated process of statistical inference
Liang F, Kim S and Sun Y
While fiducial inference was widely considered a big blunder by R.A. Fisher, the goal he initially set, 'inferring the uncertainty of model parameters on the basis of observations', has been continually pursued by many statisticians. To this end, we develop a new statistical inference method called extended fiducial inference (EFI). The new method achieves the goal of fiducial inference by leveraging advanced statistical computing techniques while remaining scalable for big data. EFI involves jointly imputing random errors realized in observations using stochastic gradient Markov chain Monte Carlo and estimating the inverse function using a sparse deep neural network (DNN). The consistency of the sparse DNN estimator ensures that the uncertainty embedded in observations is properly propagated to model parameters through the estimated inverse function, thereby validating downstream statistical inference. Compared to frequentist and Bayesian methods, EFI offers significant advantages in parameter estimation and hypothesis testing. Specifically, EFI provides higher fidelity in parameter estimation, especially when outliers are present in the observations, and eliminates the need for theoretical reference distributions in hypothesis testing, thereby automating the statistical inference process. EFI also provides an innovative framework for semisupervised learning.
Catch me if you can: signal localization with knockoff e-values
Gablenz P and Sabatti C
We consider problems where many, somewhat redundant, hypotheses are tested and we are interested in reporting the most precise rejections, with false discovery rate (FDR) control. This is the case, for example, when researchers are interested both in individual hypotheses and in group hypotheses corresponding to intersections of sets of the original hypotheses, at several resolution levels. A concrete application is in genome-wide association studies, where, depending on the signal strengths, it might be possible to resolve the influence of individual genetic variants on a phenotype with greater or lower precision. To adapt to the unknown signal strength, analyses are conducted at multiple resolutions and researchers are most interested in the more precise discoveries. Assuring FDR control on the reported findings with these adaptive searches is, however, often impossible. To design a multiple comparison procedure that allows for an adaptive choice of resolution with FDR control, we leverage e-values and linear programming. We adapt this approach to problems where knockoffs and group knockoffs have been successfully applied to test conditional independence hypotheses. We demonstrate its efficacy by analysing data from the UK Biobank.
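For context, e-values are typically converted into FDR-controlling rejections via the e-BH procedure, a minimal implementation of which is sketched below; the paper's localization step, which combines knockoff e-values across resolutions via linear programming, is not reproduced here. The example e-values are hypothetical.

```python
import numpy as np

def ebh(e_values, alpha):
    """e-BH: reject the hypotheses with the k* largest e-values, where
    k* = max{k : k * e_[k] >= m / alpha} and e_[k] is the k-th largest e-value."""
    e = np.asarray(e_values, dtype=float)
    m = e.size
    order = np.argsort(-e)          # indices sorted by decreasing e-value
    sorted_e = e[order]
    ks = np.arange(1, m + 1)
    ok = ks * sorted_e >= m / alpha
    if not ok.any():
        return np.array([], dtype=int)
    k_star = ks[ok].max()
    return np.sort(order[:k_star])

# Hypothetical e-values: a few large (signal) among many near or below one (noise).
e_vals = [60.0, 1.2, 55.0, 0.8, 50.0, 0.5, 1.0, 0.9, 0.7, 1.1]
print(ebh(e_vals, alpha=0.1))  # indices of the rejected hypotheses
```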
Mode-wise principal subspace pursuit and matrix spiked covariance model
Tang R, Yuan M and Zhang AR
This paper introduces a novel framework called Mode-wise Principal Subspace Pursuit (MOP-UP) to extract hidden variations in both the row and column dimensions for matrix data. To enhance the understanding of the framework, we introduce a class of matrix-variate spiked covariance models that serve as inspiration for the development of the MOP-UP algorithm. The MOP-UP algorithm consists of two steps: Average Subspace Capture (ASC) and Alternating Projection. These steps are specifically designed to capture the row-wise and column-wise dimension-reduced subspaces which contain the most informative features of the data. ASC utilizes a novel average projection operator as initialization and achieves exact recovery in the noiseless setting. We analyse the convergence and non-asymptotic error bounds of MOP-UP, introducing a blockwise matrix eigenvalue perturbation bound that proves the desired bound, where classic perturbation bounds fail. The effectiveness and practical merits of the proposed framework are demonstrated through experiments on both simulated and real datasets. Lastly, we discuss generalizations of our approach to higher-order data.
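To convey the flavour of the two-step structure, here is a generic alternating-projection sketch for recovering row- and column-mode subspaces from matrix-valued samples. It uses a plain spectral initialization rather than the paper's Average Subspace Capture step, so it should be read as a simplified illustration rather than the MOP-UP algorithm itself; the simulated data are hypothetical.

```python
import numpy as np

def alternating_projection(Xs, r1, r2, n_iter=20):
    """Estimate a column subspace U (p x r1) and row subspace V (q x r2)
    from matrix samples Xs of shape (n, p, q) by alternating eigendecompositions."""
    n, p, q = Xs.shape
    # Spectral initialization of V from the averaged row-mode covariance.
    V = np.linalg.eigh(np.einsum('ipq,ipr->qr', Xs, Xs) / n)[1][:, -r2:]
    for _ in range(n_iter):
        # Update U given V: top eigenvectors of the V-projected column-mode covariance.
        M = np.einsum('ipq,qr->ipr', Xs, V)
        U = np.linalg.eigh(np.einsum('ipr,isr->ps', M, M) / n)[1][:, -r1:]
        # Update V given U.
        N = np.einsum('ipq,ps->isq', Xs, U)
        V = np.linalg.eigh(np.einsum('isq,isr->qr', N, N) / n)[1][:, -r2:]
    return U, V

# Hypothetical example with planted low-dimensional structure plus noise.
rng = np.random.default_rng(1)
U0 = np.linalg.qr(rng.standard_normal((8, 2)))[0]
V0 = np.linalg.qr(rng.standard_normal((6, 2)))[0]
Xs = np.stack([U0 @ rng.standard_normal((2, 2)) @ V0.T + 0.1 * rng.standard_normal((8, 6))
               for _ in range(100)])
U, V = alternating_projection(Xs, r1=2, r2=2)
print(np.linalg.norm(U @ U.T - U0 @ U0.T))  # small values indicate subspace recovery
```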
Two-phase rejective sampling and its asymptotic properties
Yang S and Ding P
Rejective sampling improves design and estimation efficiency of single-phase sampling when auxiliary information in a finite population is available. When such auxiliary information is unavailable, we propose to use two-phase rejective sampling (TPRS), which involves measuring auxiliary variables for the sample of units in the first phase, followed by the implementation of rejective sampling for the outcome in the second phase. We explore the asymptotic design properties of double expansion and regression estimators under TPRS. We show that TPRS enhances the efficiency of the double-expansion estimator, rendering it comparable to a regression estimator. We further refine the design to accommodate varying importance of covariates and extend it to multi-phase sampling. We start with the theory for the population mean and then extend the theory to parameters defined by general estimating equations. Our asymptotic results for TPRS immediately cover the existing single-phase rejective sampling, under which the asymptotic theory has not been fully established.
Model-assisted sensitivity analysis for treatment effects under unmeasured confounding via regularized calibrated estimation
Tan Z
Consider sensitivity analysis for estimating average treatment effects under unmeasured confounding, assumed to satisfy a marginal sensitivity model. At the population level, we provide new representations for the sharp population bounds and doubly robust estimating functions. We also derive new, relaxed population bounds, depending on weighted linear outcome quantile regression. At the sample level, we develop new methods and theory for obtaining not only doubly robust point estimators for the relaxed population bounds with respect to misspecification of a propensity score model or an outcome mean regression model, but also model-assisted confidence intervals which are valid if the propensity score model is correctly specified, but the outcome quantile and mean regression models may be misspecified. The relaxed population bounds reduce to the sharp bounds if outcome quantile regression is correctly specified. For a linear outcome mean regression model, the confidence intervals are also doubly robust. Our methods involve regularized calibrated estimation, with Lasso penalties but carefully chosen loss functions, for fitting propensity score and outcome mean and quantile regression models. We present a simulation study and an empirical application to an observational study on the effects of right-heart catheterization. The proposed method is implemented in the R package RCALsa.
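For orientation, the marginal sensitivity model is typically indexed by a constant Lambda >= 1 bounding how much unmeasured confounding can shift the treatment odds beyond what the measured covariates explain. In one standard formulation, with propensity score pi(X) = P(A = 1 | X) and full-data propensity pi_a(X, y) = P(A = 1 | X, Y(a) = y),

```latex
\Lambda^{-1} \;\le\;
\frac{\pi_a(X, y) / \{1 - \pi_a(X, y)\}}{\pi(X) / \{1 - \pi(X)\}}
\;\le\; \Lambda, \qquad a = 0, 1,
```

with Lambda = 1 recovering the no-unmeasured-confounding assumption; the paper's precise parameterization may differ in notation.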
GENIUS-MAWII: for robust Mendelian randomization with many weak invalid instruments
Ye T, Liu Z, Sun B and Tchetgen Tchetgen E
Mendelian randomization (MR) addresses causal questions using genetic variants as instrumental variables. We propose a new MR method, G-Estimation under No Interaction with Unmeasured Selection with MAny Weak Invalid IVs (GENIUS-MAWII), which simultaneously addresses the two salient challenges in MR: many weak instruments and widespread horizontal pleiotropy. Similar to MR-GENIUS, we use heteroscedasticity of the exposure to identify the treatment effect. We derive influence functions of the treatment effect, construct a continuous updating estimator, and establish its asymptotic properties under a many weak invalid instruments asymptotic regime by developing novel semiparametric theory. We also provide a measure of weak identification, an overidentification test, and a graphical diagnostic tool.
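As background on the heteroscedasticity-based identification inherited from MR-GENIUS: with instrument Z, exposure A, outcome Y, and causal effect beta, the MR-GENIUS moment condition takes (up to centring conventions) the form

```latex
E\bigl[\{Z - E(Z)\}\,\{A - E(A \mid Z)\}\,(Y - \beta A)\bigr] = 0,
```

which identifies beta when the conditional variance of the exposure depends on the instrument, i.e. Cov{Z, Var(A | Z)} is nonzero.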
Testing high-dimensional multinomials with applications to text analysis
Cai TT, Ke ZT and Turner P
Motivated by applications in text mining and discrete distribution inference, we test for equality of probability mass functions of groups of high-dimensional multinomial distributions. Special cases of this problem include global testing for topic models, two-sample testing in authorship attribution, and closeness testing for discrete distributions. A test statistic, which is shown to have an asymptotic standard normal distribution under the null hypothesis, is proposed. This parameter-free limiting null distribution holds true without requiring identical multinomial parameters within each group or equal group sizes. The optimal detection boundary for this testing problem is established, and the proposed test is shown to achieve this optimal detection boundary across the entire parameter space of interest. The proposed method is demonstrated in simulation studies and applied to analyse two real-world datasets to examine, respectively, variation among customer reviews of Amazon movies and the diversity of statistical paper abstracts.
Doubly robust calibration of prediction sets under covariate shift
Yang Y, Kuchibhotla AK and Tchetgen Tchetgen E
Conformal prediction has received tremendous attention in recent years and has offered new solutions to problems in missing data and causal inference; yet these advances have not leveraged modern semi-parametric efficiency theory for more efficient uncertainty quantification. We consider the problem of obtaining well-calibrated prediction regions that can data adaptively account for a shift in the distribution of covariates between training and test data. Under a covariate shift assumption analogous to the standard missing at random assumption, we propose a general framework based on efficient influence functions to construct well-calibrated prediction regions for the unobserved outcome in the test sample without compromising coverage.
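The point of departure here is weighted split-conformal prediction under covariate shift, in which calibration scores are reweighted by the covariate likelihood ratio before taking a quantile. A minimal sketch of that baseline follows; the paper's contribution, the more efficient doubly robust calibration built on efficient influence functions, is not reproduced here, and the function and variable names are illustrative.

```python
import numpy as np

def weighted_conformal_threshold(cal_scores, cal_weights, test_weight, alpha):
    """Weighted split-conformal score threshold for a single test point.

    cal_scores: nonconformity scores, e.g. |y - mu(x)|, on the calibration set.
    cal_weights, test_weight: likelihood ratios w(x) = dP_test(x) / dP_train(x).
    Returns q such that [mu(x_test) - q, mu(x_test) + q] has approximate
    1 - alpha coverage under the shifted covariate distribution.
    """
    scores = np.append(np.asarray(cal_scores, dtype=float), np.inf)  # mass at +inf
    weights = np.append(np.asarray(cal_weights, dtype=float), test_weight)
    probs = weights / weights.sum()
    order = np.argsort(scores)
    cum = np.cumsum(probs[order])
    return scores[order][np.searchsorted(cum, 1 - alpha)]

# Illustrative usage with a fitted regression mu and a density-ratio estimate w:
#   q = weighted_conformal_threshold(np.abs(y_cal - mu(X_cal)), w(X_cal), w(x_test), 0.1)
#   interval = (mu(x_test) - q, mu(x_test) + q)
```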
Interpretable discriminant analysis for functional data supported on random nonlinear domains with an application to Alzheimer's disease
Lila E, Zhang W and Rane Levendovszky S
We introduce a novel framework for the classification of functional data supported on nonlinear, and possibly random, manifold domains. The motivating application is the identification of subjects with Alzheimer's disease from their cortical surface geometry and associated cortical thickness map. The proposed model is based upon a reformulation of the classification problem as a regularized multivariate functional linear regression model. This allows us to adopt a direct approach to the estimation of the most discriminant direction while controlling for its complexity with appropriate differential regularization. Our approach does not require prior estimation of the covariance structure of the functional predictors, which is computationally prohibitive in our application setting. We provide a theoretical analysis of the out-of-sample prediction error of the proposed model and explore the finite sample performance in a simulation setting. We apply the proposed method to a pooled dataset from Alzheimer's Disease Neuroimaging Initiative and Parkinson's Progression Markers Initiative. Through this application, we identify discriminant directions that capture both cortical geometric and thickness predictive features of Alzheimer's disease that are consistent with the existing neuroscience literature.
Gradient synchronization for multivariate functional data, with application to brain connectivity
Chen Y, Lin SC, Zhou Y, Carmichael O, Müller HG and Wang JL
Quantifying the association between components of multivariate random curves is of general interest and is a ubiquitous and basic problem that can be addressed with functional data analysis. An important application is the problem of assessing functional connectivity based on functional magnetic resonance imaging (fMRI), where one aims to determine the similarity of fMRI time courses that are recorded on anatomically separated brain regions. In the functional brain connectivity literature, the static temporal Pearson correlation has been the prevailing measure for functional connectivity. However, recent research has revealed temporally changing patterns of functional connectivity, leading to the study of dynamic functional connectivity. This motivates new similarity measures for pairs of random curves that reflect the dynamic features of functional similarity. Specifically, we introduce gradient synchronization measures in a general setting. These similarity measures are based on the concordance and discordance of the gradients between paired smooth random functions. Asymptotic normality of the proposed estimates is obtained under regularity conditions. We illustrate the proposed synchronization measures via simulations and an application to resting-state fMRI signals from the Alzheimer's Disease Neuroimaging Initiative, where they are found to improve discrimination between subjects with different disease status.
Adaptive bootstrap tests for composite null hypotheses in the mediation pathway analysis
He Y, Song PXK and Xu G
Mediation analysis aims to assess if, and how, a certain exposure influences an outcome of interest through intermediate variables. This problem has recently gained a surge of attention due to the tremendous need for such analyses in scientific fields. Testing for the mediation effect (ME) is greatly challenged by the fact that the underlying null hypothesis (i.e. the absence of MEs) is composite. Most existing mediation tests are overly conservative and thus underpowered. To overcome this significant methodological hurdle, we develop an adaptive bootstrap testing framework that can accommodate different types of composite null hypotheses in the mediation pathway analysis. Applied to the product of coefficients test and the joint significance test, our adaptive testing procedures provide type I error control under the composite null, resulting in much improved statistical power compared to existing tests. Both theoretical properties and numerical examples of the proposed methodology are discussed.
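To make 'composite' concrete: in the simplest linear structural equation setting, with alpha the exposure-to-mediator coefficient and beta the mediator-to-outcome coefficient, the no-mediation null is the union of three qualitatively different cases,

```latex
H_0 : \alpha\beta = 0
\;=\; \{\alpha = 0,\ \beta \neq 0\} \;\cup\; \{\alpha \neq 0,\ \beta = 0\} \;\cup\; \{\alpha = 0,\ \beta = 0\},
```

and classical product-of-coefficients and joint significance tests are conservative at the singular point alpha = beta = 0, which is the case the adaptive bootstrap is designed to handle.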
Ensemble methods for testing a global null
Liu Y, Liu Z and Lin X
Testing a global null is a canonical problem in statistics and has a wide range of applications. In view of the fact that no uniformly most powerful test exists, prior and/or domain knowledge are commonly used to focus on a certain class of alternatives to improve the testing power. However, it is generally challenging to develop tests that are particularly powerful against a certain class of alternatives. In this paper, motivated by the success of ensemble learning methods for prediction or classification, we propose an ensemble framework for testing that mimics the spirit of random forests to deal with the challenges. Our ensemble testing framework aggregates a collection of weak base tests to form a final ensemble test that maintains strong and robust power for global nulls. We apply the framework to four problems about global testing in different classes of alternatives arising from Whole Genome Sequencing (WGS) association studies. Specific ensemble tests are proposed for each of these problems, and their theoretical optimality is established in terms of Bahadur efficiency. Extensive simulations and an analysis of a real WGS dataset are conducted to demonstrate the type I error control and/or power gain of the proposed ensemble tests.
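One simple way to aggregate a collection of base tests into a global test, shown purely for illustration (the paper's ensemble construction and its Bahadur-efficiency analysis are more involved), is the Cauchy combination of the base p-values, which remains valid under dependence among the base tests.

```python
import numpy as np
from scipy.stats import cauchy

def cauchy_combination(pvalues, weights=None):
    """Aggregate base-test p-values into a single global p-value.

    The statistic sum_j w_j * tan{(0.5 - p_j) * pi} is approximately standard
    Cauchy under the global null, even when the base tests are dependent.
    """
    p = np.asarray(pvalues, dtype=float)
    w = np.full(p.size, 1.0 / p.size) if weights is None else np.asarray(weights)
    stat = np.sum(w * np.tan((0.5 - p) * np.pi))
    return cauchy.sf(stat)

# Hypothetical base tests aimed at different classes of alternatives.
print(cauchy_combination([0.30, 0.02, 0.47, 0.001]))
```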