Optimal Transport based Cross-Domain Integration for Heterogeneous Data
Detecting dynamic patterns shared across heterogeneous datasets is a critical yet challenging task in many scientific domains, particularly within the biomedical sciences. Systematic heterogeneity inherent in diverse data sources can significantly hinder the effectiveness of existing machine learning methods in uncovering shared underlying dynamics. Additionally, practical and technical constraints in real-world experimental designs often limit data collection to only a small number of subjects, even when rich, time-dependent measurements are available for each individual. These limited sample sizes further diminish the power to detect common dynamic patterns across subjects. In this article, we propose a novel heterogeneous data integration framework based on optimal transport to extract shared patterns in the conditional mean dynamics of target responses. The key advantage of the proposed method is its ability to enhance discriminative power by reducing heterogeneity unrelated to the signal. This is achieved through the alignment of extracted domain-shared temporal information across multiple datasets from different domains. Our approach is effective regardless of the number of datasets and does not require auxiliary matching information for alignment. Specifically, the method aligns longitudinal data from heterogeneous datasets within a common latent space, capturing shared dynamic patterns while leveraging temporal dependencies within subjects. Theoretically, we establish generalization error bounds for the proposed data integration approach in supervised learning tasks, highlighting a novel tradeoff between data alignment and pattern learning. Additionally, we derive convergence rates for the barycentric projection under Gromov-Wasserstein and fused Gromov-Wasserstein distances. Numerical studies on both simulated data and neuroscience applications demonstrate that the proposed data integration framework substantially improves prediction accuracy by effectively aggregating information across diverse data sources and subjects. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
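As a rough illustration of the barycentric-projection idea behind this kind of cross-domain alignment, the following NumPy sketch computes an entropic optimal-transport coupling between two synthetic point clouds and maps the source points into the target domain. It uses a plain squared-Euclidean cost rather than the (fused) Gromov-Wasserstein objectives studied in the article, and all function names and data are illustrative.

```python
import numpy as np

def sinkhorn_coupling(X, Y, reg=0.05, n_iter=500):
    """Entropic optimal-transport coupling between two point clouds with uniform weights."""
    n, m = X.shape[0], Y.shape[0]
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared Euclidean cost
    C = C / C.max()                                      # rescale for numerical stability
    K = np.exp(-C / reg)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):                              # Sinkhorn fixed-point updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]                   # coupling matrix P (n x m)

def barycentric_projection(P, Y):
    """Map each source point to the P-weighted average of the target points."""
    return (P @ Y) / P.sum(axis=1, keepdims=True)

# toy example: express a source sample in the coordinates of a shifted, rescaled target sample
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                              # source domain
Y = rng.normal(size=(120, 2)) @ np.array([[2.0, 0.3], [0.0, 0.5]]) + 1.0   # target domain
P = sinkhorn_coupling(X, Y)
X_aligned = barycentric_projection(P, Y)
print(X_aligned.mean(axis=0), Y.mean(axis=0))   # the aligned mean matches the target mean
```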
Bayesian Geostatistics Using Predictive Stacking
We develop Bayesian predictive stacking for geostatistical models, where the primary inferential objective is to provide inference on the latent spatial random field and conduct spatial predictions at arbitrary locations. We exploit analytically tractable posterior distributions for regression coefficients of predictors and the realizations of the spatial process conditional upon process parameters. We subsequently combine such inference by stacking these models across the range of values of the hyper-parameters. We devise stacking of means and posterior densities in a manner that is computationally efficient without resorting to iterative algorithms such as Markov chain Monte Carlo (MCMC) and can exploit the benefits of parallel computations. We offer novel theoretical insights into the resulting inference within an infill asymptotic paradigm, and we present empirical results showing that stacked inference is comparable to full sampling-based Bayesian inference at a significantly lower computational cost.
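A minimal sketch of the generic stacking step, assuming one already has held-out predictive densities from each candidate model (the numbers below are hypothetical). The simplex weights are chosen to maximize the stacked log predictive score; the geostatistical model fitting and hyper-parameter grid of the article are not reproduced here.

```python
import numpy as np
from scipy.optimize import minimize

def stacking_weights(pred_dens):
    """pred_dens: (n, K) held-out predictive densities p_k(y_i) for K candidate models.
    Returns simplex weights maximizing the stacked log predictive density."""
    n, K = pred_dens.shape

    def neg_log_score(z):                        # softmax parameterization keeps w on the simplex
        w = np.exp(z - z.max()); w /= w.sum()
        return -np.log(pred_dens @ w).sum()

    res = minimize(neg_log_score, np.zeros(K), method="Nelder-Mead")
    w = np.exp(res.x - res.x.max()); w /= w.sum()
    return w

# toy example: three candidate models scored on four held-out observations (hypothetical values)
dens = np.array([[0.20, 0.05, 0.10],
                 [0.15, 0.10, 0.12],
                 [0.30, 0.02, 0.25],
                 [0.10, 0.40, 0.05]])
print(stacking_weights(dens))
```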
Design-Based Uncertainty for Quasi-Experiments
Design-based frameworks of uncertainty are frequently used in settings where the treatment is (conditionally) randomly assigned. This article develops a design-based framework suitable for analyzing quasi-experimental settings in the social sciences, in which the treatment assignment can be viewed as the realization of some stochastic process but there is concern about unobserved selection into treatment. In our framework, treatments are stochastic, but units may differ in their probabilities of receiving treatment, thereby allowing for rich forms of selection. We provide conditions under which the estimands of popular quasi-experimental estimators correspond to interpretable finite-population causal parameters. We characterize the biases and distortions to inference that arise when these conditions are violated. These results can be used to conduct sensitivity analyses when there are concerns about selection into treatment. Taken together, our results establish a rigorous foundation for quasi-experimental analyses that more closely aligns with the way empirical researchers discuss the variation in the data. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
The Effect of Alcohol Intake on Brain White Matter Microstructural Integrity: A New Causal Inference Framework for Incomplete Phenomic Data
Although substance use, such as alcohol intake, is known to be associated with cognitive decline during aging, its direct influence on the central nervous system remains incompletely understood. In this study, we investigate the influence of alcohol intake frequency on reduction of brain white matter microstructural integrity in the fornix, a brain region considered a promising marker of age-related microstructural degeneration, using a large UK Biobank (UKB) cohort with extensive phenomic data reflecting a comprehensive lifestyle profile. Two major challenges arise: (a) potentially nonlinear confounding effects from phenomic variables and (b) a limited proportion of participants with complete phenomic data. To address these challenges, we develop a novel ensemble learning framework tailored for robust causal inference and introduce a data integration step to incorporate information from UKB participants with incomplete phenomic data, improving estimation efficiency. Our analysis reveals that daily alcohol intake may significantly reduce fractional anisotropy, a neuroimaging-derived measure of white matter structural integrity, in the fornix and increase systolic and diastolic blood pressure levels. Moreover, extensive numerical studies demonstrate the superiority of our method over competing approaches in terms of estimation bias, while outcome regression-based estimators may be preferred when minimizing mean squared error is prioritized. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
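The doubly robust flavor of this kind of analysis can be illustrated with a standard augmented inverse-probability-weighted (AIPW) estimator of the average treatment effect. This is a generic sketch on synthetic data with off-the-shelf nuisance models, not the authors' ensemble learning framework or their data-integration step for incomplete phenomic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

def aipw_ate(X, A, Y):
    """Augmented inverse-probability-weighted (doubly robust) ATE estimate and standard error."""
    ps = LogisticRegression(max_iter=1000).fit(X, A).predict_proba(X)[:, 1]
    mu1 = LinearRegression().fit(X[A == 1], Y[A == 1]).predict(X)
    mu0 = LinearRegression().fit(X[A == 0], Y[A == 0]).predict(X)
    psi = mu1 - mu0 + A * (Y - mu1) / ps - (1 - A) * (Y - mu0) / (1 - ps)
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(Y))

# toy data: binary exposure with nonlinear confounding (all values hypothetical)
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = 0.5 * A + np.sin(X[:, 0]) + X[:, 1] + rng.normal(size=2000)
print(aipw_ate(X, A, Y))
```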
A Latent Variable Model for Individual Degree Measures in Respondent-Driven Sampling
Respondent-driven sampling (RDS) is widely used to collect data from hidden populations in social and biomedical science. Although RDS may provide comprehensive coverage of the target hidden population through social network recruitment, its non-random sampling process poses challenges for generalizing findings beyond the sample. Current analytical methods rely on the network size (degree) reported by respondents to adjust for unequal sampling probabilities. However, the accuracy of the reported degree is questionable due to reporting errors, as evidenced by an unusual frequency of multiples of five and improbably large values. To address this measurement error, we leverage a byproduct of the RDS process (e.g., respondents' recruitment patterns) and develop a novel degree estimator based on a latent variable model of the true degree that accounts for response errors via a reporting mechanism and incorporates recruitment information and external demographic profiles. The effectiveness of the proposed method is demonstrated through a case study and a simulation study, which show accurate and reliable degree estimates, leading to significant improvements in population parameter estimation.
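As a toy stand-in for the reporting-error idea (heaping of reported degrees at multiples of five), the sketch below fits a Poisson latent degree with some probability of the report being rounded to the nearest multiple of five. The article's actual latent variable model additionally exploits recruitment information and external demographic profiles, which are omitted here; the distributional choices below are assumptions made only for illustration.

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import minimize

def neg_loglik(params, y):
    """Poisson latent degree with probability-p heaping of reports to the nearest multiple of 5."""
    lam, p = np.exp(params[0]), 1 / (1 + np.exp(-params[1]))   # keep lam > 0 and p in (0, 1)
    ll = 0.0
    for yi in y:
        lik = (1 - p) * poisson.pmf(yi, lam)                   # exact report
        if yi % 5 == 0:
            heap_support = np.arange(max(yi - 2, 0), yi + 3)   # true degrees that round to yi
            lik += p * poisson.pmf(heap_support, lam).sum()
        ll += np.log(lik + 1e-300)
    return -ll

# simulate heaped degree reports and recover (lambda, heaping probability)
rng = np.random.default_rng(9)
d = rng.poisson(12, size=1000)
heap = rng.random(1000) < 0.4
y = np.where(heap, 5 * np.round(d / 5), d).astype(int)
fit = minimize(neg_loglik, x0=np.array([np.log(10.0), 0.0]), args=(y,), method="Nelder-Mead")
print("lambda:", np.exp(fit.x[0]), "heaping prob:", 1 / (1 + np.exp(-fit.x[1])))
```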
Simultaneous inference for generalized linear models with unmeasured confounders
Tens of thousands of simultaneous hypothesis tests are routinely performed in genomic studies to identify differentially expressed genes. However, due to unmeasured confounders, many standard statistical approaches may be substantially biased. This paper investigates the large-scale hypothesis testing problem for multivariate generalized linear models in the presence of confounding effects. Under arbitrary confounding mechanisms, we propose a unified statistical estimation and inference framework that harnesses orthogonal structures and integrates linear projections into three key stages. It begins by disentangling marginal and uncorrelated confounding effects to recover the latent coefficients. Subsequently, latent factors and primary effects are jointly estimated through lasso-type optimization. Finally, we incorporate projected and weighted bias-correction steps for hypothesis testing. Theoretically, we establish the identification conditions of various effects and non-asymptotic error bounds. We show that the resulting asymptotic tests provide effective Type-I error control as sample and response sizes approach infinity. Numerical experiments demonstrate that the proposed method controls the false discovery rate by the Benjamini-Hochberg procedure and is more powerful than alternative methods. By comparing single-cell RNA-seq counts from two groups of samples, we demonstrate the suitability of adjusting confounding effects when significant covariates are absent from the model.
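The FDR-control step mentioned at the end can be illustrated with a plain Benjamini-Hochberg procedure applied to p-values from hypothetical bias-corrected test statistics; the factor-adjustment and debiasing stages of the paper are not reproduced in this sketch.

```python
import numpy as np
from scipy.stats import norm

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of rejections at FDR level q (BH step-up procedure)."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# hypothetical bias-corrected test statistics for 1000 responses, 50 of them true signals
rng = np.random.default_rng(2)
z = rng.normal(size=1000)
z[:50] += 4.0
pvals = 2 * norm.sf(np.abs(z))
print(benjamini_hochberg(pvals).sum(), "rejections")
```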
Causal Inference for Genomic Data with Multiple Heterogeneous Outcomes
With the evolution of single-cell RNA sequencing techniques into a standard approach in genomics, it has become possible to conduct cohort-level causal inferences based on single-cell-level measurements. However, the individual gene expression levels of interest are not directly observable; instead, only repeated proxy measurements from each individual's cells are available, providing a derived outcome to estimate the underlying outcome for each of many genes. In this paper, we propose a generic semiparametric inference framework for doubly robust estimation with multiple derived outcomes, which also encompasses the usual setting of multiple outcomes when the response of each unit is available. To reliably quantify the causal effects of heterogeneous outcomes, we specialize the analysis to standardized average treatment effects and quantile treatment effects. Through this, we demonstrate the use of the semiparametric inferential results for doubly robust estimators derived from both von Mises expansions and estimating equations. A multiple testing procedure based on Gaussian multiplier bootstrap is tailored for doubly robust estimators to control the false discovery exceedance rate. Applications in single-cell CRISPR perturbation analysis and individual-level differential expression analysis demonstrate the utility of the proposed methods and offer insights into the usage of different estimands for causal inference in genomics.
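A simplified sketch of a Gaussian multiplier bootstrap for simultaneous testing over many outcomes, assuming estimated influence-function values are available for each outcome. It yields a max-type (family-wise) critical value rather than the paper's false-discovery-exceedance-controlling procedure, and all inputs below are synthetic.

```python
import numpy as np

def multiplier_bootstrap_cutoff(infl, alpha=0.05, B=2000, rng=None):
    """Simultaneous critical value for max |t|-type statistics via Gaussian multipliers.
    infl: (n, G) matrix of estimated influence-function values, one column per outcome."""
    rng = np.random.default_rng(rng)
    n, G = infl.shape
    centered = infl - infl.mean(axis=0)
    sd = centered.std(axis=0, ddof=1)
    maxima = np.empty(B)
    for b in range(B):
        xi = rng.normal(size=n)                        # Gaussian multipliers
        boot = (xi @ centered) / (np.sqrt(n) * sd)     # perturbed, studentized means
        maxima[b] = np.abs(boot).max()
    return np.quantile(maxima, 1 - alpha)

# hypothetical influence functions for 200 outcomes from a doubly robust estimator
rng = np.random.default_rng(3)
infl = rng.normal(size=(500, 200))
tstats = np.sqrt(500) * infl.mean(axis=0) / infl.std(axis=0, ddof=1)
cutoff = multiplier_bootstrap_cutoff(infl)
print((np.abs(tstats) > cutoff).sum(), "simultaneous rejections")
```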
Inferring Covariance Structure from Multiple Data Sources via Subspace Factor Analysis
Factor analysis provides a canonical framework for imposing lower-dimensional structure such as sparse covariance in high-dimensional data. High-dimensional data on the same set of variables are often collected under different conditions, for instance in replication studies across research groups. In such cases, it is natural to seek to learn the shared versus condition-specific structure. Existing hierarchical extensions of factor analysis have been proposed, but face practical issues including identifiability problems. To address these shortcomings, we propose a class of SUbspace Factor Analysis (SUFA) models, which characterize variation across groups at the level of a lower-dimensional subspace. We prove that the proposed class of SUFA models leads to identifiability of the shared versus group-specific components of the covariance, and study their posterior contraction properties. Taking a Bayesian approach, these contributions are developed alongside efficient posterior computation algorithms. Our sampler fully integrates out latent variables, is easily parallelizable and has complexity that does not depend on sample size. We illustrate the methods through application to integration of multiple gene expression datasets relevant to immunology.
Who Are We Missing? A Principled Approach to Characterizing the Underrepresented Population
Randomized controlled trials (RCTs) serve as the cornerstone for understanding causal effects, yet extending inferences to target populations presents challenges due to effect heterogeneity and underrepresentation. Our paper addresses the critical issue of identifying and characterizing underrepresented subgroups in RCTs, proposing a novel framework for refining target populations to improve generalizability. We introduce an optimization-based approach, Rashomon Set of Optimal Trees (ROOT), to characterize underrepresented groups. ROOT optimizes the target subpopulation distribution by minimizing the variance of the target average treatment effect estimate, ensuring more precise treatment effect estimations. Notably, ROOT generates interpretable characteristics of the underrepresented population, aiding researchers in effective communication. Our approach demonstrates improved precision and interpretability compared to alternatives, as illustrated with synthetic data experiments. We apply our methodology to extend inferences from the Starting Treatment with Agonist Replacement Therapies (START) trial, which investigated the effectiveness of medication for opioid use disorder, to the real-world population represented by the Treatment Episode Dataset: Admissions (TEDS-A). By refining target populations using ROOT, our framework offers a systematic approach to enhance decision-making accuracy and inform future trials in diverse populations.
Estimation and Inference of Quantile Spatially Varying Coefficient Models Over Complicated Domains
This paper presents a flexible quantile spatially varying coefficient model (QSVCM) for the regression analysis of spatial data. The proposed model enables researchers to assess the dependence of conditional quantiles of the response variable on covariates while accounting for spatial nonstationarity. Our approach facilitates learning and interpreting heterogeneity in spatial data distributed over complex or irregular domains. We introduce a quantile regression method that utilizes bivariate penalized splines over triangulations to estimate unknown functional coefficients. We establish the convergence of the proposed estimators, demonstrating their optimal convergence rate under certain regularity conditions. An efficient optimization algorithm is developed using the alternating direction method of multipliers (ADMM). We develop wild residual bootstrap-based pointwise confidence intervals for the QSVCM quantile coefficients. Furthermore, we construct reliable conformal prediction intervals for the response variable using the proposed QSVCM. Simulation studies demonstrate the strong finite-sample performance of the proposed methods. Lastly, we illustrate the practical applicability of our methods by analyzing a U.S. mortality dataset together with a supplementary particulate matter (PM) dataset.
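As a one-dimensional stand-in for the spline-based quantile estimation (the paper uses bivariate penalized splines over triangulations with an ADMM solver), the sketch below fits a conditional quantile with a truncated-power cubic spline basis via statsmodels' QuantReg. The data and basis choices are illustrative assumptions, and no roughness penalty is applied.

```python
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

def cubic_spline_basis(x, knots):
    """Truncated-power cubic spline basis (a simple stand-in for triangulation splines)."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

# toy 1-D illustration: the conditional quantile of y varies smoothly with location s
rng = np.random.default_rng(4)
s = rng.uniform(0, 1, 800)
y = np.sin(2 * np.pi * s) + (0.3 + 0.5 * s) * rng.standard_normal(800)
B = cubic_spline_basis(s, knots=np.linspace(0.1, 0.9, 7))
fit = QuantReg(y, B).fit(q=0.9)     # 0.9-quantile fit via check-loss minimization
print(fit.params[:4])
```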
Dynamic Regression of Longitudinal Trajectory Features
Chronic disease studies often collect data on biological and clinical markers at follow-up visits to monitor disease progression. Viewing such longitudinal measurements as governed by latent continuous trajectories, we develop a new dynamic regression framework to investigate the heterogeneity pattern of certain features of the latent individual trajectory that may carry substantive information on disease risk or status. Employing the strategy of multi-level modeling, we formulate the latent individual trajectory feature of interest through a flexible pseudo B-spline model with subject-specific random parameters, and then link it with the observed covariates through quantile regression, avoiding restrictive parametric distributional assumptions that are typically required by standard multi-level longitudinal models. We propose an estimation procedure by adapting the conditional score principle and develop an efficient algorithm for implementation. Our proposals yield estimators with desirable asymptotic properties as well as good finite-sample performance as confirmed by extensive simulation studies. An application of the proposed method to a cohort of participants with mild cognitive impairment (MCI) in the Uniform Data Set (UDS) provides useful insights about the complex heterogeneous presentations of cognitive decline in MCI patients.
Data fusion using weakly aligned sources
We introduce a new data fusion method that utilizes multiple data sources to estimate a smooth, finite-dimensional parameter. Most existing methods only make use of fully aligned data sources that share common conditional distributions of one or more variables of interest. However, in many settings, the scarcity of fully aligned sources can make existing methods require unduly large sample sizes to be useful. Our approach enables the incorporation of weakly aligned data sources that are not perfectly aligned, provided their degree of misalignment is known up to finite-dimensional parameters. We quantify the additional efficiency gains achieved through the integration of these weakly aligned sources. We characterize the semiparametric efficiency bound and provide a general means to construct estimators achieving these efficiency gains. We illustrate our results by fusing data from two harmonized HIV monoclonal antibody prevention efficacy trials to study how a neutralizing antibody biomarker associates with HIV genotype.
Robustifying Likelihoods by Optimistically Re-weighting Data
Likelihood-based inferences have been remarkably successful in wide-spanning application areas. However, even after due diligence in selecting a good model for the data at hand, there is inevitably some amount of model misspecification: outliers, data contamination or inappropriate parametric assumptions such as Gaussianity mean that most models are at best rough approximations of reality. A significant practical concern is that for certain inferences, even small amounts of model misspecification may have a substantial impact; a problem we refer to as brittleness. This article attempts to address the brittleness problem in likelihood-based inferences by choosing the most model-friendly data-generating process in a distance-based neighborhood of the empirical measure. This leads to a new Optimistically Weighted Likelihood (OWL), which robustifies the original likelihood by formally accounting for a small amount of model misspecification. Focusing on total variation (TV) neighborhoods, we study theoretical properties, develop estimation algorithms and illustrate the methodology in applications to mixture models and regression.
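A crude sketch in the spirit of re-weighting the empirical measure toward the model: it hard-trims the eps-fraction of points least compatible with the current Gaussian fit and refits with the remaining weights. This is a simplification (closer to a trimmed likelihood) of the TV-neighborhood optimization described in the abstract, and the data and tuning values are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

def reweighted_gaussian(x, eps=0.05, n_iter=50):
    """Iteratively down-weight the eps-fraction of points least compatible with the
    current Gaussian fit, then refit by weighted maximum likelihood."""
    mu, sigma = x.mean(), x.std()
    for _ in range(n_iter):
        loglik = norm.logpdf(x, mu, sigma)
        w = np.ones_like(x)
        w[np.argsort(loglik)[: int(eps * x.size)]] = 0.0   # zero out the worst-fitting fraction
        w /= w.sum()
        mu = np.sum(w * x)                                   # weighted Gaussian MLE
        sigma = np.sqrt(np.sum(w * (x - mu) ** 2))
    return mu, sigma

# toy data: Gaussian bulk plus 5% gross outliers
rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0, 1, 950), rng.normal(12, 1, 50)])
print("plain MLE:  ", x.mean(), x.std())
print("reweighted: ", reweighted_gaussian(x, eps=0.06))
```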
Estimating Heterogeneous Exposure Effects in the Case-Crossover Design using BART
Epidemiological approaches for examining human health responses to environmental exposures in observational studies often control for confounding by implementing clever matching schemes and using statistical methods based on conditional likelihood. Nonparametric regression models have surged in popularity in recent years as a tool for estimating individual-level heterogeneous effects, which provide a more detailed picture of the exposure-response relationship but can also be aggregated to obtain improved marginal estimates at the population level. In this work we incorporate Bayesian additive regression trees (BART) into the conditional logistic regression model to identify heterogeneous exposure effects in a case-crossover design. Conditional logistic BART (CL-BART) utilizes reversible jump Markov chain Monte Carlo to bypass the conditional conjugacy requirement of the original BART algorithm. Our work is motivated by the growing interest in identifying subpopulations more vulnerable to environmental exposures. We apply CL-BART to a study of the impact of heat waves on people with Alzheimer's disease in California and effect modification by other chronic conditions. Through this application, we also describe strategies to examine heterogeneous odds ratios through variable importance, partial dependence, and lower-dimensional summaries.
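The conditional-likelihood backbone of a case-crossover analysis can be sketched as follows, with a linear predictor standing in for the BART fit; the data are synthetic and the reversible jump machinery of CL-BART is not shown.

```python
import numpy as np
from scipy.optimize import minimize

def neg_conditional_loglik(beta, strata):
    """Conditional logistic log-likelihood for a case-crossover design.
    strata: list of (X, case_idx) pairs; X holds exposures for all days in one subject's stratum."""
    nll = 0.0
    for X, case_idx in strata:
        eta = X @ beta
        nll -= eta[case_idx] - eta.max() - np.log(np.exp(eta - eta.max()).sum())
    return nll

# toy case-crossover data: one case day and three control days per subject (hypothetical)
rng = np.random.default_rng(8)
strata = []
for _ in range(300):
    X = rng.normal(size=(4, 2))                    # e.g., heat exposure and another time-varying covariate
    probs = np.exp(X @ np.array([0.8, 0.0]))
    case_idx = rng.choice(4, p=probs / probs.sum())
    strata.append((X, case_idx))
fit = minimize(neg_conditional_loglik, x0=np.zeros(2), args=(strata,), method="BFGS")
print("log odds ratio per unit exposure:", fit.x)
```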
Federated Adaptive Causal Estimation (FACE) of Target Treatment Effects
Federated learning of causal estimands may greatly improve estimation efficiency by leveraging data from multiple study sites, but robustness to heterogeneity and model misspecifications is vital for ensuring validity. We develop a Federated Adaptive Causal Estimation (FACE) framework to incorporate heterogeneous data from multiple sites to provide treatment effect estimation and inference for a flexibly specified target population of interest. FACE accounts for site-level heterogeneity in the distribution of covariates through density ratio weighting. To safely incorporate source sites and avoid negative transfer, we introduce an adaptive weighting procedure via a penalized regression, which achieves both consistency and optimal efficiency. Our strategy is communication-efficient and privacy-preserving, allowing participating sites to share summary statistics only once with other sites. We conduct both theoretical and numerical evaluations of FACE and apply it to conduct a comparative effectiveness study of BNT162b2 (Pfizer) and mRNA-1273 (Moderna) vaccines on COVID-19 outcomes in U.S. veterans using electronic health records from five VA regional sites. We show that compared to traditional methods, FACE meaningfully increases the precision of treatment effect estimates, with reductions in standard errors ranging from 26% to 67%.
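A minimal sketch of two ingredients mentioned here: density-ratio weighting of a source site toward a target covariate distribution, and a simple precision-weighted combination of site-level summaries. The paper's adaptive penalized weighting and privacy-preserving protocol are not reproduced, and all numbers are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio_weights(X_source, X_target):
    """Weights that tilt a source-site sample toward the target covariate distribution."""
    X = np.vstack([X_source, X_target])
    z = np.r_[np.zeros(len(X_source)), np.ones(len(X_target))]
    p = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X_source)[:, 1]
    w = p / (1 - p)                  # odds ~ target density / source density (up to a constant)
    return w / w.mean()

def precision_weighted_combine(estimates, ses):
    """Combine site-specific effect estimates by inverse-variance weighting."""
    prec = 1.0 / np.asarray(ses) ** 2
    est = np.sum(prec * np.asarray(estimates)) / prec.sum()
    return est, np.sqrt(1.0 / prec.sum())

# toy covariate shift between one source site and the target population
rng = np.random.default_rng(10)
X_src = rng.normal(0.0, 1.0, size=(1000, 2))
X_tgt = rng.normal(0.5, 1.0, size=(400, 2))
w = density_ratio_weights(X_src, X_tgt)
print("effective sample size of weighted source:", w.sum() ** 2 / (w ** 2).sum())

# hypothetical site-level summaries (estimate, standard error) combined across sites
print(precision_weighted_combine([0.12, 0.08, 0.15], [0.04, 0.06, 0.05]))
```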
Comparison of Longitudinal Trajectories Using a High-dimensional Partial Linear Semiparametric Mixed-Effects Model
In longitudinal research, it is essential to compare sets of trajectories, commonly seen as changes over time in different treatment or patient groups. This paper presents a partial linear semiparametric mixed-effects model (PLSMM) for the analysis and comparison of nonlinear longitudinal trajectories with high-dimensional covariates across groups. Our flexible modeling framework can effectively handle complex temporal effects and extensive data while providing statistical inference. This method is particularly useful for evaluating differences in both linear and nonlinear components between groups, with a key strength being its ability to model nonlinear patterns without requiring prior knowledge of the functional forms. Instead, it employs a dictionary search strategy to automatically select appropriate basis functions to capture the nonlinear trends. This approach is also capable of handling longitudinal observations with irregular time points. A novel debiasing procedure is proposed for the post-selection inference on the linear components of PLSMM, and a bootstrap method is used for the comparison of nonlinear components. The model has been tested in different simulation settings and applied to a cohort study examining the evolution of oral concentration in young children from birth to two years of age in different racial groups.
Identifying genetic variants for brain connectivity using Ball Covariance Ranking and Aggregation
Understanding the genetic architecture of brain functions is essential to clarify the biological etiologies of behavioral and psychiatric disorders. Functional connectivity, representing pairwise correlations of neural activities between brain regions, is moderately heritable. Current methods to identify single nucleotide polymorphisms (SNPs) linked to functional connectivity either neglect the complex structure of functional connectivity or fail to control false discoveries. Therefore, we propose a SNP-set hypothesis test, Ball Covariance Ranking and Aggregation (BCRA), to select and test the significance of SNP sets related to functional connectivity, incorporating matrix structure and controlling the false discovery rate. Additionally, we present subsample-BCRA, a faster version for large-scale datasets. Simulation studies show both methods effectively detect SNPs with interactive structures, with subsample-BCRA reducing the running time roughly 700-fold. Applying our method to UK Biobank data from 34,129 individuals, we identify 10 SNP-sets with 29 SNPs significantly impacting functional connectivity. Gene-based analyses reveal three SNPs as eQTLs of a gene known to alter functional connectivity. We also detect nine novel genes associated with behavioral and psychiatric disorders, whose connections to brain functions remain unexplored. Our findings improve our understanding of the genetic basis for brain connectivity and showcase our method's utility for broader applications.
Bayesian Clustering via Fusing of Localized Densities
Bayesian clustering typically relies on mixture models, with each component interpreted as a different cluster. After defining a prior for the component parameters and weights, Markov chain Monte Carlo (MCMC) algorithms are commonly used to produce samples from the posterior distribution of the component labels. The data are then clustered by minimizing the expectation of a clustering loss function that favors similarity to the component labels. Unfortunately, although these approaches are routinely implemented, clustering results are highly sensitive to kernel misspecification. For example, if Gaussian kernels are used but the true density of data within a cluster is even slightly non-Gaussian, then clusters will be broken into multiple Gaussian components. To address this problem, we develop Fusing of Localized Densities (FOLD), a novel clustering method that melds components together using the posterior of the kernels. FOLD has a fully Bayesian decision theoretic justification, naturally leads to uncertainty quantification, can be easily implemented as an add-on to MCMC algorithms for mixtures, and favors a small number of distinct clusters. We provide theoretical support for FOLD including clustering optimality under kernel misspecification. In simulated experiments and real data, FOLD outperforms competitors by minimizing the number of clusters while inferring meaningful group structure. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
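One way to see the "fuse overfitted kernels" idea is to deliberately overfit a Gaussian mixture and then merge components that are close in Hellinger distance (available in closed form for Gaussians). This plug-in heuristic is only loosely related to FOLD's fully Bayesian decision-theoretic procedure; the threshold and data below are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def hellinger_gaussians(m1, S1, m2, S2):
    """Closed-form Hellinger distance between two multivariate Gaussians."""
    S = 0.5 * (S1 + S2)
    d = m1 - m2
    bc = (np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25
          / np.sqrt(np.linalg.det(S))) * np.exp(-0.125 * d @ np.linalg.solve(S, d))
    return np.sqrt(max(1.0 - bc, 0.0))

# two mildly non-Gaussian clusters, overfit with 8 Gaussian components, then merged
rng = np.random.default_rng(6)
X = np.vstack([rng.standard_t(df=4, size=(300, 2)) - 4,
               rng.standard_t(df=4, size=(300, 2)) + 4])
gmm = GaussianMixture(n_components=8, random_state=0).fit(X)
K = gmm.n_components
D = np.zeros((K, K))
for i in range(K):
    for j in range(i + 1, K):
        D[i, j] = D[j, i] = hellinger_gaussians(gmm.means_[i], gmm.covariances_[i],
                                                gmm.means_[j], gmm.covariances_[j])
component_to_cluster = fcluster(linkage(squareform(D), method="average"), t=0.9, criterion="distance")
cluster_of_point = component_to_cluster[gmm.predict(X)]   # points inherit their merged component's cluster
print("merged clusters:", np.unique(cluster_of_point).size)
```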
An efficient coalescent model for heterochronously sampled molecular data
Molecular sequence variation at a locus informs about the evolutionary history of the sample and past population size dynamics. The Kingman coalescent is used in a generative model of molecular sequence variation to infer evolutionary parameters. However, it is well understood that inference under this model does not scale well with sample size. Here, we build on recent work based on a lower-resolution coalescent process, the Tajima coalescent, to model longitudinal samples. While the Kingman coalescent models the ancestry of labeled individuals, we model the ancestry of individuals labeled by their sampling time. We propose a new inference scheme for the reconstruction of effective population size trajectories based on this model and the infinite-sites mutation model. Modeling longitudinal samples is necessary in applications (e.g., ancient DNA and RNA from rapidly evolving pathogens such as viruses) and statistically desirable (variance reduction and parameter identifiability). We propose an efficient algorithm to calculate the likelihood and employ a Bayesian nonparametric procedure to infer the population size trajectory. We provide a new MCMC sampler to explore the space of heterochronous Tajima's genealogies and model parameters. We compare our procedure with state-of-the-art methodologies in simulations and an application to ancient bison DNA sequences.
Partial Quantile Tensor Regression
Tensors, characterized as multidimensional arrays, are frequently encountered in modern scientific studies. Quantile regression has the unique capacity to explore how a tensor covariate influences different segments of the response distribution. In this work, we propose a partial quantile tensor regression (PQTR) framework, which applies the core principle of the partial least squares technique in a novel way to achieve effective dimension reduction for quantile regression with a tensor covariate. The proposed PQTR algorithm is computationally efficient and scalable to a large tensor covariate. Moreover, we uncover an appealing latent variable model representation for the PQTR algorithm, justifying a simple population interpretation of the resulting estimator. We further investigate the connection of the PQTR procedure with an envelope quantile tensor regression (EQTR) model, which defines a general set of sparsity conditions tailored to quantile tensor regression. We prove the root-n consistency of the PQTR estimator under the EQTR model, and demonstrate its superior finite-sample performance compared to benchmark methods through simulation studies. We demonstrate the practical utility of the proposed method via an application to a neuroimaging study of post-traumatic stress disorder (PTSD). Results derived from the proposed method are more neurobiologically meaningful and interpretable as compared to those from existing methods.
Matrix Completion When Missing Is Not at Random and Its Applications in Causal Panel Data Models
This paper develops an inferential framework for matrix completion when missing is not at random and without the requirement of strong signals. Our development is based on the observation that if the number of missing entries is small enough compared to the panel size, then they can be estimated well even when missing is not at random. Taking advantage of this fact, we divide the missing entries into smaller groups and estimate each group via nuclear norm regularization. In addition, we show that with appropriate debiasing, our proposed estimate is asymptotically normal even for fairly weak signals. Our work is motivated by recent research on the Tick Size Pilot Program, an experiment conducted by the Securities and Exchange Commission (SEC) to evaluate the impact of widening the tick size on the market quality of stocks from 2016 to 2018. While previous studies were based on traditional regression or difference-in-differences methods by assuming that the treatment effect is invariant with respect to time and unit, our analyses suggest significant heterogeneity across units and intriguing dynamics over time during the pilot program.
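A generic nuclear-norm matrix-completion sketch (soft-impute style singular value thresholding) on a synthetic panel whose missingness depends on the signal. The paper's grouped estimation of the missing entries and its debiasing step for inference are not implemented here; the tuning parameter and data are illustrative.

```python
import numpy as np

def soft_impute(Y, mask, lam=1.0, n_iter=200):
    """Nuclear-norm-regularized matrix completion via iterative SVD soft-thresholding.
    Y: observed panel (values ignored where mask is False); mask: True = observed."""
    M = np.where(mask, Y, 0.0)
    for _ in range(n_iter):
        filled = np.where(mask, Y, M)                     # keep observed entries, impute the rest
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        M = (U * np.maximum(s - lam, 0.0)) @ Vt           # soft-threshold the singular values
    return M

# toy panel: rank-2 signal with entries missing not-at-random (tied to large signal values)
rng = np.random.default_rng(7)
L = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 40))
Y = L + 0.1 * rng.normal(size=L.shape)
mask = np.ones_like(Y, dtype=bool)
mask[L > np.quantile(L, 0.9)] = False
M_hat = soft_impute(Y, mask, lam=0.5)
print("RMSE on missing entries:", np.sqrt(np.mean((M_hat[~mask] - L[~mask]) ** 2)))
```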
