Statistics and Its Interface

Pathway Lasso: Pathway Estimation and Selection with High-Dimensional Mediators
Zhao Y and Luo X
In many scientific studies, it is increasingly important to delineate pathways through a large number of mediators, such as genetic and brain mediators. Structural equation modeling (SEM) is a popular technique for estimating pathway effects, commonly expressed as products of coefficients. However, fitting such models becomes unstable and computationally challenging with high-dimensional mediators. This paper proposes a sparse mediation model using a regularized SEM approach, where sparsity means that a small number of mediators have a nonzero mediation effect between a treatment and an outcome. To address the model selection challenge, we introduce a new penalty called Pathway Lasso. This penalty function is a convex relaxation of the non-convex product function for the mediation effects, and it enables a computationally tractable optimization criterion for estimating and selecting pathway effects simultaneously. We develop a fast ADMM-type algorithm to compute the model parameters and show that the iterative updates can be expressed in closed form. We also prove the asymptotic consistency of our Pathway Lasso estimator for the mediation effect. On both simulated data and an fMRI data set, the proposed approach yields higher pathway selection accuracy and lower estimation bias than competing methods.
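To make the product-of-coefficients idea concrete, the sketch below estimates mediation pathway effects alpha_j * beta_j from two regression stages and applies simple soft-thresholding as a crude stand-in for the Pathway Lasso penalty; the simulated data, the ridge penalty, and the threshold lam are all illustrative assumptions, and the paper's actual ADMM updates are not reproduced.

```python
# Hedged sketch: product-of-coefficients mediation with soft-thresholding
# as a crude surrogate for penalized pathway selection (not the paper's ADMM).
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50                      # subjects, candidate mediators
Z = rng.normal(size=n)              # treatment
M = np.outer(Z, rng.normal(size=p) * (rng.random(p) < 0.1)) + rng.normal(size=(n, p))
Y = M @ (rng.normal(size=p) * (rng.random(p) < 0.1)) + 0.5 * Z + rng.normal(size=n)

# Stage 1: treatment -> mediator paths (alpha_j), one regression per mediator.
alpha = np.array([np.polyfit(Z, M[:, j], 1)[0] for j in range(p)])
# Stage 2: mediator -> outcome paths (beta_j), joint ridge regression on [M, Z].
X = np.column_stack([M, Z])
beta = np.linalg.solve(X.T @ X + 1.0 * np.eye(p + 1), X.T @ Y)[:p]

# Pathway (mediation) effects are the products alpha_j * beta_j;
# soft-threshold them as a stand-in for penalized selection.
effects = alpha * beta
lam = 0.05
selected = np.sign(effects) * np.maximum(np.abs(effects) - lam, 0.0)
print("nonzero pathways:", np.flatnonzero(selected))
```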
Covariate-adjusted hybrid principal components analysis for region-referenced functional EEG data
Scheffler AW, Dickinson A, DiStefano C, Jeste S and Şentürk D
Electroencephalography (EEG) studies produce region-referenced functional data via EEG signals recorded across scalp electrodes. The high-dimensional data can be used to contrast neurodevelopmental trajectories between diagnostic groups, for example between typically developing (TD) children and children with autism spectrum disorder (ASD). Valid inference requires characterization of the complex EEG dependency structure as well as covariate-dependent heteroscedasticity, such as changes in variation over developmental age. In our motivating study, EEG data are collected on TD and ASD children aged two to twelve years. The peak alpha frequency, a prominent peak in the alpha spectrum, is a biomarker linked to neurodevelopment that shifts as children age. To retain information, we model patterns of alpha spectral variation, rather than just the peak location, regionally across the scalp and chronologically across development. We propose a covariate-adjusted hybrid principal components analysis (CA-HPCA) for EEG data, which utilizes both vector and functional principal components analysis while simultaneously adjusting for covariate-dependent heteroscedasticity. CA-HPCA assumes the covariance process is weakly separable conditional on observed covariates, allowing covariate adjustments to be made on the marginal covariances rather than the full covariance, which leads to stable and computationally efficient estimation. The proposed methodology provides novel insights into neurodevelopmental differences between TD and ASD children.
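The weak-separability assumption can be illustrated by the marginal covariances it licenses. The sketch below, on simulated region-referenced functional data, estimates the regional and functional marginal covariances and their eigenvectors; the covariate adjustment that defines CA-HPCA is omitted, and all dimensions are made up.

```python
# Hedged sketch: marginal covariances under weak separability
# (the covariate-adjustment step of CA-HPCA is not shown).
import numpy as np

rng = np.random.default_rng(1)
n, R, T = 60, 8, 100               # subjects, scalp regions, frequency grid
X = rng.normal(size=(n, R, T))     # region-referenced functional data

Xc = X - X.mean(axis=0)            # center over subjects
# Marginal covariance over regions: average out the functional dimension.
C_region = np.einsum('irt,ist->rs', Xc, Xc) / (n * T)
# Marginal covariance over the functional grid: average out regions.
C_func = np.einsum('irt,irs->ts', Xc, Xc) / (n * R)

# Eigenvectors of the marginals give the hybrid (vector x functional) basis.
v_region = np.linalg.eigh(C_region)[1][:, ::-1]
v_func = np.linalg.eigh(C_func)[1][:, ::-1]
print(v_region.shape, v_func.shape)
```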
When to initiate cancer screening exam?
Wu D
A probability method is developed to decide when to initiate cancer screening for asymptomatic individuals. The probability of incidence is a function of screening sensitivity, time duration in the disease-free state, and sojourn time in the preclinical state; given a person's current age, it is monotonically increasing in time. A unique solution for the first screening time can therefore be found by limiting this probability to a small value, such as 10% or 20%; that is, with 90% or 80% probability, one will not become a clinical incident case before the first exam. After this age is found, we can further estimate the lead time distribution and the probability of over-diagnosis if one were diagnosed with cancer at the first exam. Simulations were carried out under different scenarios, and the method was applied to two heavy-smoker cohorts in the National Lung Screening Trial using low-dose computed tomography. The method is applicable to other kinds of cancer screening. The predictive information can be used by physicians or individuals at risk to make informed decisions on when to initiate screening.
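Since the incidence probability is monotone in time given current age, the first-exam age can be found by simple root-finding. The sketch below inverts a made-up monotone probability curve standing in for the paper's model (which involves the screening sensitivity and the disease-free and sojourn-time distributions).

```python
# Hedged sketch: choosing the first screening age by inverting a monotone
# incidence probability; the curve here is a placeholder, not the paper's model.
import numpy as np
from scipy.optimize import brentq

current_age = 50.0

def incidence_prob(t):
    # Placeholder: monotonically increasing in time since current age.
    return 1.0 - np.exp(-0.02 * (t - current_age))

target = 0.10   # allow at most a 10% chance of clinical incidence before the exam
first_exam_age = brentq(lambda t: incidence_prob(t) - target, current_age, 120.0)
print(f"first exam at age {first_exam_age:.1f}")
```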
Estimation of Preclinical State Onset Age and Sojourn Time for Heavy Smokers in Lung Cancer
Wu D, Rai SN and Seow A
Estimation of three key parameters, the onset age of the preclinical state, the sojourn time, and the screening sensitivity, is critical in cancer screening, since all other quantities are functions of these three. Because sensitivity depends on how long one has been in the preclinical state relative to the total sojourn time, this project uses a novel link function connecting sensitivity with time spent in the preclinical state, together with the likelihood method. Simulations using Markov chain Monte Carlo and maximum likelihood estimation were carried out to estimate the key parameters separately for male and female heavy smokers in the low-dose computed tomography group of the National Lung Screening Trial. Sensitivity for male and female heavy smokers was 0.883 and 0.915, respectively, at the onset of the preclinical state, and increased to 0.972 and 0.981 at the end. The mean age of transition into the preclinical state was 70.94 and 71.15 years for male and female heavy smokers, respectively; 90% of heavy smokers at risk for lung cancer would enter the preclinical state in the age interval (55.7, 85.8) for males and (54.2, 87.7) for females, and the transition peaked around age 69 for both genders. The mean sojourn time in the preclinical state was 1.43 and 1.49 years, and the 99% credible intervals for the sojourn time were (0.21, 2.96) and (0.37, 2.69) years for male and female heavy smokers, respectively. Based on these results, low-dose CT screening should be started at age 55 and ended before age 85 for heavy smokers. This provides important information to policy makers.
On evidence cycles in network meta-analysis
Lin L, Chu H and Hodges JS
As an extension of pairwise meta-analysis of two treatments, network meta-analysis has recently attracted many researchers in evidence-based medicine because it simultaneously synthesizes both direct and indirect evidence from multiple treatments and thus facilitates better decision making. The Bayesian hierarchical model is a popular way to implement network meta-analysis, and it is generally considered more powerful than conventional pairwise meta-analysis, leading to more precise effect estimates with narrower credible intervals. However, the improvement in effect estimates produced by Bayesian network meta-analysis has never been studied theoretically. This article shows that such improvement depends highly on evidence cycles in the treatment network. When all treatment comparisons are assumed to have different heterogeneity variances, a network meta-analysis produces posterior distributions identical to separate pairwise meta-analyses for treatment comparisons that are not contained in any evidence cycle. However, this equivalence does not hold under the commonly used assumption of a common heterogeneity variance for all comparisons. Simulations and a case study are used to illustrate the equivalence of Bayesian network and pairwise meta-analyses in certain networks.
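The graph-theoretic fact underlying the result is easy to check computationally: in an undirected treatment network, a comparison lies in an evidence cycle exactly when its edge is not a bridge. A small sketch with a hypothetical four-treatment network:

```python
# Hedged sketch: an edge (direct comparison) lies in an evidence cycle
# if and only if it is not a bridge of the treatment network.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"),   # cycle A-B-C
                  ("C", "D")])                          # D hangs off the cycle

bridges = set(nx.bridges(G))
for u, v in G.edges:
    in_cycle = (u, v) not in bridges and (v, u) not in bridges
    print(f"{u}-{v}: {'in a cycle' if in_cycle else 'not in any cycle'}")
```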
A Double Regression Method for Graphical Modeling of High-dimensional Nonlinear and Non-Gaussian Data
Liang S and Liang F
Graphical models have long been studied in statistics as a tool for inferring conditional independence relationships among a large set of random variables. Most existing work in graphical modeling focuses on cases where the data are Gaussian or mixed and the variables are linearly dependent. In this paper, we propose a double regression method for learning graphical models under the high-dimensional nonlinear and non-Gaussian setting, and we prove that the proposed method is consistent under mild conditions. The proposed method works by performing a series of nonparametric conditional independence tests. The conditioning set of each test is reduced via a double regression procedure, where a model-free sure independence screening procedure or a sparse deep neural network can be employed. The numerical results indicate that the proposed method works well for high-dimensional nonlinear and non-Gaussian data.
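The flavor of one step of the procedure can be sketched as follows: screen the conditioning set down to a few candidates, then test conditional independence given the screened set. For illustration the sketch uses plain correlation screening and a linear partial-correlation test in place of the paper's model-free screening and nonparametric test.

```python
# Hedged sketch: one conditional independence test with a screened
# conditioning set (linear stand-ins for the paper's nonparametric tools).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 300, 100
X = rng.normal(size=(n, p))
i, j = 0, 1                         # test X_i independent of X_j given the rest

# Screen: keep the 10 variables most correlated with X_i or X_j.
rest = [m for m in range(p) if m not in (i, j)]
corr = np.abs(np.corrcoef(X, rowvar=False))
score = np.maximum(corr[i, rest], corr[j, rest])
S = [rest[m] for m in np.argsort(score)[-10:]]

# Partial correlation of X_i and X_j given the screened set S.
Z = np.column_stack([np.ones(n), X[:, S]])
ri = X[:, i] - Z @ np.linalg.lstsq(Z, X[:, i], rcond=None)[0]
rj = X[:, j] - Z @ np.linalg.lstsq(Z, X[:, j], rcond=None)[0]
r = np.corrcoef(ri, rj)[0, 1]
z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(S) - 3)   # Fisher z
print("p-value:", 2 * stats.norm.sf(abs(z)))
```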
Adaptive Clustering and Feature Selection for Categorical Time Series Using Interpretable Frequency-Domain Features
Bruce SA
This article presents a novel approach to clustering and feature selection for categorical time series via interpretable frequency-domain features. A distance measure is introduced based on the spectral envelope and optimal scalings, which parsimoniously characterize prominent cyclical patterns in categorical time series. Using this distance, partitional clustering algorithms are introduced for accurately clustering categorical time series. These adaptive procedures offer simultaneous feature selection for identifying important features that distinguish clusters and fuzzy membership when time series exhibit similarities to multiple clusters. Clustering consistency of the proposed methods is investigated, and simulation studies are used to demonstrate clustering accuracy with various underlying group structures. The proposed methods are used to cluster sleep stage time series for sleep disorder patients in order to identify particular oscillatory patterns associated with sleep disruption.
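Given the spectral-envelope-based distance matrix, fuzzy memberships follow the standard fuzzy-partition formula u_ic = 1 / sum_j (d_ic / d_ij)^(2/(m-1)). The sketch below computes memberships from an arbitrary distance matrix; the distances and the fuzzifier m are placeholders, and the spectral envelope computation itself is not shown.

```python
# Hedged sketch: fuzzy memberships from a precomputed time-series
# distance matrix (placeholder distances, standard fuzzy-partition formula).
import numpy as np

rng = np.random.default_rng(10)
D = rng.random((12, 3)) + 0.1       # distances from 12 series to 3 cluster medoids
m = 2.0                             # fuzzifier

ratio = D[:, :, None] / D[:, None, :]           # d_ic / d_ij for all pairs
U = 1.0 / np.sum(ratio ** (2 / (m - 1)), axis=2)
print(np.round(U, 2), U.sum(axis=1))            # each row sums to 1
```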
A CD-based mapping method for combining multiple related parameters from heterogeneous intervention trials
Jiao Y, Mun EY, Trikalinos TA and Xie M
Effect size can differ as a function of the elapsed time since treatment or as a function of other key covariates, such as sex or age. In evidence synthesis, a better understanding of the precise conditions under which a treatment does or does not work well is highly valued. With increasingly accessible individual patient or participant data (IPD), more precise and informative inference is within our reach. However, simultaneously combining multiple related parameters across heterogeneous studies is challenging because each parameter from each study has a specific interpretation within the context of the study and the other covariates in the model. This paper proposes a novel mapping method to combine study-specific estimates of multiple related parameters across heterogeneous studies, which ensures valid inference at all inference levels by combining sample-dependent functions known as Confidence Distributions (CDs). We describe the "CD-based mapping method" and provide a data application example for a multivariate random-effects meta-analysis model. We estimated up to 13 study-specific regression parameters for each of 14 individual studies using IPD in the first step, and subsequently combined the study-specific vectors of parameters, yielding a full vector of hyperparameters, in the second step of the meta-analysis. Sensitivity analysis indicated that the CD-based mapping method is robust to model misspecification. This novel approach to multi-parameter synthesis provides a reasonable methodological solution for combining complex evidence using IPD.
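In the simplest special case, normal confidence distributions for a single shared parameter, CD combination reduces to the familiar inverse-variance pooling, as the sketch below illustrates with made-up estimates; the paper's mapping step for heterogeneous multi-parameter vectors is substantially more general.

```python
# Hedged sketch: combining normal confidence distributions for one shared
# parameter; with inverse-variance weights this is the fixed-effect pool.
import numpy as np
from scipy import stats

est = np.array([0.30, 0.45, 0.25])   # study-specific estimates (made up)
se = np.array([0.10, 0.15, 0.12])    # their standard errors

w = 1.0 / se**2
theta = np.sum(w * est) / np.sum(w)
se_c = 1.0 / np.sqrt(np.sum(w))

# The combined CD is H(t) = Phi((t - theta) / se_c); e.g. a 95% interval:
lo, hi = stats.norm.ppf([0.025, 0.975], loc=theta, scale=se_c)
print(f"combined: {theta:.3f} ({lo:.3f}, {hi:.3f})")
```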
Imaging mediation analysis for longitudinal outcomes: a case study of childhood brain tumor survivorship
Li Y, Wang JX, Zhou GC, Conklin HM, Onar-Thomas A, Gajjar A, Reddick WE and Li C
Aggressive cancer treatments that affect the central nervous system are associated with an increased risk of cognitive deficits. As treatment for pediatric brain tumors has become more effective, there has been a heightened focus on improving cognitive outcomes, which can significantly affect the quality of life for pediatric cancer survivors. This paper is motivated by and applied to a clinical trial for medulloblastoma, the most common malignant brain tumor in children. The trial collects comprehensive data including treatment-related clinical information, neuroimaging, and longitudinal neurocognitive outcomes to enhance our understanding of the responses to treatment and the enduring impacts of radiation therapy on the survivors of medulloblastoma. To this end, we have developed a new mediation model tailored for longitudinal outcomes with high-dimensional imaging mediators. Specifically, we adopt a joint binary Ising-Gaussian Markov random field prior distribution to account for the spatial dependency and smoothness of ultra-high-dimensional neuroimaging mediators and to enhance the power to detect informative voxels. Applying the proposed approach, we identify causal pathways and the corresponding white matter microstructures mediating the negative impact of irradiation on neurodevelopment. The results provide guidance on sparing brain regions and improving long-term neurodevelopment for pediatric cancer survivors. Simulation studies also confirm the validity of the proposed method.
Confidence in the treatment decision for an individual patient: strategies for sequential assessment
Orwitz N, Tarpey T and Petkova E
Evolving medical technologies have motivated the development of treatment decision rules (TDRs) that incorporate complex, costly data (e.g., imaging). In clinical practice, we aim for TDRs to be valuable by reducing unnecessary testing while still identifying the best possible treatment for a patient. Regardless of how well any TDR performs in the target population, there is an associated degree of uncertainty about its optimality for a specific patient. In this paper, we aim to quantify, via a confidence measure, the uncertainty in a TDR as patient data from sequential procedures accumulate in real time. We first propose estimating confidence using the distance of a patient's vector of covariates to a treatment decision boundary, with greater distances corresponding to higher certainty. We further propose measuring confidence through the conditional probability of ultimately (with all possible information available) being assigned a particular treatment, given that the same treatment is assigned with the patient's currently available data, or given the treatment recommendation made using only the currently available patient data. As patient data accumulate, the treatment decision is updated and confidence is reassessed until a sufficiently high confidence level is achieved. We present results from simulation studies and illustrate the methods using a motivating example from a depression clinical trial. Recommendations for practical use of the measures are proposed.
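The first (distance-based) measure is straightforward for a linear decision boundary, as in this sketch; the weights, intercept, and covariates are hypothetical.

```python
# Hedged sketch: distance-based confidence for a linear treatment
# decision boundary w'x + b = 0 (illustrative numbers only).
import numpy as np

w = np.array([0.8, -0.5, 0.3])      # hypothetical decision-rule weights
b = -0.2
x = np.array([1.2, 0.4, -0.7])      # patient's currently available covariates

signed = (w @ x + b) / np.linalg.norm(w)   # signed distance to the boundary
treatment = "A" if signed > 0 else "B"
confidence = abs(signed)                   # farther from the boundary, more certain
print(treatment, round(confidence, 3))
```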
Latent Class Proportional Hazards Regression with Heterogeneous Survival Data
Fei T, Hanfelt JJ and Peng L
Heterogeneous survival data are commonly present in chronic disease studies. Delineating meaningful disease subtypes directly linked to a survival outcome can generate useful scientific implications. In this work, we develop a latent class proportional hazards (PH) regression framework to address such an interest. We propose mixture proportional hazards modeling, which flexibly accommodates class-specific covariate effects while allowing the baseline hazard function to vary across latent classes. Adapting the strategy of nonparametric maximum likelihood estimation, we derive an Expectation-Maximization (EM) algorithm to estimate the proposed model. We establish the theoretical properties of the resulting estimators. Extensive simulation studies are conducted, demonstrating satisfactory finite-sample performance of the proposed method as well as the predictive benefit from accounting for the heterogeneity across latent classes. We further illustrate the practical utility of the proposed method through an application to a mild cognitive impairment (MCI) cohort in the Uniform Data Set.
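As a stripped-down analogue of the EM estimation, the sketch below fits a two-class mixture of exponential survival times with no censoring or covariates; the proposed model's class-specific covariate effects and nonparametric baseline hazards are beyond this illustration.

```python
# Hedged sketch: EM for a two-class mixture of exponential hazards
# (no censoring, no covariates; a toy analogue of latent class PH).
import numpy as np

rng = np.random.default_rng(11)
t = np.concatenate([rng.exponential(1.0, 150), rng.exponential(5.0, 150)])

pi, lam = np.array([0.5, 0.5]), np.array([0.5, 2.0])   # initial values
for _ in range(200):
    # E-step: class responsibilities under exponential densities.
    dens = pi * lam * np.exp(-np.outer(t, lam))        # n x 2
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update mixing proportions and class-specific rates.
    pi = resp.mean(axis=0)
    lam = resp.sum(axis=0) / (resp * t[:, None]).sum(axis=0)
print(np.round(pi, 2), np.round(1 / lam, 2))   # class means ~1 and ~5, in some order
```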
Bayesian tensor-on-tensor regression with efficient computation
Wang K and Xu Y
We propose a Bayesian tensor-on-tensor regression approach to predict a multidimensional array (tensor) of arbitrary dimensions from another tensor of arbitrary dimensions, building upon the Tucker decomposition of the regression coefficient tensor. Traditional tensor regression methods making use of the Tucker decomposition either assume the dimension of the core tensor to be known or estimate it via cross-validation or some model selection criteria. However, no existing method can simultaneously estimate the model dimension (the dimension of the core tensor) and other model parameters. To fill this gap, we develop an efficient Markov Chain Monte Carlo (MCMC) algorithm to estimate both the model dimension and parameters for posterior inference. Besides the MCMC sampler, we also develop an ultra-fast optimization-based computing algorithm wherein the maximum a posteriori estimators for parameters are computed, and the model dimension is optimized via a simulated annealing algorithm. The proposed Bayesian framework provides a natural way for uncertainty quantification. Through extensive simulation studies, we evaluate the proposed Bayesian tensor-on-tensor regression model and show its superior performance compared to alternative methods. We also demonstrate its practical effectiveness by applying it to two real-world datasets, including facial imaging data and 3D motion data.
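The Tucker structure itself is compact to write down. The sketch below builds a coefficient tensor from a core and factor matrices and forms tensor-on-tensor predictions; all dimensions, including the core dimensions (r1, r2, s1, s2) that the paper's MCMC and simulated annealing estimate, are fixed arbitrarily here.

```python
# Hedged sketch: a Tucker-structured coefficient tensor linking a predictor
# tensor X (n x p1 x p2) to an outcome tensor Y (n x q1 x q2).
import numpy as np

rng = np.random.default_rng(3)
n, p1, p2, q1, q2 = 50, 6, 5, 4, 3
r1, r2, s1, s2 = 2, 2, 2, 2                      # core dimensions (fixed here)

G = rng.normal(size=(r1, r2, s1, s2))            # core tensor
U1, U2 = rng.normal(size=(p1, r1)), rng.normal(size=(p2, r2))
V1, V2 = rng.normal(size=(q1, s1)), rng.normal(size=(q2, s2))

# Full coefficient tensor B (p1 x p2 x q1 x q2) from its Tucker factors.
B = np.einsum('abcd,pa,qb,uc,vd->pquv', G, U1, U2, V1, V2)
X = rng.normal(size=(n, p1, p2))
Y_hat = np.einsum('npq,pquv->nuv', X, B)         # tensor-on-tensor prediction
print(Y_hat.shape)
```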
Multi-way overlapping clustering by Bayesian tensor decomposition
Wang Z, Zhou F, He K and Ni Y
The development of modern sequencing technologies provides great opportunities to measure gene expression of multiple tissues from different individuals. The three-way variation across genes, tissues, and individuals makes statistical inference a challenging task. In this paper, we propose a Bayesian multi-way clustering approach to cluster genes, tissues, and individuals simultaneously. The proposed model adaptively trichotomizes the observed data into three latent categories and uses a Bayesian hierarchical construction to further decompose the latent variables into lower-dimensional features, which can be interpreted as overlapping clusters. With a Bayesian nonparametric prior, i.e., the Indian buffet process, our method determines the cluster number automatically. The utility of our approach is demonstrated through simulation studies and an application to the Genotype-Tissue Expression (GTEx) RNA-seq data. The clustering result reveals some interesting findings about depression-related genes in the human brain, which are also consistent with biological domain knowledge. The detailed algorithm and some numerical results are available in the online Supplementary Material, http://intlpress.com/site/pub/files/-supp/sii/2024/0017/0002/sii-2024-0017-0002-s001.pdf.
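For readers unfamiliar with the Indian buffet process, the sketch below draws a binary feature-allocation matrix from its standard generative scheme, which is how the number of overlapping clusters is left unbounded a priori; the concentration parameter alpha is arbitrary.

```python
# Hedged sketch: sampling a binary feature (overlapping-cluster) matrix
# from the Indian buffet process prior.
import numpy as np

def sample_ibp(n_customers, alpha, rng):
    dishes = []                       # how many customers took each dish
    Z = []
    for i in range(1, n_customers + 1):
        row = [rng.random() < c / i for c in dishes]     # sample existing dishes
        dishes = [c + t for c, t in zip(dishes, row)]
        new = rng.poisson(alpha / i)                     # try new dishes
        dishes += [1] * new
        Z.append(row + [True] * new)
    K = len(dishes)
    return np.array([r + [False] * (K - len(r)) for r in Z])

Z = sample_ibp(10, alpha=2.0, rng=np.random.default_rng(4))
print(Z.astype(int))
```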
Extracting scalar measures from functional data with applications to placebo response
Tarpey T, Petkova E, Ciarleglio A and Ogden RT
In controlled and observational studies, outcome measures are often observed longitudinally. Such data are difficult to compare among units directly because there is no natural ordering of curves. This is relevant not only in clinical trials, where typically the goal is to evaluate the relative efficacy of treatments on average, but also in the growing and increasingly important area of personalized medicine, where treatment decisions are optimized with respect to a relevant patient outcome. In personalized medicine, there are no methods for optimizing treatment decision rules using longitudinal outcomes, e.g., symptom trajectories, because of the lack of a natural ordering of curves. A typical practice is to summarize the longitudinal response by a scalar outcome that can then be compared across patients, treatments, etc. We describe some of the summaries that are in common use, especially in clinical trials. We consider a general summary measure (weighted average tangent slope) with weights that can be chosen to optimize specific inference depending on the application. We illustrate the methodology on a study of depression treatment, in which it is difficult to separate placebo effects from the specific effects of the antidepressant. We argue that this approach provides a better summary for estimating the benefits of an active treatment than traditional non-weighted averages.
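The weighted average tangent slope has a simple computational form, the ratio of the integral of w(t)f'(t) to the integral of w(t). The sketch below evaluates it for one hypothetical smoothed symptom trajectory with a weight function that up-weights later visits (one choice among many).

```python
# Hedged sketch: weighted-average-tangent-slope summary for one subject's
# smoothed trajectory f(t), with an illustrative weight function w(t).
import numpy as np
from scipy.integrate import trapezoid

t = np.linspace(0, 8, 200)                    # weeks in trial
f = 20 - 10 * (1 - np.exp(-0.5 * t))          # smoothed symptom curve (made up)
fprime = np.gradient(f, t)                    # tangent slopes

w = t / t.max()                               # up-weight later weeks (a choice)
summary = trapezoid(w * fprime, t) / trapezoid(w, t)
print(f"weighted average slope: {summary:.3f} points/week")
```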
Smooth online parameter estimation for time varying VAR models with application to rat local field potential activity data
El Yaagoubi Bourakna A, Pinto M, Fortin N and Ombao H
Multivariate time series data often appear as realizations of non-stationary processes whose covariance matrix or spectral matrix evolves smoothly over time. Most current approaches estimate the time-varying spectral properties only retrospectively, that is, after the entire time series has been observed. Retrospective estimation is a major limitation in many adaptive control applications, where it is important to estimate these properties and detect changes in the system as they happen, in real time. To overcome this limitation, we develop an online estimation procedure that gives a real-time update of the time-varying parameters as new observations arrive. One approach to modeling non-stationary time series is to fit time-varying vector autoregressive models (tv-VAR). However, one major obstacle in online estimation of such models is the computational cost due to the high dimensionality of the parameters. Existing methods such as the Kalman filter or local least squares are feasible in principle, but they are not always suitable because they provide noisy estimates and can become prohibitively costly as the dimension of the time series increases. In our brain signal application, it is critical to develop a robust method that can estimate, in real time, the properties of the underlying stochastic process, in particular the spectral brain connectivity measures. For these reasons, we propose a new smooth online parameter estimation approach (SOPE) that can control the smoothness of the estimates at reasonable computational complexity. Consequently, the models are fit in real time even for high-dimensional time series. We demonstrate that the proposed SOPE approach is as good as the Kalman filter in terms of mean-squared error for small dimensions. However, unlike the Kalman filter, SOPE has lower computational cost and hence scales to higher dimensions. Finally, we apply the SOPE method to local field potential activity data from the hippocampus of a rat performing an odor sequence memory task. As demonstrated in the accompanying video, the proposed SOPE method is able to capture the dynamics of the connectivity as the rat samples the different odor stimuli.
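To convey the online-update idea, the sketch below fits a tv-VAR(1) by recursive least squares with a forgetting factor serving as the smoothness control; this is a generic stand-in, not the SOPE updates themselves.

```python
# Hedged sketch: online tv-VAR(1) via recursive least squares with a
# forgetting factor (a generic analogue of smooth online estimation).
import numpy as np

rng = np.random.default_rng(5)
d, lam = 3, 0.98                    # series dimension, forgetting factor
A = np.zeros((d, d))                # time-varying VAR coefficient estimate
P = 1e3 * np.eye(d)                 # inverse "information" matrix

x_prev = rng.normal(size=d)
for t in range(500):
    x = 0.5 * x_prev + 0.1 * rng.normal(size=d)   # streaming observation
    # RLS update: one gain vector shared across the d regression rows.
    k = P @ x_prev / (lam + x_prev @ P @ x_prev)
    A += np.outer(x - A @ x_prev, k)
    P = (P - np.outer(k, x_prev @ P)) / lam
    x_prev = x
print(np.round(A, 2))               # should be near 0.5 * identity
```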
Statistical Methods for Quantifying Between-study Heterogeneity in Meta-analysis with Focus on Rare Binary Events
Zhang C, Chen M and Wang X
Meta-analysis, the statistical procedure for combining results from multiple independent studies, has been widely used in medical research to evaluate intervention efficacy and drug safety. In many practical situations, treatment effects vary notably among the collected studies, and this variation, often modeled by the between-study variance parameter τ², can greatly affect the inference of the overall effect size. In the past, comparative studies have been conducted for both point and interval estimation of τ². However, most are incomplete, including only a limited subset of existing methods, and some are outdated. Further, none of these studies covers descriptive measures for assessing the level of heterogeneity, nor do they focus on rare binary events, which require special attention. We summarize the most comprehensive set to date, comprising 11 descriptive measures, 23 estimators, and 16 confidence intervals. In addition to providing synthesized information, we further categorize these methods according to their key features. We then evaluate their performance based on simulation studies that examine various realistic scenarios for rare binary events, with an illustration using a data example of a gestational diabetes meta-analysis. We conclude that there is no uniformly "best" method. However, methods with consistently better performance do exist in the context of rare binary events, and we provide practical guidelines based on numerical evidence.
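Among the many estimators compared, the DerSimonian-Laird moment estimator is the most familiar; the sketch below computes it on made-up study effects (the rare-event corrections the paper focuses on are omitted).

```python
# Hedged sketch: the DerSimonian-Laird moment estimator of tau^2,
# one of the estimators the paper compares (illustrative data).
import numpy as np

y = np.array([0.42, 0.11, 0.63, 0.28, 0.50])   # study log odds ratios
v = np.array([0.04, 0.09, 0.05, 0.12, 0.07])   # within-study variances

w = 1.0 / v
y_fe = np.sum(w * y) / np.sum(w)                # fixed-effect pooled estimate
Q = np.sum(w * (y - y_fe) ** 2)                 # Cochran's Q
k = len(y)
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
print(f"tau^2 = {tau2:.4f}")
```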
Estimating individualized treatment rules for multicategory type 2 diabetes treatments using electronic health records
Lou J, Wang Y, Li L and Zeng D
In this article, we propose a general framework to learn optimal treatment rules for type 2 diabetes (T2D) patients using electronic health records (EHRs). We first propose a joint modeling approach to characterize patients' pretreatment conditions using longitudinal markers from EHRs. The estimation accounts for informative measurement times using inverse-intensity weighting methods. The predicted latent processes in the joint model are used to divide patients into a finite number of subgroups such that, within each subgroup, patients share similar health profiles in the EHRs. Within each patient group, we estimate optimal individualized treatment rules by extending a matched learning method to handle multicategory treatments using a one-versus-one approach. Each pairwise matched learning problem is implemented by a weighted support vector machine with matched pairs of patients. We apply our method to estimate optimal treatment rules for T2D patients in a large sample of EHRs from the Ohio State University Wexner Medical Center. We demonstrate the utility of our method in selecting the optimal treatments from four classes of drugs and achieving better control of glycated hemoglobin than any one-size-fits-all rule.
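The one-versus-one construction can be sketched as follows: fit one weighted SVM per treatment pair and recommend by majority vote. The matching weights and data below are placeholders for the matched-pair construction in the paper's matched learning step.

```python
# Hedged sketch: one-versus-one multicategory treatment rules with a
# weighted linear SVM per pair (placeholder data and matching weights).
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 5))                   # latent health profiles
trt = rng.integers(0, 4, size=400)              # four drug classes
wts = rng.random(400) + 0.5                     # placeholder matching weights

votes = np.zeros((len(X), 4))
for a, b in combinations(range(4), 2):
    idx = np.flatnonzero((trt == a) | (trt == b))
    clf = LinearSVC().fit(X[idx], trt[idx], sample_weight=wts[idx])
    pred = clf.predict(X)
    for c in (a, b):
        votes[:, c] += pred == c

recommended = votes.argmax(axis=1)              # majority vote across pairs
print(np.bincount(recommended))
```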
Bayesian flexible hierarchical skew heavy-tailed multivariate meta regression models for individual patient data with applications
Kim S, Chen MH, Ibrahim J, Shah A and Lin J
A flexible class of multivariate meta-regression models is proposed for Individual Patient Data (IPD). The methodology is motivated by 26 pivotal Merck clinical trials that compare statins (cholesterol-lowering drugs) in combination with ezetimibe against statins alone, in treatment-naïve patients and in those continuing on statins at baseline. The research goal is to jointly analyze the multivariate outcomes Low Density Lipoprotein Cholesterol (LDL-C), High Density Lipoprotein Cholesterol (HDL-C), and Triglycerides (TG). These three continuous outcome measures are correlated and together shed much light on a subject's lipid status. The proposed multivariate meta-regression models allow for different skewness parameters and different degrees of freedom for the multivariate outcomes from different trials under a general class of skew-t distributions. The theoretical properties of the proposed models are examined, and an efficient Markov chain Monte Carlo (MCMC) sampling algorithm is developed for carrying out Bayesian inference under the proposed multivariate meta-regression model. In addition, the Conditional Predictive Ordinates (CPOs) are computed via an efficient Monte Carlo method. Consequently, the logarithm of the pseudo-marginal likelihood and Bayesian residuals are obtained for model comparison and assessment, respectively. A detailed analysis of the IPD meta-data from the 26 Merck clinical trials is carried out to demonstrate the usefulness of the proposed methodology.
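The CPO/LPML computation uses the standard harmonic-mean identity CPO_i = (mean over draws of 1/L(y_i | theta^(m)))^(-1). The sketch below applies it to a toy normal model with stand-in posterior draws, rather than the paper's skew heavy-tailed meta-regression.

```python
# Hedged sketch: CPO and LPML from MCMC output via the harmonic-mean
# identity, on a toy normal model (not the paper's skew-t meta-regression).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
y = rng.normal(1.0, 1.0, size=40)               # data
mu_draws = rng.normal(y.mean(), 0.2, size=2000) # stand-in posterior draws

# Likelihood of each observation under each posterior draw.
L = stats.norm.pdf(y[None, :], loc=mu_draws[:, None], scale=1.0)
cpo = 1.0 / np.mean(1.0 / L, axis=0)
lpml = np.sum(np.log(cpo))
print(f"LPML = {lpml:.2f}")
```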
The more data, the better? Demystifying deletion-based methods in linear regression with missing data
Xu T, Chen K and Li G
We compare two deletion-based methods for dealing with missing observations in linear regression analysis. One is the complete-case analysis (CC, or listwise deletion), which discards all incomplete observations and uses only the fully observed samples for ordinary least-squares estimation. The other is the available-case analysis (AC, or pairwise deletion), which utilizes all available data to estimate the covariance matrices and uses these matrices to construct the normal equations. We show that the estimates from both methods are asymptotically unbiased under missing completely at random (MCAR) and further compare their asymptotic variances in some typical situations. Surprisingly, using more data (i.e., AC) does not necessarily lead to better asymptotic efficiency in many scenarios. Missing patterns, the covariance structure, and the true regression coefficient values all play a role in determining which method is better. We further conduct simulation studies to corroborate these findings and demystify what has been missed or misinterpreted in the literature. Detailed proofs and simulation results are available in the online supplemental materials.
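The two estimators are easy to contrast in code: CC regresses on fully observed rows, while AC solves normal equations built from pairwise-complete covariances. A sketch under MCAR with simulated data:

```python
# Hedged sketch: complete-case versus available-case OLS under MCAR.
import numpy as np

rng = np.random.default_rng(8)
n = 1000
X = rng.normal(size=(n, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=n)
X[rng.random((n, 2)) < 0.3] = np.nan            # MCAR missingness in X

D = np.column_stack([X, y])
# Complete-case: keep only rows with no missing entries.
cc = D[~np.isnan(D).any(axis=1)]
beta_cc = np.linalg.lstsq(cc[:, :2], cc[:, 2], rcond=None)[0]

# Available-case: each (co)variance from the pairs observed for it.
m = np.ma.masked_invalid(D)
C = np.ma.cov(m, rowvar=False).data             # pairwise-deletion covariance
beta_ac = np.linalg.solve(C[:2, :2], C[:2, 2])  # normal equations from C
print(beta_cc, beta_ac)                         # both near (1, -2)
```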
Bayesian Meta-Regression Model Using Heavy-Tailed Random-effects with Missing Sample Sizes for Self-thinning Meta-data
Ma Z, Chen MH and Tang Y
Motivated by self-thinning meta-data, a random-effects meta-analysis model with unknown precision parameters is proposed, with a truncated Poisson regression model for the missing sample sizes. The random effects are assumed to follow a heavy-tailed distribution to accommodate outlying aggregate values in the response variable. The logarithm of the pseudo-marginal likelihood (LPML) is used for model comparison. In addition, to determine which self-thinning law is better supported by the meta-data, a measure called the "Plausibility Index (PI)" is developed. A simulation study is conducted to examine the empirical performance of the proposed methodology. Finally, the proposed model and the PI measure are applied to analyze a self-thinning meta-data set in detail.
Variable selection for doubly robust causal inference
Cho E and Yang S
Confounding control is crucial and yet challenging for causal inference based on observational studies. Under the typical unconfoundness assumption, augmented inverse probability weighting (AIPW) has been popular for estimating the average causal effect (ACE) due to its double robustness in the sense it relies on either the propensity score model or the outcome mean model to be correctly specified. To ensure the key assumption holds, the effort is often made to collect a sufficiently rich set of pretreatment variables, rendering variable selection imperative. It is well known that variable selection for the propensity score targeted for accurate prediction may produce a variable ACE estimator by including the instrument variables. Thus, many recent works recommend selecting all outcome predictors for both confounding control and efficient estimation. This article shows that the AIPW estimator with variable selection targeted for efficient estimation may lose the desirable double robustness property. Instead, we propose controlling the propensity score model for any covariate that is a predictor of either the treatment or the outcome or both, which preserves the double robustness of the AIPW estimator. Using this principle, we propose a two-stage procedure with penalization for variable selection and the AIPW estimator for estimation. We show the proposed procedure benefits from the desirable double robustness property. We evaluate the finite-sample performance of the AIPW estimator with various variable selection criteria through simulation and an application.