Bayesian variable selection for logistic regression with a differentially misclassified binary covariate
A Bayesian approach for variable selection is developed for use in models with a misclassified binary predictor variable. We define the main outcome model containing the latent predictor, the measurement model associated with the prevalence of the predictor, and the sensitivity and specificity models of the fallible classifier conditioned on the true value of the predictor. We use binary indicator variables to execute the Gibbs sampler-based variable selection process, and we identify the highest posterior probability model given the data. We demonstrate the performance of the procedure in several simulation studies, and we utilize the selection method to optimize model performance in two datasets.
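As a small illustration of the final step, the highest posterior probability model can be read directly off the sampled inclusion indicators. The sketch below is a minimal base-R example assuming a hypothetical matrix gamma_draws of 0/1 indicator draws (one row per Gibbs iteration, one column per candidate predictor); it is not the authors' full sampler.

  # gamma_draws: hypothetical matrix of sampled inclusion indicators (iterations x predictors)
  model_labels <- apply(gamma_draws, 1, paste, collapse = "")
  post_probs   <- sort(table(model_labels) / nrow(gamma_draws), decreasing = TRUE)
  head(post_probs)  # the first entry is the highest posterior probability model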
Statistical methods for assessing treatment effects on ordinal outcomes using observational data
Many statistical methods have been developed to estimate the average treatment effect (ATE) when the outcome is continuous or binary, but the methodology for assessing treatment effects on ordinal outcomes is less well studied. In this article, we propose a marginal structural ordinal logistic regression model (MS-OLRM) to assess treatment effects on ordinal outcomes. Specifically, we propose utilizing a superiority score as a measure of treatment effect, assessing whether the outcome under treatment is stochastically larger than the outcome under control. Our approach employs the MS-OLRM in conjunction with Inverse Probability of Treatment Weighting (IPTW) to estimate the superiority score under treatment compared to control. The IPTW adjustment accounts for confounding between treatment and outcome by balancing all covariates across treatment groups in the weighted sample. To assess the performance of the proposed method, we conduct extensive simulation studies. Finally, we apply the developed method to assess the effects of medications and behavioral therapies on patients' recovery from alcohol use disorders using the Kentucky Medicaid 2012-2019 database.
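A minimal sketch of the estimation route described above, assuming a hypothetical data frame dat with an ordered-factor outcome y, binary treatment trt, and confounders x1 and x2; the superiority score would then be computed from the fitted cumulative probabilities (not shown). This is an illustration of IPTW combined with a weighted proportional-odds fit, not the authors' exact implementation.

  library(MASS)  # polr() for the proportional-odds (ordinal logistic) model

  ps_fit <- glm(trt ~ x1 + x2, family = binomial, data = dat)   # propensity score model
  ps     <- fitted(ps_fit)
  dat$w  <- ifelse(dat$trt == 1, 1 / ps, 1 / (1 - ps))          # IPTW weights

  msm_fit <- polr(y ~ trt, data = dat, weights = w, Hess = TRUE)  # weighted marginal structural fit
  summary(msm_fit)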
Sampling Spiked Wishart Eigenvalues
Efficient schemes for sampling from the eigenvalues of the Wishart distribution have recently been described for both the standard Wishart case (where the covariance matrix is the identity) and the spiked Wishart with a single spike (where the covariance matrix differs from the identity in a single entry on the diagonal). Here, we generalize these schemes to the spiked Wishart with an arbitrary number of spikes. This approach also applies to the spiked pseudo-Wishart distribution. We describe how to differentiate this procedure for the purposes of stochastic gradient descent, allowing the fitting of the eigenvalue distribution to some target distribution.
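For readers who want a reference point, the sketch below draws spiked Wishart eigenvalues by brute force, forming the full matrix and decomposing it; it is not the efficient scheme developed here, and the dimensions and spike values are arbitrary illustrative choices.

  p <- 5; n <- 20
  spikes <- c(10, 4)                                    # covariance differs from identity in two diagonal entries
  Sigma  <- diag(c(spikes, rep(1, p - length(spikes))))
  X      <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)  # rows are N(0, Sigma)
  W      <- crossprod(X)                                # W ~ Wishart_p(n, Sigma)
  eigen(W, symmetric = TRUE, only.values = TRUE)$values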
Automated Parameter Selection in Singular Spectrum Analysis for Time Series Analysis
In spite of the wide applications of the singular spectrum analysis (SSA) method, understanding how SSA reconstructs time series and eliminates noise remains challenging because of its complex process. This study provides a novel geometric perspective to elucidate the underlying mechanism of SSA. To address a key limitation of conventional SSA, which requires a fixed window length and a given threshold for determining the number of groups, we propose a sequential reconstruction approach that averages reconstructed series from various window lengths, with a stopping rule based on a symmetric test. Three main advantages of the proposed method are demonstrated by simulations and a real data analysis of 7-day heart rate data from an e-cigarette user: 1) it requires no prior knowledge of the window length or group number; 2) it yields smaller root mean square error (RMSE) than conventional SSA; and 3) it reveals both local features and sudden changes related to events of interest. While conventional SSA excels at extracting stable signal structures, the proposed method is tailored for time series with varying structures, such as heart rate data from smartwatches, and thus will have even wider applications.
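The single-window building block that the proposed method averages over is sketched below (embedding, truncated SVD, diagonal averaging). The series x, window length L, and group count r are illustrative inputs; the sequential averaging over window lengths and the stopping rule are not shown.

  ssa_reconstruct <- function(x, L, r) {
    N <- length(x); K <- N - L + 1
    X <- sapply(1:K, function(k) x[k:(k + L - 1)])     # L x K trajectory (Hankel) matrix
    s <- svd(X)
    Xr <- s$u[, 1:r, drop = FALSE] %*% diag(s$d[1:r], r) %*% t(s$v[, 1:r, drop = FALSE])
    rec <- numeric(N); cnt <- numeric(N)               # diagonal (Hankel) averaging back to length N
    for (k in 1:K) {
      idx <- k:(k + L - 1)
      rec[idx] <- rec[idx] + Xr[, k]
      cnt[idx] <- cnt[idx] + 1
    }
    rec / cnt
  }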
BayCAR: A Bayesian based Covariate-Adaptive Randomization method for multi-arm trials
Randomization is an essential component of a successful controlled clinical trial. Many randomization methods have been developed to balance the distributions of covariates across treatment arms and thereby remove potential confounding effects. However, restricted randomization methods do not work well when the number of covariates is large, and minimization methods still need stronger theoretical justification. We propose a Bayesian covariate-adaptive randomization method that not only has a meaningful interpretation of its adaptive randomization probability, but also achieves desirable marginal and overall balance for both categorical and continuous covariates, particularly when a large number of covariates must be balanced.
Likelihood-Based Inference for Semi-Parametric Transformation Cure Models with Interval Censored Data
A simple yet effective way of modeling survival data with a cure fraction is the Box-Cox transformation cure model (BCTM), which unifies the mixture and promotion time cure models. In this article, we numerically study the statistical properties of the BCTM when applied to interval-censored data. Times-to-event for susceptible subjects are modeled through a proportional hazards structure that allows for non-homogeneity across subjects, where the baseline hazard function is estimated by a distribution-free piecewise linear function with varying degrees of non-parametricity. Because the cure statuses of right-censored subjects are missing, maximum likelihood estimates of the model parameters are obtained through an expectation-maximization (EM) algorithm. Under the EM framework, the conditional expectation of the complete-data log-likelihood function is maximized over all parameters (including the Box-Cox transformation parameter) simultaneously, in contrast to the conventional profile-likelihood technique of estimating the transformation parameter. The robustness and accuracy of the model and estimation method are established through a detailed simulation study under various parameter settings and through an analysis of real-life data obtained from a smoking cessation study.
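For reference, a commonly used form of the BCTM links the population survival function to the transformation parameter; writing the transformation parameter generically as $\alpha$ and the proper baseline distribution as $F(t)$ (this is the standard unified formulation, not necessarily the article's exact parameterization):

  $S_{\mathrm{pop}}(t \mid x) = \{1 - \alpha\,\theta(x)\,F(t)\}^{1/\alpha}, \quad 0 < \alpha \le 1; \qquad S_{\mathrm{pop}}(t \mid x) = \exp\{-\theta(x)\,F(t)\}, \quad \alpha = 0,$

so that $\alpha = 1$ recovers the mixture cure model and $\alpha = 0$ the promotion time cure model.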
A New Cure Rate Model with Discrete and Multiple Exposures
Cure rate models are mostly used to study data arising from cancer clinical trials; their use in the context of infectious diseases has not been explored well. In 2008, Tournoud and Ecochard first proposed a mechanistic formulation of the cure rate model for infectious diseases with multiple exposures to infection. However, they assumed a simple Poisson distribution for the unobserved pathogens at each exposure time. In this paper, we propose a new cure rate model for infectious diseases with discrete multiple exposures to infection. Our formulation captures both over-dispersion and under-dispersion in the pathogen count at each time of exposure. We also propose a new estimation method based on the expectation-maximization algorithm to calculate the maximum likelihood estimates of the model parameters. We carry out a detailed Monte Carlo simulation study to demonstrate the performance of the proposed model and estimation algorithm. The flexibility of the proposed model also allows us to carry out model discrimination, for which we use both the likelihood ratio test and information-based criteria. Finally, we illustrate the proposed model using recently collected data on COVID-19.
Using biomarkers to allocate patients in a response-adaptive clinical trial
In this paper, we discuss a response-adaptive randomization method and why it should be used, rather than a randomized controlled trial with equal fixed randomization, in clinical trials for rare diseases. The developed method uses a patient's biomarkers to alter the allocation probability for each treatment, in order to emphasize the benefit to the trial population. The method starts with an initial burn-in period in which a small number of patients are allocated to each treatment with equal probability. A regression method then uses the next patient's biomarkers and the information from the previous patients to predict which treatment gives that patient the best outcome, and this estimated best treatment is assigned with high probability. A completed clinical trial of the effect of catumaxomab on the survival of cancer patients is used as an example to demonstrate the method and its differences from a controlled trial with equal allocation. Different regression procedures are investigated and compared with a randomized controlled trial using efficacy and ethical measures.
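A minimal sketch of the allocation rule described above, for two arms labelled A and B. The per-arm regressions are taken to be ordinary linear models and the allocation probability for the estimated best treatment is set to 0.9; both are illustrative choices rather than the paper's settings, and the data layout is hypothetical.

  allocate_next <- function(history, new_biomarkers, p_best = 0.9) {
    # history: data frame with outcome y, arm trt ("A"/"B"), and the biomarker columns
    # new_biomarkers: one-row data frame with the next patient's biomarkers
    vars <- c("y", names(new_biomarkers))
    fitA <- lm(y ~ ., data = subset(history, trt == "A")[, vars])
    fitB <- lm(y ~ ., data = subset(history, trt == "B")[, vars])
    best <- if (predict(fitA, new_biomarkers) >= predict(fitB, new_biomarkers)) "A" else "B"
    sample(c(best, setdiff(c("A", "B"), best)), 1, prob = c(p_best, 1 - p_best))
  }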
Robust Estimation of Heterogeneous Treatment Effects: An Algorithm-based Approach
Heterogeneous treatment effect estimation is an essential element of tailoring treatment to the characteristics of individual patients. Most existing methods are not sufficiently robust against data irregularities. To enhance their robustness, we recently put forward a general estimating equation that unifies many existing learners, but the performance of model-based learners still depends heavily on the correctness of the underlying treatment effect model. This paper addresses this vulnerability by converting treatment effect estimation into a weighted supervised learning problem. We combine the general estimating equation with supervised learning algorithms, such as the gradient boosting machine, random forest, and artificial neural network, with appropriate modifications. This extension retains the estimators' robustness while enhancing their flexibility and scalability. Simulations show that the algorithm-based estimation methods outperform their model-based counterparts in the presence of nonlinearity and non-additivity. We have developed an R package that provides public access to the proposed methods. To illustrate the methods, we present a real data example comparing the blood pressure-lowering effects of two classes of antihypertensive agents.
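One well-known way to cast treatment effect estimation as a (transformed-outcome) supervised learning problem is sketched below with a random forest; it illustrates the general idea rather than the authors' specific estimating equation. The data frame dat with binary treatment A, outcome Y, and covariate columns is hypothetical.

  library(randomForest)   # any flexible supervised learner could be substituted

  covs  <- setdiff(names(dat), c("A", "Y"))
  e_hat <- fitted(glm(reformulate(covs, "A"), family = binomial, data = dat))   # propensity scores
  z     <- dat$Y * (dat$A - e_hat) / (e_hat * (1 - e_hat))   # E[z | X] = tau(X) under unconfoundedness
  tau_fit <- randomForest(x = dat[, covs], y = z)            # regress transformed outcome on covariates
  tau_hat <- predict(tau_fit, newdata = dat[, covs])         # estimated individual treatment effects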
Causal inference with a mediated proportional hazards regression model
The natural direct and indirect effects in causal mediation analysis with survival data and a single mediator were addressed by VanderWeele (2011) [1], who derived an approach for (1) an accelerated failure time regression model in general cases and (2) a proportional hazards regression model when the time-to-event outcome is rare. When the outcome is not rare, VanderWeele (2011) [1] did not derive a simple closed-form expression for the log natural direct and log natural indirect effects under the proportional hazards regression model, because the baseline cumulative hazard function does not approach zero. We develop two approaches that extend VanderWeele's approach and do not require the rare-outcome assumption. We obtain the natural direct and indirect effects at specific time points through numerical integration after calculating the cumulative baseline hazard by (1) applying the Breslow method in the Cox proportional hazards regression model to estimate the unspecified cumulative baseline hazard, or (2) assuming a piecewise constant baseline hazard model, yielding a parametric model, to estimate the baseline hazard and cumulative baseline hazard. We conduct simulation studies to compare our two approaches with other methods and illustrate them by applying them to data from the ASsessment, Serial Evaluation, and Subsequent Sequelae in Acute Kidney Injury (ASSESS-AKI) Consortium.
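A sketch of the first route, Breslow-type estimation of the cumulative baseline hazard from a Cox fit using the survival package; the variable names are illustrative, and the subsequent numerical integration over this estimate is not shown.

  library(survival)

  fit <- coxph(Surv(time, status) ~ exposure + mediator + confounder,
               data = dat, ties = "breslow")
  H0  <- basehaz(fit, centered = FALSE)   # columns 'hazard' (cumulative baseline hazard) and 'time'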
A weighted Jackknife approach utilizing linear model-based estimators for clustered data
A small number of clusters combined with cluster-level heterogeneity poses a great challenge for data analysis. We previously published a weighted Jackknife approach that addresses this issue using weighted cluster means as the basic estimators. The current study proposes a new version of the weighted delete-one-cluster Jackknife analytic framework that employs Ordinary Least Squares or Generalized Least Squares estimators as its building blocks. Algorithms for computing the estimated variances of the study estimators are also derived. Wald test statistics can then be obtained, and the statistical comparison of the outcome means of two conditions is carried out using a cluster permutation procedure. Simulation studies show that the proposed framework produces estimates with higher precision and improved power for statistical hypothesis testing compared with other methods.
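For orientation, the unweighted delete-one-cluster Jackknife for a single OLS coefficient is sketched below; the proposed framework adds the cluster weights, the GLS option, and the permutation-based comparison. The data frame dat with columns y, x, and cluster is hypothetical.

  full_beta <- coef(lm(y ~ x, data = dat))["x"]
  clusters  <- unique(dat$cluster)
  loo_beta  <- sapply(clusters, function(g) coef(lm(y ~ x, data = subset(dat, cluster != g)))["x"])
  G         <- length(clusters)
  jack_var  <- (G - 1) / G * sum((loo_beta - mean(loo_beta))^2)   # delete-one-cluster jackknife variance
  c(estimate = full_beta, jackknife_se = sqrt(jack_var))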
Robust RNA-seq data analysis using an integrated method of ROC curve and Kolmogorov-Smirnov test
Dichotomizing a continuous biomarker is a common approach in clinical settings for the convenience of application. Analytically, results based on a dichotomized biomarker are often more reliable and more resistant to outliers, bimodal distributions, and other unknown distributions. Two methods are commonly used to select the best cut-off value for dichotomizing a continuous biomarker: the maximally selected chi-square statistic and the ROC curve, specifically the Youden Index. In this paper, we explain why, in many situations, it is inappropriate to use the former. Using the Maximum Absolute Youden Index (MAYI), we demonstrate that integrating the MAYI with the Kolmogorov-Smirnov test is not only a robust non-parametric method but also provides a more meaningful p-value for selecting the cut-off value than a Mann-Whitney test. In addition, our method can be applied directly in clinical settings.
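Because the maximum absolute Youden index over all cut-offs coincides with the two-sample Kolmogorov-Smirnov statistic, the cut-off search and the accompanying test can be sketched in a few lines; x (marker values) and d (binary group indicator) below are hypothetical names.

  cuts   <- sort(unique(x))
  youden <- sapply(cuts, function(cc) {
    sens <- mean(x[d == 1] > cc)    # sensitivity at cut-off cc
    spec <- mean(x[d == 0] <= cc)   # specificity at cut-off cc
    abs(sens + spec - 1)
  })
  best_cut <- cuts[which.max(youden)]
  ks.test(x[d == 1], x[d == 0])     # supplies the p-value reported alongside the selected cut-off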
Sensitivity analysis for assumptions of general mediation analysis
Mediation analysis is widely used to identify significant mediators and to estimate the mediation (direct and indirect) effects in causal pathways between an exposure variable and a response variable. In mediation analysis, the mediation effect refers to the effect transmitted by a mediator intervening in the relationship between an exposure variable and a response variable. Traditional mediation analysis methods, such as the difference-in-coefficients method, the product-of-coefficients method, and the counterfactual framework method, all require several key assumptions, so the estimation of mediation effects can be biased when one or more assumptions are violated. In addition to the traditional methods, Yu et al. proposed a general mediation analysis method that can use general predictive models to estimate mediation effects for any type of exposure variable(s), mediator(s), and outcome(s). However, whether this method relies on the assumptions required by the traditional mediation analysis methods is unknown. In this paper, we perform a series of simulation studies to investigate the impact of violating these assumptions on the estimation of mediation effects with Yu et al.'s mediation analysis method, using their R package for all estimations. We find that three assumptions of the traditional mediation analysis methods are also essential for Yu et al.'s method. This paper provides a pipeline for using simulations to evaluate the impact of the assumptions underlying general mediation analysis.
PICBayes: Bayesian proportional hazards models for partly interval-censored data
Partly interval-censored (PIC) data arise frequently in medical studies of diseases that require periodic examinations for symptoms of interest, such as progression-free survival and relapse-free survival. The proportional hazards (PH) model is the most widely used model in survival analysis. This paper introduces our new R package PICBayes, which implements a set of functions for fitting the PH model to partly interval-censored data of different complexities under a Bayesian semiparametric framework. The main function PICBayes fits: (1) the PH model to PIC data; (2) the PH model with spatial frailty to areally-referenced PIC data; (3) the PH model with one random intercept to clustered PIC data; (4) the PH model with one random intercept and one random effect to clustered PIC data; and (5) a general mixed-effects PH model to clustered PIC data. We also include the corresponding functions for general interval-censored data. A random intercept/random effect can follow either a normal prior or a Dirichlet process mixture prior. The use of the package is illustrated through the analysis of two real data sets.
The Role of Weighting Adjustment for Attrition in Longitudinal Trajectory Modeling: A Simulation Study
Most longitudinal surveys construct and release wave-specific weights to adjust for attrition. However, there is no clear consensus in the literature on whether or how to apply these weights in longitudinal trajectory modeling. We present a simulation study, motivated by a real-life longitudinal study of substance use, that considers different missing data mechanisms, weight construction processes, and specifications of the substantive models of interest. Based on the results of the simulation study, we provide practical recommendations for analysts of longitudinal survey data regarding the weighting approaches that should be considered in alternative scenarios.
Diagnostics for a two-stage joint survival model
A two-stage joint survival model is used to analyse time-to-event outcomes that may be associated with biomarkers collected repeatedly over time. Such a model has limited model-checking tools and is usually assessed using standard diagnostic tools for survival models, which leaves room for improvement. The time-varying covariates in a two-stage joint survival model might contain outlying observations or subjects. In this study we use the variance shift outlier model (VSOM) to detect and down-weight outliers in the first stage of the two-stage joint survival model. This entails fitting a VSOM at the observation level and a VSOM at the subject level, and then fitting a combined VSOM for the identified outliers. The fitted values extracted from the combined VSOM are then used as a time-varying covariate in the extended Cox model. We illustrate this methodology on a dataset from a multi-centre randomised clinical trial, where the combined VSOM fits the data better than the standard extended Cox model. The improved fit reflects the fact that outliers are down-weighted.
Adjusted curves for clustered survival and competing risks data
Observational studies with right-censored data often involve clustering due to matched pairs or a study-center effect. In such data, there may be an imbalance in patient characteristics between treatment groups, in which case Kaplan-Meier curves or unadjusted cumulative incidence curves can be misleading and may not represent the average patient on a given treatment arm. Adjusted curves are desirable for appropriately displaying survival or cumulative incidence curves in this case. We propose methods for estimating adjusted survival and cumulative incidence probabilities for clustered right-censored data. For the competing risks outcome, we allow both covariate-independent and covariate-dependent censoring. We develop an R package implementing the proposed methods, which provides estimates of the adjusted survival and cumulative incidence probabilities along with their standard errors. Our simulation results show that the adjusted survival and cumulative incidence estimates of the proposed method are unbiased, with coverage rates close to the nominal 95% level. We apply the proposed method to stem cell transplant data of leukemia patients.
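The direct-adjustment idea for the survival outcome, ignoring clustering and assuming independent censoring, can be sketched as follows; the proposed method extends this to clustered data and to cumulative incidence with covariate-dependent censoring. Variable names are illustrative, with treatment trt coded 0/1.

  library(survival)

  fit <- coxph(Surv(time, status) ~ trt + age + sex, data = dat)
  adjusted_curve <- function(arm) {
    nd <- dat; nd$trt <- arm                   # set every patient to the same treatment arm
    rowMeans(survfit(fit, newdata = nd)$surv)  # average the predicted survival curves over the sample
  }
  adj_control <- adjusted_curve(0); adj_treated <- adjusted_curve(1)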
Optimal Personalized Treatment Selection with Multivariate Outcome Measures in a Multiple Treatment Case
In this work we propose a novel method for individualized treatment selection when there are correlated multiple treatment responses. For settings with two or more treatments, we compare suitable indexes based on the outcome variables for each treatment, conditional on patient-specific scores constructed from the collected covariate measurements. Our method covers any number of treatments and outcome variables and can be applied to a broad set of models. The proposed method uses a rank aggregation technique that accounts for possible correlations among the ranked lists to estimate an ordering of treatments based on treatment performance measures such as the smooth conditional mean. The method has the flexibility to incorporate patient and clinician preferences into the optimal treatment decision on an individual basis. A simulation study demonstrates the performance of the proposed method in finite samples. We also present data analyses using HIV clinical trial data to show the applicability of the proposed procedure to real data.
Analysis of combined probability and nonprobability samples: A simulation evaluation and application to a teen smoking behavior survey
In scientific studies with low-prevalence outcomes, probability sampling may be supplemented by nonprobability sampling to boost the sample size of a desired subpopulation while remaining representative of the entire study population. To utilize both probability and nonprobability samples appropriately, several methods for generating pseudo-weights have been proposed in the literature, including ad-hoc weights, inclusion probability adjusted weights, and propensity score adjusted weights. We empirically compare these weighting strategies via an extensive simulation study in which probability and nonprobability samples are combined; weight normalization and raking adjustment are also considered. Our simulation results suggest that the unity weight method (with weight normalization) and the inclusion probability adjusted weight method yield very good overall performance. This work is motivated by the Buckeye Teen Health Study, which examines risk factors for the initiation of smoking among teenage males in Ohio. To address the low response rate in the initial probability sample and the low prevalence of smokers in the target population, a small convenience sample was collected as a supplement. Our proposed method yields estimates very close to those from the analysis using only the probability sample and has the additional benefit of tracking more teens with risky behaviors through follow-ups.
A note on the estimation and inference with quadratic inference functions for correlated outcomes
The quadratic inference function approach is a popular method for the analysis of correlated data. The quadratic inference function is formulated from multiple sets of score equations (or extended score equations) that over-identify the regression parameters of interest, and it improves efficiency over generalized estimating equations under correlation misspecification. In this note, we provide an alternative solution to the quadratic inference function by separately solving each set of score equations and combining the solutions. We show that an optimally weighted combination of the estimators obtained separately from the distinct sets of score equations is asymptotically equivalent to the estimator obtained via the quadratic inference function. We further establish results on inference for the optimally weighted estimator and extend these insights to the general setting of over-identified estimating equations. A simulation study confirms the analytical insights and connections in finite samples.
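In the familiar generalized-least-squares form, the optimally weighted combination of the separate solutions $\hat\theta_1, \dots, \hat\theta_K$ (stacked as $\hat\Theta$, with estimated joint asymptotic covariance $\hat V$ and $D = (I_p, \dots, I_p)^{\top}$) is

  $\hat\theta_{\mathrm{opt}} = \bigl(D^{\top} \hat V^{-1} D\bigr)^{-1} D^{\top} \hat V^{-1} \hat\Theta, \qquad \widehat{\mathrm{avar}}\bigl(\hat\theta_{\mathrm{opt}}\bigr) = \bigl(D^{\top} \hat V^{-1} D\bigr)^{-1},$

which is the standard construction of an optimally weighted combination; the notation here is generic and not necessarily the article's.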
The generalized sigmoidal quantile function
In this note we introduce a new smooth nonparametric quantile function estimator, based on a newly defined generalized expectile function and termed the generalized sigmoidal quantile function estimator. We also introduce a hybrid quantile function estimator, which combines the optimal properties of the classic kernel quantile function estimator with those of our new generalized sigmoidal quantile function estimator. The generalized sigmoidal quantile function can estimate quantiles beyond the range of the data, which is important for certain applications with smaller sample sizes. We illustrate how this extrapolation property can be used to improve standard smoothed bootstrap resampling methods.
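For context, the classic kernel quantile function estimator that the hybrid estimator combines with the new sigmoidal component can be sketched as below, with a Gaussian kernel and a fixed bandwidth as illustrative choices (the division by sum(w) is a minor boundary normalization).

  kernel_quantile <- function(x, p, h = 0.05) {
    xs <- sort(x); n <- length(xs)
    # weight on the i-th order statistic: kernel mass over ((i-1)/n, i/n] centred at p
    w <- pnorm((1:n) / n, mean = p, sd = h) - pnorm((0:(n - 1)) / n, mean = p, sd = h)
    sum(w * xs) / sum(w)
  }
  kernel_quantile(rnorm(100), p = 0.9)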
