MIXED MODELING APPROACH FOR CHARACTERIZING THE GENETIC EFFECTS IN A LONGITUDINAL PHENOTYPE
Approaches for estimating genetic effects at the individual level often focus on analyzing phenotypes at a single time point, with less attention given to longitudinal phenotypes. This paper introduces a mixed modeling approach that includes both genetic and individual-specific random effects and is designed to estimate genetic effects on both the baseline and the slope of a longitudinal trajectory. The inclusion of genetic effects on both baseline and slope, combined with the crossed structure of genetic and individual-specific random effects, creates complex dependencies across repeated measurements for all subjects. These complexities necessitate the development of novel estimation procedures for parameter estimation and individual-specific predictions of genetic effects on both baseline and slope. We employ an Average Information Restricted Maximum Likelihood (AI-REML) algorithm to estimate the variance components corresponding to genetic and individual-specific effects for the baseline levels and rates of change of a longitudinal phenotype. The algorithm is used to characterize the prostate-specific antigen (PSA) trajectories of participants who remained prostate cancer-free in the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial. Understanding genetic and individual-specific variation in this population will provide insights for determining the role of genetics in cancer screening. Our results reveal significant genetic contributions to both initial PSA levels and their progression over time, highlighting the role of these genetic factors in the variability of PSA across unaffected individuals. We show how genetic factors can be used to identify individuals prone to high baseline PSA values and increasing PSA trajectories among individuals who are prostate cancer-free. In turn, we can identify groups of individuals who have a high probability of falsely screening positive for prostate cancer using well-established cutoffs for early detection based on the level and rate of change of this biomarker. The results demonstrate the importance of incorporating genetic factors when monitoring PSA for more accurate prostate cancer detection.
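A minimal sketch of this type of model, using illustrative notation (not the paper's own), where g and u denote genetic and individual-specific random effects and A is a genetic relationship matrix:

y_{ij} = (\beta_0 + g_{0i} + u_{0i}) + (\beta_1 + g_{1i} + u_{1i})\, t_{ij} + \epsilon_{ij},
(g_{01}, \ldots, g_{0n})^\top \sim N(0, \sigma^2_{g_0} A), \quad (g_{11}, \ldots, g_{1n})^\top \sim N(0, \sigma^2_{g_1} A),
u_{0i} \sim N(0, \sigma^2_{u_0}), \quad u_{1i} \sim N(0, \sigma^2_{u_1}), \quad \epsilon_{ij} \sim N(0, \sigma^2_e).

AI-REML is then used to estimate the variance components (\sigma^2_{g_0}, \sigma^2_{g_1}, \sigma^2_{u_0}, \sigma^2_{u_1}, \sigma^2_e), and the genetic effects on baseline and slope are predicted at the individual level.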
NETWORK-BASED MODELING OF EMOTIONAL EXPRESSIONS FOR MULTIPLE CANCERS VIA A LINGUISTIC ANALYSIS OF AN ONLINE HEALTH COMMUNITY
The diagnosis and treatment of cancer can evoke a variety of adverse emotions. Online health communities (OHCs) provide a safe platform for cancer patients and those close to them to express emotions without fear of judgement or stigma. In the literature, linguistic analysis of OHCs is usually limited to a single disease and based on methods with various technical limitations. In this article we analyze publicly available posts from September 2003 to September 2022 on eight cancers from the American Cancer Society's Cancer Survivors Network (CSN). We propose a novel network analysis technique based on low-rank matrices. The proposed approach decomposes the emotional expression semantic networks into an across-cancer time-independent component (which describes the "baseline" that is shared by multiple cancers), a cancer-specific time-independent component (which describes cancer-specific properties), and an across-cancer time-dependent component (which accommodates temporal effects on multiple cancer communities). For the second and third components, respectively, we consider a novel clustering structure and a change point structure. A penalization approach is developed, and its theoretical and computational properties are carefully established. The analysis of the CSN data leads to sensible networks and deeper insights into emotions for cancer overall and for specific cancer types.
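A schematic of the described decomposition, with illustrative notation (\Theta_{k,t} denotes the emotional-expression network for cancer community k in time period t):

\Theta_{k,t} \approx B + C_k + D_t,

where B is the baseline component shared across cancers and time, the cancer-specific components C_k carry a clustering structure across cancers, and the time-dependent components D_t are piecewise constant in t with change points. A penalty can then, for example, encourage low rank, clustering of the C_k, and a small number of change points in the D_t.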
BAYESIAN LEARNING OF CLINICALLY MEANINGFUL SEPSIS PHENOTYPES IN NORTHERN TANZANIA
Sepsis is a life-threatening condition caused by a dysregulated host response to infection. Recently, researchers have hypothesized that sepsis consists of a heterogeneous spectrum of distinct subtypes, motivating several studies to identify clusters of sepsis patients that correspond to subtypes, with the long-term goal of using these clusters to design subtype-specific treatments. For clusters to be useful in this way, clinicians rely on them having a concrete medical interpretation, usually corresponding to clinically meaningful regions of the sample space with direct implications for practitioners. In this article, we propose Clustering Around Meaningful Regions (CLAMR), a Bayesian clustering approach that explicitly models the medical interpretation of each cluster center. CLAMR favors clusterings that can be summarized via meaningful feature values, leading to medically significant sepsis patient clusters. We also provide details on measuring the effect of each feature on the clustering using Bayesian hypothesis tests, so one can assess which features are relevant for cluster interpretation. Our focus is on clustering sepsis patients from Moshi, Tanzania, where patients are younger and the prevalence of HIV infection is higher than in previous sepsis subtyping cohorts.
BAYESIAN DIFFERENTIAL CAUSAL DIRECTED ACYCLIC GRAPHS FOR OBSERVATIONAL ZERO-INFLATED COUNTS WITH AN APPLICATION TO TWO-SAMPLE SINGLE-CELL DATA
Observational zero-inflated count data arise in a wide range of areas such as genomics. One of the common research questions is to identify causal relationships by learning the structure of a sparse directed acyclic graph (DAG). While structure learning of DAGs has been an active research area, existing methods do not adequately account for excessive zeros and are therefore not suitable for modeling zero-inflated count data. Moreover, it is often of interest to study differences in the causal networks for data collected from two experimental groups (e.g., control vs. treatment). To explicitly account for zero-inflation and identify differential causal networks, we propose a novel Bayesian differential zero-inflated negative binomial DAG (DAG0) model. We prove that the causal relationships under the proposed DAG0 are fully identifiable from purely observational, cross-sectional data, using a general proof technique that is applicable beyond the proposed model. Bayesian inference based on parallel-tempered Markov chain Monte Carlo is developed to efficiently explore the multi-modal posterior landscape. We demonstrate the utility of the proposed DAG0 by comparing it with state-of-the-art alternative methods through extensive simulations. An application to a single-cell RNA-sequencing dataset generated under two experimental groups yields results that appear to be consistent with existing knowledge. A user-friendly R package that implements DAG0 is available at https://github.com/junsoukchoi/BayesDAG0.git.
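A schematic of a zero-inflated negative binomial node-conditional specification of the kind described (the exact parameterization of DAG0 may differ): for each node X_j with parent set pa(j) in the DAG,

X_j \mid X_{pa(j)} \sim \pi_j\, \delta_0 + (1 - \pi_j)\, \mathrm{NB}(\mu_j, \phi_j), \qquad \log \mu_j = \beta_{j0} + \sum_{k \in pa(j)} \beta_{jk} \log(X_k + 1),

where \delta_0 is a point mass at zero and \pi_j is the excess-zero probability. Fitting group-specific parameters for the control and treatment samples and contrasting them yields the differential causal network.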
SURROGATE SELECTION OVERSAMPLES EXPANDED T CELL CLONOTYPES
Surrogate selection is an experimental design that can restrict a sample of cells to those carrying certain genomic mutations without sequencing any DNA. In immunological disease studies, this design may provide a relatively easy approach to enrich a lymphocyte sample with cells relevant to the disease response because the emergence of neutral mutations is associated with the proliferation history of clonal subpopulations. A statistical analysis of clonotype sizes provides a structured, quantitative perspective on this useful property of surrogate selection. Our model specification couples within-clonotype birth-death processes with an exchangeable model across clonotypes. Beyond enrichment questions about the surrogate selection design, our framework enables a study of the sampling properties of elementary sample diversity statistics; it also points to new statistics that may usefully measure the burden of somatic genomic alterations associated with clonal expansion. We examine statistical properties of immunological samples governed by the coupled model specification, and we illustrate calculations in surrogate selection studies of melanoma and in single-cell genomic studies of T cell repertoires.
AVERAGED PREDICTION MODELS (APM): IDENTIFYING CAUSAL EFFECTS IN CONTROLLED PRE-POST SETTINGS WITH APPLICATION TO GUN POLICY
To investigate causal impacts, many researchers use controlled pre-post designs that compare over-time differences between a population exposed to a policy change and an unexposed comparison group. However, researchers using these designs often disagree about the "correct" specification of the causal model, perhaps most notably in analyses to identify the effects of gun policies on crime. To help settle these model specification debates, we propose a general identification framework that unifies a variety of models researchers use in practice. In this framework, which nests "brand name" designs like Difference-in-Differences as special cases, we use models to predict untreated outcomes and then correct the treated group's predictions using the comparison group's observed prediction errors. Our point identifying assumption is that treated and comparison groups would have equal prediction errors (in expectation) under no treatment. To choose among candidate models, we propose a data-driven procedure based on models' robustness to violations of this point identifying assumption. Our selection procedure averages over candidate models, weighting by each model's posterior probability of being the most robust given its differential average prediction errors in the pre-period. This approach offers a way out of debates over the "correct" model by choosing based on robustness instead and has the desirable property of being feasible in the "locked box" of pre-intervention data only. We apply our methodology to the gun policy debate, focusing specifically on Missouri's 2007 repeal of its permit-to-purchase law, and provide an R package (apm) for implementation.
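A minimal sketch of the prediction-error correction idea for a single candidate model; the data frame d, its columns (group, period, time, y), and the linear model are hypothetical, and this is not the apm package interface:

# candidate model fit to pre-period data only
fit <- lm(y ~ time, data = subset(d, period == "pre"))
post <- subset(d, period == "post")
post$yhat <- predict(fit, newdata = post)              # predicted untreated outcomes

# correct the treated group's prediction using the comparison group's observed error
err_comp   <- with(subset(post, group == "comparison"), mean(y - yhat))
y0_treated <- mean(subset(post, group == "treated")$yhat) + err_comp
att        <- mean(subset(post, group == "treated")$y) - y0_treated  # estimated effect

Averaging over several candidate models, weighted by their pre-period robustness, gives the APM estimate.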
A DEEP NEURAL NETWORK TWO-PART MODEL AND FEATURE IMPORTANCE TEST FOR SEMI-CONTINUOUS DATA
Semi-continuous data frequently arise in clinical practice. For example, while many surgical patients still suffer from varying degrees of acute postoperative pain (POP) sometime after surgery (i.e., POP score > 0), others experience none (i.e., POP score = 0), indicating the existence of two distinct data processes at play. Existing parametric or semi-parametric two-part modeling methods for this type of semi-continuous data can fail to appropriately model the two underlying data processes, as such methods rely heavily on (generalized) linear additive assumptions. However, many factors may interact to jointly influence the experience of POP non-additively and non-linearly. Motivated by this challenge and inspired by the ability of deep neural networks (DNN) to flexibly and accurately approximate complex functions, we derive a DNN-based two-part model by adapting conventional DNN methods with two additional components: a bootstrapping procedure and a filtering algorithm that together boost the stability of the conventional DNN, an approach we denote as sDNN. To improve the interpretability and transparency of sDNN, we further derive a feature importance testing procedure to identify important features associated with the outcome measurements of the two data processes, denoting this approach fsDNN. We show that fsDNN not only offers a statistical inference procedure for each feature under complex association but also that using the identified features can further improve the predictive performance of sDNN. The proposed sDNN- and fsDNN-based two-part models are applied to the analysis of real data from a POP study, where they clearly demonstrate advantages over the existing parametric and semi-parametric two-part models. Further, we conduct extensive numerical studies and draw comparisons with other machine learning methods to demonstrate that sDNN and fsDNN consistently outperform the existing two-part models and frequently used machine learning methods regardless of the data complexity. An R package implementing the proposed methods has been developed and is available in the Supplementary Material (Zou et al., 2025) and is also deposited on GitHub (https://github.com/BZou-lab/fsDNN).
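For contrast, a minimal parametric two-part analogue (not sDNN itself); the data frame d and its columns pop, age, and surgery_type are hypothetical:

d$pos <- as.numeric(d$pop > 0)
# part 1: probability of experiencing any pain (POP score > 0)
fit_zero <- glm(pos ~ age + surgery_type, family = binomial, data = d)
# part 2: severity among patients with positive pain scores
fit_pos  <- lm(log(pop) ~ age + surgery_type, data = subset(d, pos == 1))

sDNN replaces each of these linear predictors with a stabilized deep neural network, and fsDNN adds the feature importance test on top.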
BAYESIAN DATA AUGMENTATION FOR RECURRENT EVENTS UNDER INTERMITTENT ASSESSMENT IN OVERLAPPING INTERVALS WITH APPLICATIONS TO EMR DATA
Electronic medical records (EMR) data contain rich information that can facilitate health-related studies but are collected primarily for purposes other than research. For recurrent events, EMR data often do not record event times or counts but only contain intermittently assessed and censored observations (i.e., upper and/or lower bounds for counts in a time interval) at uncontrolled times. This can result in non-contiguous or overlapping assessment intervals with censored event counts. Existing methods for analyzing intermittently assessed recurrent events assume disjoint assessment intervals with known counts (interval count data), owing to a focus on prospective studies with controlled assessment times. We propose a Bayesian data augmentation method to analyze the complicated assessments in EMR data for recurrent events. Within a Gibbs sampler, event times are imputed by generating sets of event times from non-homogeneous Poisson processes and rejecting proposed sets that are incompatible with constraints imposed by the assessment data. Based on the independent increments property of Poisson processes, we implement three techniques to speed up this rejection sampling imputation method for large EMR datasets: independent sampling by partitioning, truncated generation, and sequential sampling. In a simulation study we show that our method accurately estimates parameters of log-linear Poisson process intensities. Although the proposed method can be applied generally to EMR data on recurrent events, our study is specifically motivated by identifying risk factors for falls due to cancer treatment and its supportive medications. We used the proposed method to analyze an EMR dataset comprising 5501 patients treated for breast cancer. Our analysis provides evidence supporting associations between certain risk factors (including classes of medications) and the risk of falls.
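A minimal sketch of the rejection-sampling imputation step: simulate event times from a non-homogeneous Poisson process by thinning and keep only draws compatible with an assessment constraint. The intensity function and the constraint (between 2 and 4 events in [2, 6]) are illustrative, not taken from the paper:

lambda     <- function(t) 0.5 + 0.3 * sin(t)    # example intensity on [0, 10]
lambda_max <- 0.8                               # upper bound on lambda, used for thinning

sim_nhpp <- function(t_end) {
  cand <- cumsum(rexp(200, rate = lambda_max))  # homogeneous candidate event times
  cand <- cand[cand <= t_end]
  cand[runif(length(cand)) < lambda(cand) / lambda_max]   # thin to the target intensity
}

repeat {                                        # rejection step
  times <- sim_nhpp(t_end = 10)
  n_win <- sum(times >= 2 & times <= 6)
  if (n_win >= 2 && n_win <= 4) break           # accept only sets compatible with the assessment
}

The paper's partitioning, truncated generation, and sequential sampling techniques are what make this basic scheme feasible at EMR scale.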
HETEROGENEOUS TREATMENT AND SPILLOVER EFFECTS UNDER CLUSTERED NETWORK INTERFERENCE
The bulk of causal inference studies rule out the presence of interference between units. However, in many real-world scenarios, units are interconnected by social, physical, or virtual ties, and the effect of the treatment can spill over from one unit to other connected individuals in the network. In this paper, we develop a machine learning method that uses tree-based algorithms and a Horvitz-Thompson estimator to assess the heterogeneity of treatment and spillover effects with respect to individual, neighborhood, and network characteristics in the context of clustered networks and interference within clusters. The proposed network causal tree (NCT) algorithm has several advantages. First, it allows the investigation of the heterogeneity of the treatment effect, avoiding potential bias due to the presence of interference. Second, understanding the heterogeneity of both treatment and spillover effects can guide policymakers in scaling up interventions, designing targeting strategies, and increasing cost-effectiveness. We investigate the performance of our NCT method using a Monte Carlo simulation study and illustrate its application to assess the heterogeneous effects of information sessions on the uptake of a new weather insurance policy in rural China.
DYNAMIC PREDICTION WITH MULTIVARIATE LONGITUDINAL OUTCOMES AND LONGITUDINAL MAGNETIC RESONANCE IMAGING DATA
Alzheimer's Disease (AD) is a common neurodegenerative disorder impairing multiple domains. Recent AD studies, for example, the Alzheimer's Disease Neuroimaging Initiative (ADNI) study, collect multimodal data to better understand AD severity and progression. To facilitate precision medicine for high-risk individuals, it is essential to develop an AD predictive model that leverages multimodal data and provides accurate personalized predictions of dementia occurrence. In this article we propose a multivariate functional mixed model with longitudinal magnetic resonance imaging data (MFMM-LMRI) that jointly models longitudinal neurological scores, longitudinal voxelwise MRI data, and the survival outcome of dementia onset. We model the longitudinal MRI data using the joint and individual variation explained (JIVE) approach. We investigate two functional forms linking the longitudinal and survival processes. We adopt the Markov chain Monte Carlo (MCMC) method to obtain posterior samples. We establish a dynamic prediction framework that predicts longitudinal trajectories and the probability of dementia occurrence. A simulation study with various sample sizes and event rates supports the validity of the method. We apply MFMM-LMRI to the motivating ADNI study and conclude that additional ApoE-4 alleles and a higher latent disease profile are associated with a higher risk of dementia onset. We detect a significant association between the longitudinal MRI data and the survival outcome. The instantaneous model with longitudinal MRI data has the best fit and predictive performance.
INFERRING SYNERGISTIC AND ANTAGONISTIC INTERACTIONS IN MIXTURES OF EXPOSURES
There is abundant interest in assessing the joint effects of multiple exposures on human health. This is often referred to as the mixtures problem in environmental epidemiology and toxicology. Classically, studies have examined the adverse health effects of different chemicals one at a time, but there is concern that certain chemicals may act together to amplify each other's effects. Such amplification is referred to as a synergistic interaction, while chemicals that inhibit each other's effects exhibit antagonistic interactions. Current approaches for assessing the health effects of chemical mixtures do not explicitly consider synergy or antagonism in the modeling, instead focusing on either parametric or unconstrained nonparametric dose-response surface modeling. The parametric case can be too inflexible, while nonparametric methods face a curse of dimensionality that leads to overly wiggly and uninterpretable surface estimates. We propose a Bayesian approach that decomposes the response surface into additive main effects and pairwise interaction effects and then detects synergistic and antagonistic interactions. Variable selection decisions for each interaction component are also provided. This Synergistic Antagonistic Interaction Detection (SAID) framework is evaluated relative to existing approaches using simulation experiments and an application to data from NHANES.
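A schematic of the described decomposition, with illustrative notation: for exposures x = (x_1, \ldots, x_p),

E[y \mid x] = \mu + \sum_j f_j(x_j) + \sum_{j < k} f_{jk}(x_j, x_k),

where the f_j are additive main effects and each pairwise surface f_{jk} is classified, via variable-selection indicators, as null, synergistic (amplifying the combined effect beyond the main effects), or antagonistic (attenuating it).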
LOW-RANK LONGITUDINAL FACTOR REGRESSION WITH APPLICATION TO CHEMICAL MIXTURES
Developmental epidemiology commonly focuses on assessing the association between multiple early life exposures and childhood health. Statistical analyses of data from such studies focus on inferring the contributions of individual exposures, while also characterizing time-varying and interacting effects. Such inferences are made more challenging by correlations among exposures, nonlinearity, and the curse of dimensionality. Motivated by studying the effects of prenatal bisphenol A (BPA) and phthalate exposures on glucose metabolism in adolescence using data from the ELEMENT study, we propose a low-rank longitudinal factor regression (LowFR) model for tractable inference on flexible longitudinal exposure effects. LowFR handles highly correlated exposures using a Bayesian dynamic factor model, which is fit jointly with a health outcome via a novel factor regression approach. The model collapses to simpler and intuitive submodels when appropriate, while expanding to allow considerable flexibility in time-varying and interaction effects when supported by the data. After demonstrating LowFR's effectiveness in simulations, we use it to analyze the ELEMENT data and find that diethyl and dibutyl phthalate metabolite levels in trimesters 1 and 2 are associated with altered glucose metabolism in adolescence.
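A schematic of the structure described, with illustrative notation: exposures measured at visit t follow a dynamic factor model x_{it} = \Lambda \eta_{it} + \epsilon_{it}, and the health outcome is regressed on the latent factors, for example

y_i = \alpha + \sum_t \omega_t^\top \eta_{it} + \sum_{t \le t'} \eta_{it}^\top \Omega_{tt'} \eta_{it'} + e_i,

so that time-varying and interaction effects are expressed through a modest number of factors rather than through the full set of correlated exposures.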
UTILIZING A CAPTURE-RECAPTURE STRATEGY TO ACCELERATE INFECTIOUS DISEASE SURVEILLANCE
Monitoring key elements of disease dynamics (e.g., prevalence, case counts) is of great importance in infectious disease prevention and control, as emphasized during the COVID-19 pandemic. To facilitate this effort, we propose a new capture-recapture (CRC) analysis strategy that adjusts for misclassification stemming from the use of easily administered but imperfect diagnostic test kits, such as rapid antigen test kits or saliva tests. Our method is based on a recently proposed "anchor stream" design, whereby an existing voluntary surveillance data stream is augmented by a smaller and judiciously drawn random sample. It incorporates manufacturer-specified sensitivity and specificity parameters to account for imperfect diagnostic results in one or both data streams. For inference to accompany case count estimation, we improve upon traditional Wald-type confidence intervals by developing an adapted Bayesian credible interval for the CRC estimator that yields favorable frequentist coverage properties. When feasible, the proposed design and analytic strategy provides a more efficient solution than traditional CRC methods or random sampling-based bias-corrected estimation for monitoring disease prevalence while accounting for misclassification. We demonstrate the benefits of this approach through simulation studies and a numerical example that underscore its potential utility in practice for economical disease monitoring among a registered closed population.
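As a point of reference, the standard Rogan-Gladen correction adjusts an apparent prevalence for known sensitivity and specificity; the numbers below are hypothetical, and this is only one ingredient of the full anchor-stream CRC estimator:

n     <- 1000                                   # anchor-stream random sample size
x_pos <- 120                                    # observed positives on the imperfect test
se    <- 0.85                                   # manufacturer-specified sensitivity
sp    <- 0.99                                   # manufacturer-specified specificity
p_adj <- (x_pos / n + sp - 1) / (se + sp - 1)   # misclassification-adjusted prevalence
p_adj                                           # approximately 0.131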
A LATENT VARIABLE MIXTURE MODEL FOR COMPOSITION-ON-COMPOSITION REGRESSION WITH APPLICATION TO CHEMICAL RECYCLING
Compositional data are frequently encountered in regression settings. When both responses and predictors are compositional, most existing models rely on a family of log-ratio based transformations to move the analysis from the simplex to the reals. This often makes the interpretation of the model more complex. A transformation-free regression model was recently developed, but it only allows for a single compositional predictor. However, many datasets include multiple compositional predictors of interest. Motivated by an application to hydrothermal liquefaction (HTL) data, a novel extension of this transformation-free regression model is provided that allows for two (or more) compositional predictors to be used via a latent variable mixture. A modified expectation-maximization algorithm is proposed to estimate model parameters, which are shown to have natural interpretations. Conformal inference is used to obtain prediction limits on the compositional response. The resulting methodology is applied to the HTL dataset. Extensions to multiple predictors are discussed.
STATISTICAL CURVE MODELS FOR INFERRING 3D CHROMATIN ARCHITECTURE
Reconstructing three-dimensional (3D) chromatin structure from conformation capture assays (such as Hi-C) is a critical task in computational biology, since chromatin spatial architecture plays a vital role in numerous cellular processes and direct imaging is challenging. Most existing algorithms that operate on Hi-C contact matrices produce reconstructed 3D configurations in the form of a polygonal chain. However, none of these methods exploit the fact that the target solution is a (smooth) curve in 3D: this contiguity attribute is either ignored or indirectly addressed by imposing spatial constraints that are challenging to formulate. In this paper we develop both B-spline and smoothing spline techniques for directly capturing this potentially complex 1D curve. We subsequently combine these techniques with a Poisson model for contact counts and compare their performance on a real data example. In addition, motivated by the sparsity of Hi-C contact data, especially when obtained from single-cell assays, we appreciably extend the class of distributions used to model contact counts. We build a general distribution-based metric scaling framework from which we develop zero-inflated and hurdle Poisson models as well as negative binomial variants. Illustrative applications make recourse to bulk Hi-C data from IMR90 cells and single-cell Hi-C data from mouse embryonic stem cells.
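A schematic of the curve-based Poisson specification, with illustrative notation: the 3D configuration is a smooth curve s(t) = \sum_k \theta_k B_k(t) built from B-spline (or smoothing spline) basis functions B_k, and the contact count between loci i and j is modeled as

c_{ij} \sim \mathrm{Poisson}(\mu_{ij}), \qquad \log \mu_{ij} = \beta_0 + \beta_1 \log \lVert s(t_i) - s(t_j) \rVert,

with zero-inflated, hurdle, and negative binomial analogues replacing the Poisson when contact matrices are sparse, as in single-cell data.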
INDIVIDUAL DYNAMIC PREDICTION FOR CURE AND SURVIVAL BASED ON LONGITUDINAL BIOMARKERS
To optimize personalized treatment strategies and extend patients' survival times, it is critical to accurately predict patients' prognoses at all stages, from disease diagnosis to follow-up visits. The longitudinal biomarker measurements taken during visits are essential for this prediction purpose. Patients' ultimate concerns are cure and survival. However, in many situations, there is no clear biomarker indicator for cure. We propose a comprehensive joint model of longitudinal and survival data and a landmark cure model, incorporating proportions of potentially cured patients. The survival distributions in the joint and landmark models are specified through flexible hazard functions with the proportional hazards model as a special case, allowing other patterns such as crossing hazard and survival functions. Formulas are provided for predicting each individual's probabilities of future cure and survival at any time point based on his or her current biomarker history. Simulations show that, with these comprehensive and flexible properties, the proposed cure models outperform standard cure models in terms of predictive performance, as measured by the time-dependent area under the receiver operating characteristic curve, the Brier score, and the integrated Brier score. The use and advantages of the proposed models are illustrated by their application to a study of patients with chronic myeloid leukemia.
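A schematic mixture-cure formulation of the kind described, with illustrative notation: the marginal survival function is

S(t \mid x) = \pi(x) + \{1 - \pi(x)\}\, S_u(t \mid x),

where \pi(x) is the cure probability and S_u the survival function of uncured patients. Given survival to a landmark time s (with the biomarker history up to s entering through the covariates), the dynamic prediction of cure is

P(\text{cured} \mid T > s, x) = \frac{\pi(x)}{\pi(x) + \{1 - \pi(x)\}\, S_u(s \mid x)}.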
A SEMIPARAMETRIC METHOD FOR RISK PREDICTION USING INTEGRATED ELECTRONIC HEALTH RECORD DATA
When using electronic health records (EHRs) for clinical and translational research, additional data is often available from external sources to enrich the information extracted from EHRs. For example, academic biobanks have more granular data available, and patient-reported data is often collected through small-scale surveys. It is common that the external data is available only for a small subset of patients who have EHR information. We propose efficient and robust methods for building and evaluating models for predicting the risk of binary outcomes using such integrated EHR data. Our method is built upon an idea from the two-phase design literature: modeling the availability of a patient's external data as a function of an EHR-based preliminary predictive score leads to effective utilization of the EHR data. Through both theoretical and simulation studies, we show that our method has high efficiency for estimating log-odds ratio parameters, the area under the ROC curve, as well as other measures for quantifying predictive accuracy. We apply our method to develop a model for predicting the short-term mortality risk of oncology patients, where the data was extracted from the University of Pennsylvania hospital system EHR and combined with survey-based patient-reported outcome data.
MODELING TRAJECTORIES USING FUNCTIONAL LINEAR DIFFERENTIAL EQUATIONS
We are motivated by a study that seeks to better understand the dynamic relationship between muscle activation and paw position during locomotion. For each gait cycle in this experiment, activation in the biceps and triceps was measured continuously and in parallel with paw position as a mouse trotted on a treadmill. We propose an innovative general regression method that draws from both ordinary differential equations and functional data analysis to model the relationship between these functional inputs and responses as a dynamical system that evolves over time. Specifically, our model addresses gaps in both literatures and borrows strength across curves by estimating ODE parameters across all curves simultaneously rather than modeling each functional observation separately. Our approach compares favorably to related functional data methods in simulations and in cross-validated predictive accuracy of paw position in the gait data. In the analysis of the gait cycles, we find that paw speed and position are dynamically influenced by inputs from the biceps and triceps muscles and that the effect of muscle activation persists beyond the activation itself.
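A schematic functional linear differential equation of the kind described, with illustrative notation: for paw position x(t) and biceps and triceps activations u_b(t) and u_t(t) within a gait cycle,

x'(t) = \beta_0(t) + \beta_1(t)\, x(t) + \beta_2(t)\, u_b(t) + \beta_3(t)\, u_t(t),

with the coefficient functions estimated jointly across all gait cycles. Because activation enters through the derivative, its influence on position accumulates over time and can persist beyond the activation itself.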
OUTCOME-GUIDED DISEASE SUBTYPING BY GENERATIVE MODEL AND WEIGHTED JOINT LIKELIHOOD IN TRANSCRIPTOMIC APPLICATIONS
With advances in high-throughput technology, molecular disease subtyping by high-dimensional omics data has been recognized as an effective approach for identifying subtypes of complex diseases with distinct disease mechanisms and prognoses. Conventional cluster analysis takes omics data as input and generates patient clusters with similar gene expression patterns. The omics data, however, usually contain multi-faceted cluster structures that can be defined by different sets of genes. If the gene set associated with irrelevant clinical variables (e.g., sex or age) dominates the clustering process, the resulting clusters may not capture clinically meaningful disease subtypes. This motivates the development, in this paper, of a clustering framework guided by a pre-specified disease outcome, such as a lung function measurement or survival. We propose two disease subtyping methods for omics data with outcome guidance, using a generative model or a weighted joint likelihood. Both methods connect an outcome association model and a disease subtyping model through a latent variable of cluster labels. Compared to the generative model, the weighted joint likelihood contains a data-driven weight parameter to balance the likelihood contributions from outcome association and gene cluster separation, which improves generalizability in independent validation but requires heavier computation. Extensive simulations and two real applications in lung disease and triple-negative breast cancer demonstrate the superior disease subtyping performance of the outcome-guided clustering methods in terms of disease subtyping accuracy, gene selection, and outcome association. Unlike existing clustering methods, the outcome-guided disease subtyping framework creates a new precision medicine paradigm to directly identify patient subgroups with clinical association.
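A schematic of the weighted joint likelihood, with illustrative notation: with latent cluster labels z_i, an omics model f(x_i \mid z_i) and an outcome model g(y_i \mid z_i), the weighted log-likelihood takes the form

\ell_w(\theta) = \sum_i \big[\, w \log g(y_i \mid z_i, \theta) + (1 - w) \log f(x_i \mid z_i, \theta) \,\big],

where the data-driven weight w balances outcome association against gene cluster separation; the generative-model variant corresponds to a fully specified joint distribution of (x_i, y_i, z_i) without such a weight.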
QUANTILE REGRESSION DECOMPOSITION ANALYSIS OF DISPARITY RESEARCH USING COMPLEX SURVEY DATA: APPLICATION TO DISPARITIES IN BMI AND TELOMERE LENGTH BETWEEN U.S. MINORITY AND WHITE POPULATION GROUPS
We develop a quantile regression decomposition (QRD) method for analyzing observed disparities (OD) between population groups in socioeconomic and health-related outcomes using complex survey data. Conventional decomposition approaches use conditional mean regression to decompose the disparity into two parts: the part explained by differences in the distributions of the explanatory covariates and the remaining part, which is unexplained by the covariates. Many socioeconomic and health outcomes exhibit heteroscedastic distributions, where the magnitude of observed disparities varies across different quantiles of these outcomes. Thus, differences in the explanatory covariates may account for varying differences in the OD across the quantiles of the outcome. QRD can identify where there are greater differences in the outcome distribution, for example at the 90th quantile, and how important the covariates are in explaining those differences. Much socioeconomic and health research relies on complex surveys, such as the National Health and Nutrition Examination Survey (NHANES), that oversample individuals from disadvantaged/minority population groups in order to provide improved precision. QRD has not previously been extended to the complex survey setting. We improve the QRD approach proposed in Machado and Mata (2005) to yield more reliable estimates at quantiles where the data are sparse, and extend it to the complex survey setting. We also propose a perturbation-based variance estimation method. Simulation studies indicate that the estimates of the unexplained portions of the OD across quantiles are unbiased and that the coverage of the confidence intervals is close to the nominal value. This methodology is used to study disparities in body mass index (BMI) and telomere length between race/ethnic groups using the NHANES data.
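A schematic of the decomposition in the Machado and Mata (2005) style, with illustrative notation: letting \beta_g(\tau) denote group g's quantile regression coefficients and F_{x,g} its covariate distribution, the disparity at quantile \tau is split as

Q_A(\tau) - Q_B(\tau) = \big[ Q(\tau; \beta_A, F_{x,A}) - Q(\tau; \beta_A, F_{x,B}) \big] + \big[ Q(\tau; \beta_A, F_{x,B}) - Q(\tau; \beta_B, F_{x,B}) \big],

where the first bracket is the portion explained by differences in covariate distributions, the second is the unexplained portion, the counterfactual quantiles Q(\tau; \beta, F_x) are constructed by simulation, and survey weights enter both the quantile regressions and the counterfactual construction.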
SEMIPARAMETRIC LINEAR REGRESSION WITH AN INTERVAL-CENSORED COVARIATE IN THE ATHEROSCLEROSIS RISK IN COMMUNITIES STUDY
In longitudinal studies, investigators are often interested in understanding how the time since the occurrence of an intermediate event affects a future outcome. The intermediate event is often asymptomatic such that its occurrence is only known to lie in a time interval induced by periodic examinations. We propose a linear regression model that relates the time since the occurrence of the intermediate event to a continuous response at a future time point through a rectified linear unit activation function while formulating the distribution of the time to the occurrence of the intermediate event through the Cox proportional hazards model. We consider nonparametric maximum likelihood estimation with an arbitrary sequence of examination times for each subject. We present an EM algorithm that converges stably for arbitrary datasets. The resulting estimators of regression parameters are consistent, asymptotically normal, and asymptotically efficient. We assess the performance of the proposed methods through extensive simulation studies and provide an application to the Atherosclerosis Risk in Communities Study.
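A schematic of the model described, with illustrative notation: let T be the time of the intermediate event, Y the continuous response measured at time t, and (x, z) covariates; then

Y = \alpha^\top x + \beta\, (t - T)_+ + \epsilon, \qquad (t - T)_+ = \max(t - T, 0),

with the distribution of T specified through the Cox model \lambda(s \mid z) = \lambda_0(s) \exp(\gamma^\top z), and T known only to lie in an interval induced by the examination times. Nonparametric maximum likelihood with an EM algorithm handles the interval censoring.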
