Calibrated Dietary Patterns and Cancer Risk in the Women's Health Initiative Cohorts
We developed calibration equations using metabolomics from fasting blood and 24-hour urine for the Healthy Eating Index 2010 (HEI-2010) and the Alternative Healthy Eating Index 2010 (AHEI-2010) to address measurement error from self-reported diet. We examined associations between metabolomic-calibrated dietary patterns and cancer risk in the Women's Health Initiative (WHI, n=108,522). Metabolomic signatures were created from a WHI Feeding Study (n=153; 2010-2014) and the WHI Observational Study (n=450; 2006-2009). Dietary patterns were regressed on metabolites using the feeding study food intake records. Metabolomic-based dietary patterns were estimated from 24-hour dietary recalls, FFQ and 4DFR in the Observational Study using a stepwise approach. Cox regression estimated cancer risk associated with metabolomic-calibrated dietary patterns over a median follow-up of 15.8 years. Adjusted R² values for the HEI-2010 and AHEI-2010 calibration equations were 57.5% and 48.8% for FFQ, 61.6% and 62.6% for 4DFR, and 52.5% and 53.2% for dietary recalls. Without calibration, a 20% increment in HEI-2010 was associated with lower risk of colorectal (HR=0.94, 95% CI=0.90-0.99), lung (HR=0.90, 95% CI=0.86-0.94), bladder (HR=0.86, 95% CI=0.75-0.99), and total invasive cancers (HR=0.98, 95% CI=0.96-0.99). With metabolomic calibration, higher HEI-2010 was associated with lower risk of lung (HR=0.79, 95% CI=0.71-0.88) and total invasive cancers (HR=0.96, 95% CI=0.92-1.00). Metabolomic-calibrated dietary patterns might mitigate measurement errors and strengthen diet-cancer associations.
Assessing SARS-CoV-2 transmission in African households from the reanalysis of serosurveys
Household transmission studies provided key insights into SARS-CoV-2 transmission in high-income countries but were rarely implemented in Africa. To help fill this gap, we analyzed SARS-CoV-2 seroprevalence studies with household-based recruitment, focusing on households with ≤7 members, in four Sub-Saharan African cities: Kinshasa (82 households, 370 individuals), Lubumbashi (225 households, 970 individuals), Conakry (149 households, 649 individuals), and Yaoundé (311 households, 1,183 individuals), between late 2020 and mid-2021. Using an extended chain-binomial model accounting for missing serology, we estimated both the probability of community-acquired infection and the probability of within-household transmission. The proportion infected in the community rose sharply over time, reaching up to 73% by June 2021. Household transmission varied by location, with secondary attack rates (SAR) ranging from 8.9% to 26.7%, and households accounting for 9% to 28% of infections. Simulations showed that including households with missing serology improved the precision of estimates without introducing bias. SAR estimates were consistent with findings from South Africa and slightly lower than global pooled estimates, mostly from high-income settings, suggesting different transmission dynamics in African contexts. Our approach for handling missing serology can improve the accuracy of transmission estimates.
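The chain-binomial idea can be illustrated with a minimal simulation sketch. This is a basic Reed-Frost type model, not the authors' extended model (which also accounts for missing serology), and the function name and the parameters `p_comm` and `p_hh` are hypothetical names introduced here for illustration.

```python
import random


def simulate_household(size, p_comm, p_hh, rng):
    """Simulate one household under a simple chain-binomial model.

    size   : number of household members
    p_comm : probability that a member is infected in the community (assumed)
    p_hh   : per-contact probability of within-household transmission (assumed)
    rng    : a random.Random instance
    Returns (community_infections, household_infections).
    """
    # Generation 0: each member is independently infected in the community.
    community = sum(rng.random() < p_comm for _ in range(size))
    susceptible = size - community
    infectious = community
    household = 0
    # Later generations: a susceptible escapes each infectious member
    # independently with probability 1 - p_hh.
    while infectious > 0 and susceptible > 0:
        p_infection = 1.0 - (1.0 - p_hh) ** infectious
        new_cases = sum(rng.random() < p_infection for _ in range(susceptible))
        susceptible -= new_cases
        household += new_cases
        infectious = new_cases
    return community, household
```

Repeating the simulation over many households and dividing total household infections by total infections gives a simulated counterpart of the share of infections attributable to households.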
Evaluation of potential approaches for counting person-time in instances where no active comparator is present
Emulating the target trial framework in pharmacoepidemiology is challenging when there is no active comparator. We evaluate six approaches to finding surrogate index dates for untreated patients, with the goal of identifying one or more approaches likely to give unbiased results. This numerical experiment used 73,070 patients from the MarketScan administrative databases (2013-2019) with type II diabetes, first-line therapy with metformin, and second-line therapy with either sodium-glucose cotransporter 2 inhibitors (SGLT2i's) or sulfonylureas. Patients taking sulfonylureas were converted into an experimental "untreated" arm. Part 1 sought to find surrogate index dates for the untreated arm. Part 2 compared the experimental estimates of the effect of SGLT2i's versus sulfonylureas on cardiovascular disease (CVD), obtained using the surrogate index dates, with the reference estimate. The reference hazard ratio (HR) was 0.69. The HRs after the respective approaches for selecting surrogate index dates were as follows: rejection sampling 0.61, 0.63; median 1.10, 1.15; prediction model 0.96; matching algorithm 1.07. Only the rejection sampling approaches for selecting a surrogate index date yielded results indicating little potential bias. Extreme care should be taken when making study design decisions for observational research questions that lack an active comparator group.
A preconception cohort study of historical mortgage lending discrimination and present-day fecundability
We estimated the association between redlining, a historic racist practice of mortgage lending discrimination, and fecundability, the per-cycle probability of conception. We analyzed data from 1901 U.S. female participants aged 21-45 in Pregnancy Study Online (PRESTO; 2013-2023), a prospective preconception cohort study. Participants completed self-administered questionnaires at baseline and every 2 months until conception or 12 months, whichever came first. Using participants' baseline residential addresses, we linked to neighborhoods graded by the U.S. Home Owners' Loan Corporation (HOLC) during the 1930s for perceived riskiness of mortgage lending: A + B (best or still desirable), C (definitely declining), and D (hazardous; ie, redlined). We used proportional probabilities regression models to estimate fecundability ratios (FRs) and 95% confidence intervals (CIs), adjusting for age, calendar year of enrollment, and geographic region of residence. Most participants resided in neighborhoods with riskier grades: 47.1% grade C and 21.7% grade D. Compared with neighborhoods graded A + B, FRs were 0.91 (95% CI, 0.81-1.03) and 0.86 (95% CI, 0.74-1.00) for neighborhoods graded C and D, respectively. In this preconception cohort study, current residence in a historically redlined or declining neighborhood was associated with a moderate decrease in fecundability.
Sampling for computational efficiency when conducting analyses in big data
A challenge to research in big data is the inherent computational intensity of analyses, particularly when using rigorous methods to address biases. We demonstrate the use of sampling methods in big data to estimate parameters using fewer resources. Our motivating question was whether lung cancer incidence differs by baseline HIV status, using a cohort of nearly 30 million Medicaid beneficiaries. We targeted three parameters (with listed estimator): incidence rate ratio (IRR, Poisson model), hazard ratio (HR, Cox model), and risk ratio (RR, Kaplan-Meier). We controlled for confounders using inverse probability weighting. We ran analyses using the full sample and several sampling schemes: divide-and-recombine (10, 20, 50 samples), sub-cohort, and case-cohort. We compared point estimates, standard errors, computation time, and memory used. We observed 1113 incident lung cancer diagnoses among 180,980 beneficiaries with HIV and 33,106 diagnoses among 29,179,940 beneficiaries without HIV. Findings were similar across target parameters. The sub-cohort and case-cohort approaches had estimates closer to the full sample and were faster and less memory-intensive than divide-and-recombine, especially when estimating the RR. Including non-sampled cases in the case-cohort resulted in increases in computation time and memory relative to the sub-cohort approach.
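As a rough sketch of the divide-and-recombine idea, the data can be split into disjoint chunks, the estimator run on each chunk, and the chunk estimates averaged. The crude log risk ratio below is a stand-in chosen for brevity; the study itself used inverse-probability-weighted Poisson, Cox, and Kaplan-Meier estimators, and the function names here are hypothetical.

```python
import math
import random


def log_risk_ratio(rows):
    """Crude log risk ratio from a list of (exposed, event) pairs coded 0/1."""
    exposed_events = sum(1 for x, y in rows if x == 1 and y == 1)
    exposed_total = sum(1 for x, _ in rows if x == 1)
    unexposed_events = sum(1 for x, y in rows if x == 0 and y == 1)
    unexposed_total = sum(1 for x, _ in rows if x == 0)
    risk_exposed = exposed_events / exposed_total
    risk_unexposed = unexposed_events / unexposed_total
    return math.log(risk_exposed / risk_unexposed)


def divide_and_recombine(rows, k, estimator, seed=0):
    """Shuffle rows, split them into k disjoint chunks, and average the estimates.

    Assumes every chunk contains events in both exposure groups; with a rare
    outcome and many chunks this can fail, which is one practical limitation.
    """
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    chunks = [shuffled[i::k] for i in range(k)]
    return sum(estimator(chunk) for chunk in chunks) / k
```

Each chunk can be processed separately (or in parallel), so peak memory depends on the chunk size rather than on the full cohort.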
Association of Naturalistic E-Cigarette Use and Smoking Cessation in U.S. Adults
Temporal imprecision and unaddressed confounding limit inferences about whether e-cigarette vaping provides real-world benefits or harms for combustible cigarette smoking cessation. This naturalistic, multilevel longitudinal study of US adults examined whether (semi-)monthly transitions in e-cigarette vaping were associated with attaining smoking abstinence and subsequent post-quit relapse. A nationally representative panel of 1,255 adults who smoked at baseline completed up to 22 bi-weekly and 13 monthly surveys between May 2020 and May 2022. Using multilevel longitudinal models, we assessed within-person transitions in past-7-day vaping status (none, non-daily, daily) as time-lagged, time-varying predictors of 7-day point-prevalence smoking outcomes (0 vs. 1-7 days). Daily vaping was associated with higher odds of achieving smoking abstinence two weeks later (30.4% vs. 16.3%; adjusted-RR[95%CI]=2.27[1.41-3.69]), while non-daily vaping was not significantly associated (22.5% vs. 16.3%; adjusted-RR[95%CI]=1.05[0.85-1.37]). In 557 instances where participants had achieved at least one month of smoking abstinence, both daily (adjusted-HR[95%CI]=1.31[1.09-1.70]) and non-daily (adjusted-HR[95%CI]=2.82[2.07-4.61]) vaping were linked to increased relapse risk compared to no vaping. These findings suggest that while daily vaping may support short-term smoking cessation, it is associated with a heightened risk of relapse among individuals who have already quit.
The influence of maternal life course exposure to neighborhood affluence and disadvantage on Black/White disparities in adverse birth outcomes in South Carolina
Exposure of mothers to neighborhood affluence or disadvantage during their childhood or adulthood may influence infant birth outcomes. Concurrently, these neighborhood exposures may be differently prevalent and impactful on health by race. Using a multigenerational dataset of maternally linked birth certificates from South Carolina (1989-2020), we investigated associations between maternal life course neighborhood exposure to affluence (neighborhood college completion) and disadvantage (neighborhood poverty) and infant low birth weight (LBW) and preterm birth (PTB), and differences by maternal Black/White race. Black women had a higher prevalence of LBW and PTB than White women, overall and within every exposure category. Black women were more often exposed to neighborhood disadvantage (high poverty), and less often exposed to neighborhood affluence (high college completion), in childhood and in adulthood than White women. Life course high (vs. low) affluence and low (vs. high) disadvantage neighborhood exposures were protectively associated with LBW, though only the latter was protective for PTB. Though we did not find evidence of differential vulnerability by maternal race to life course neighborhood exposures, the greater life course exposure of Black mothers to less affluent and more disadvantaged neighborhoods explained up to 9% of racial disparities even without effect modification present.
Association of State-Level Structural Racism with Subjective Cognitive Decline Prevalence: Behavioral Risk Factor Surveillance System, 2015-2016
This study examines whether exposure to structural racism (SR) is associated with subjective cognitive decline (SCD) among older adults. Data were from the 2010 Structural Racism-Related State Laws Database and the 2015-2016 Behavioral Risk Factor Surveillance System (n = 184 731). Exposure was an index of 22 state laws related to criminal justice, economics, healthcare, housing, immigration, and political participation. SCD was self-reported. In the full sample, results from the adjusted mixed-effects logistic regression model did not provide strong evidence of a difference in the odds of SCD for a two-unit greater SR value (OR = 1.00; 95% CI: 0.99-1.01). Similarly, results derived from the effect measure modification models, including product terms between SR and race/ethnicity, did not provide strong evidence of a difference in odds of SCD for White (OR = 0.99; 95% CI: 0.97-1.02) and Black (OR = 1.00; 95% CI: 0.97-1.02) respondents for a two-unit greater SR value. Conversely, Hispanic (OR = 0.91; 95% CI: 0.88-0.93) and Multiracial (OR = 0.95; 95% CI: 0.93-0.97) adults had lower odds of SCD for a two-unit greater SR value. Age-related differences in the association between SR and SCD were observed, with younger Native Hawaiian/Other Pacific Islanders (OR = 0.91; 95% CI: 0.88-0.94) and older American Indian/Alaska Natives (OR = 0.92; 95% CI: 0.89-0.94) experiencing lower odds of SCD for a two-unit greater SR value.
Pregnancy identification method as a source of bias in studies of prenatal exposures using real-world data
Researchers typically identify pregnancies in healthcare data based on observed outcomes. This approach misses pregnancies that received prenatal care but whose outcomes were not recorded, potentially inducing selection bias in prenatal effect estimates. Alternatively, prenatal encounters can be used to identify pregnancies with unobserved outcomes, but this requires addressing loss to follow-up (LTFU). We simulated 10,000,000 pregnancies and estimated the total effect of treatment on preeclampsia. Across 36 scenarios, we varied the treatment effect on miscarriage and/or preeclampsia; percent LTFU (5% or 20%); and cause of LTFU: (1) measured covariates, (2) unobserved miscarriage, and (3) both. We created three analytic samples to address LTFU (observed deliveries; observed deliveries and miscarriages; and all pregnancies) and estimated treatment effects using non-parametric direct standardization. Risk differences (RDs) and risk ratios (RRs) from the samples were similarly biased when LTFU was due to miscarriage (log-transformed RR bias: -0.12 to 0.33 among observed deliveries; -0.11 to 0.32 among observed deliveries and miscarriages; and -0.11 to 0.32 among all pregnancies). When predictors of LTFU were measured, only estimates among all pregnancies were unbiased (-0.27 to 0.33; -0.29 to 0.03; and -0.02 to 0.01, respectively). While including all pregnancies does not prevent bias, it quantifies the extent of selection, enabling direct assessment of its potential impact on findings.
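Non-parametric direct standardization can be sketched as follows, assuming a single categorical covariate `z`, a binary treatment, and a binary outcome with no loss to follow-up. The function name and data layout are hypothetical, and this simplified version does not reproduce the simulation design described above.

```python
from collections import Counter


def standardized_risk(rows, treatment):
    """Directly standardized risk: sum over z of P(Y=1 | A=treatment, Z=z) * P(Z=z).

    rows is a list of (z, a, y) tuples, where z is a covariate stratum,
    a is the treatment (0/1), and y is the outcome (0/1).
    """
    n = len(rows)
    stratum_sizes = Counter(z for z, _, _ in rows)
    risk = 0.0
    for z, n_z in stratum_sizes.items():
        outcomes = [y for zz, a, y in rows if zz == z and a == treatment]
        if not outcomes:
            # Positivity violation: nobody in this stratum received `treatment`.
            raise ValueError(f"no observations with treatment={treatment} in stratum {z!r}")
        risk += (sum(outcomes) / len(outcomes)) * (n_z / n)
    return risk


def risk_difference_and_ratio(rows):
    """Standardized risk difference and risk ratio, treated versus untreated."""
    risk_treated = standardized_risk(rows, 1)
    risk_untreated = standardized_risk(rows, 0)
    return risk_treated - risk_untreated, risk_treated / risk_untreated
```

Because the stratum-specific risks are weighted by the covariate distribution of the whole sample, the choice of analytic sample (deliveries only versus all pregnancies) changes both the stratum risks and the weights.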
A framework for the rigorous assessment of heterogeneous treatment effects from a single randomized controlled trial
Randomized controlled trials are the gold standard for estimating the average effect of a treatment in a target population, but the same treatment may benefit some patients while having no effect on or even harming others. This phenomenon, termed heterogeneous treatment effects, can be quantified by estimating treatment effects within subgroups of patients, defined by various combinations of baseline covariates. One approach for quantifying heterogeneous treatment effects is to develop "effect models" that directly model complex interactions between baseline covariates and treatment assignment. "Effect scores", derived from effect models, can then be used to rank patients based on their predicted treatment benefit, enabling targeted treatment regimens. In this article, we provide a rigorous general framework for developing and evaluating effect models to characterize heterogeneous treatment effects from a single randomized controlled trial. We address challenges in valid model development, such as overfitting, and illustrate our approach in a real-world dataset with time-to-event outcomes subject to right-censoring.
Enhancing COVID-19 vaccine effectiveness evidence generation using tokenized immunization registries
To describe a novel data ecosystem and quantify improvements in documented COVID-19 vaccine uptake using closed claims data and state vaccine registries.
Recommendations for estimating and reporting vaccine effectiveness by time since vaccination: a COVID-19 case study
Estimating COVID-19 vaccine effectiveness (VE) by time since vaccination (TSV) is essential for understanding how protection may change over time and enables meaningful comparisons across studies. This is important for accurate comparisons of VE against different SARS-CoV-2 variants/sublineages, across age groups, during different periods post vaccination campaign, or by vaccine type/brand. We provide recommendations for case-control VE studies on estimating and reporting VE analyses by TSV, with the aim of improving quality of these estimates. Our recommendations cover study design and pre-analysis considerations, descriptive analyses, choice of categories of TSV, categorical and continuous modelling approaches, and best practices for reporting VE by TSV. Using a real-life case-control study, we apply these recommendations, and include accompanying statistical scripts in R and Stata. These recommendations will serve as a practical resource for researchers conducting VE analyses by TSV. We encourage ongoing refinement of them through input from other study groups.
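In a case-control design, VE is commonly computed as (1 - odds ratio) x 100, with the odds ratio comparing the odds of vaccination among cases and controls. The sketch below applies that formula to crude counts within each TSV category. It is an unadjusted illustration in Python, not the authors' R or Stata scripts, and the function names and category labels are hypothetical.

```python
def vaccine_effectiveness(cases_vax, controls_vax, cases_unvax, controls_unvax):
    """Crude VE (%) = (1 - OR) * 100, with unvaccinated people as the reference."""
    odds_ratio = (cases_vax * controls_unvax) / (controls_vax * cases_unvax)
    return (1.0 - odds_ratio) * 100.0


def ve_by_tsv(counts_by_category, cases_unvax, controls_unvax):
    """Crude VE for each time-since-vaccination category.

    counts_by_category maps a TSV label to (vaccinated cases, vaccinated controls)
    whose last dose falls in that category; every category shares the same
    unvaccinated reference group.
    """
    return {
        label: vaccine_effectiveness(cases, controls, cases_unvax, controls_unvax)
        for label, (cases, controls) in counts_by_category.items()
    }
```

In practice the odds ratios would come from a regression model adjusted for confounders such as age and calendar time, with TSV entered as a categorical or continuous term.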
Using directed acyclic graphs to determine whether multiple imputation or subsample-multiple imputation estimates of an exposure-outcome association are unbiased
Missing data is a pervasive problem in epidemiology, with multiple imputation (MI) a commonly used analysis method. MI is valid when data are missing at random (MAR). However, definitions of MAR with multiple incomplete variables are not easily interpretable and descriptions of graphical model-based conditions are not accessible to applied researchers. Previous literature shows that MI may be valid in subsamples, even if not in the full dataset. Practical guidance on applying MI with multiple incomplete variables is lacking. We present an algorithm using directed acyclic graphs to determine when MI will estimate an exposure-outcome coefficient without bias. We extend the algorithm to assess whether MI in a subsample of the data, in which some variables are complete, and the remaining are imputed, will be valid and unbiased for the exposure-outcome coefficient. We apply the algorithm to several simple exemplars, and in a more complex real-life example highlight that only subsample-MI of the outcome would be valid. Our algorithm provides researchers with the tools to decide whether to use MI in practice when there are multiple incomplete variables. Further work could focus on the likely size and direction of biases, and the impact of different missing data patterns.
Augmenting fact and date of death in electronic health records using internet media sources: a validation study from two large healthcare systems
This study evaluated death ascertainment from publicly available internet sources for patients in two large tertiary care US healthcare systems, Mass General Brigham (MGB) and Vanderbilt University Medical Center (VUMC), benchmarked against state and federal vital statistics data. Names, dates of birth, and dates of death were extracted from 8.1 million internet media records using previously developed natural language processing models. Internet records were matched to 78 848 deceased patients from MGB and VUMC on first name, last name, and date of birth. Dates of death were validated against state vital statistics databases or the National Death Index as reference standards. We calculated sensitivity and positive predictive values (PPV) of internet sources in identifying dates of death within 7 days of the reference standard. Exact matching of records between internet media and reference standards on first name, last name, and date of birth yielded 30 067 (38.8%) matches, with PPVs for death identification of 98.2% (MGB) and 98.9% (VUMC), and increased sensitivity of death capture over EHR alone by 24% at MGB and 18% at VUMC. In conclusion, using internet sources to augment mortality data increased capture of death meaningfully over reliance on EHR records alone.
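The two validation metrics can be sketched as follows, assuming records have already been linked by a patient identifier. The data layout and function name are hypothetical, and the 7-day window follows the definition given in the abstract.

```python
from datetime import date


def sensitivity_and_ppv(internet_dates, reference_dates, window_days=7):
    """Sensitivity and PPV of internet-derived dates of death.

    internet_dates  : dict mapping patient id -> date of death from internet media
    reference_dates : dict mapping patient id -> reference-standard date of death
    A true positive is an internet date within `window_days` of the reference date.
    """
    true_positives = sum(
        1
        for patient_id, found in internet_dates.items()
        if patient_id in reference_dates
        and abs((found - reference_dates[patient_id]).days) <= window_days
    )
    sensitivity = true_positives / len(reference_dates)
    ppv = true_positives / len(internet_dates)
    return sensitivity, ppv
```

Sensitivity uses all reference-standard deaths as the denominator, whereas PPV uses all deaths reported by the internet source, so the two can diverge sharply when only a minority of patients are matched.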
Hypertensive Disorders of Pregnancy, Maternal Cardiovascular Disease Mortality and the role of Familial Predisposition: A Norwegian Population-Based Sibling-Comparison, Sibling-Spillover and Negative-Control Cohort Study
Hypertensive disorders of pregnancy (HDP) are associated with increased maternal cardiovascular disease (CVD) mortality, with risks varying by HDP subtypes and subsequent pregnancy outcomes. The contribution of shared familial factors given this heterogeneity is unclear. We conducted a population-based study using Norwegian registries (1967-2020) including 1,106,658 women with complete pregnancy histories, of whom 628,345 had at least one full sibling. Women with HDP were classified into low-risk (gestational hypertension or term preeclampsia followed by no HDPs) and high-risk (all other patterns) trajectories. CVD mortality before age 70 was assessed using population-level and sibling-based models: sibling-comparison (discordant-sisters), sibling-spillover (by sister's HDP history), and negative-control models (by sister-in-law's HDP history). CVD mortality among women with HDP varied by trajectory (population-level adjusted hazard ratio [aHR], low-risk: 1.03 [95% confidence interval, 0.89-1.20]; high-risk: 1.89 [1.74-2.06]). These differences persisted when compared to sisters without HDP (sibling-comparison aHR, low-risk: 0.66 [0.44-1.01]; high-risk: 1.51 [1.16-1.97]). Women without HDP had slightly elevated CVD mortality if sisters had HDP (sibling-spillover aHR, low-risk: 1.28 [1.03-1.60]; high-risk: 1.25 [1.06-1.49]), but not if sisters-in-law had HDP (negative-control aHR, low-risk: 1.10 [0.85-1.40]; high-risk: 1.01 [0.83-1.22]). Individual-specific factors drive the CVD mortality heterogeneity among women with HDP. Shared familial factors modestly elevate CVD mortality in women without HDP.
Marginal structural models for quantifying the causal effects of exposure to ambient air pollution on progression of CT emphysema in the MESA Lung and MESA Air Studies
Associations between exposure to ambient air pollution and progression of emphysema have been identified in longitudinal observational studies. However, previous work has not used statistical causal inference methods tailored to address bias from time-varying confounding. The objective of this study is to propose an analytical approach for estimating longitudinal health effects of air pollution while accounting for time-varying confounding using marginal structural models and to re-analyze data on air pollution and emphysema progression from the Multi-Ethnic Study of Atherosclerosis (MESA) using this analytical approach. We estimate weights for continuous exposure levels using two techniques: quantile binning of the exposure and a semiparametric model for the requisite conditional densities. The latter approach incorporates flexible machine learning methods. We find evidence for the harmful effects of ambient ozone pollution during study follow-up on the progression of emphysema, consistent with previously reported results. We find no evidence of effects of NOx during study follow-up. This investigation demonstrates that analyses based on marginal structural models are feasible in studies of the health effects of air pollution and may address possible sources of bias that traditional regression-based methods fail to address. Further investigation is warranted to understand differences between our findings and previously published results.
Data validation in multinational observational studies with error-prone data: applying an optimal validation sampling strategy in a study of Kaposi sarcoma and HIV
A large multi-center study was conducted to investigate factors associated with Kaposi sarcoma (KS) among people with HIV (PWH). The study used routinely collected (eg, electronic health record) data from 257 429 PWH in Latin America and East Africa. Although the routinely collected data contain rich information on key clinical and demographic variables, previous chart reviews of these datasets have raised some concerns about their accuracy. While validating all data is impractical, a subset of participants' records can be validated and then combined with the error-prone data using techniques developed for measurement error and missing data problems to obtain consistent estimators and valid inference. A key step is thus choosing which records to validate, particularly with a rare outcome such as KS. Validated records should be informative for the research questions, while keeping the selection probabilistic to have results generalizable for the study population. We describe an optimal multi-wave validation procedure to internally validate 1000 patient records to maximize precision of parameters of interest and better understand the incidence and prevalence of KS among PWH. We also describe challenges encountered with implementing the optimal validation design in a complex setting with a rare outcome and multiple study sites across two continents.
On Modeling the Shared Environment
The shared environment is a key component in estimating the heritability of phenotypic traits from familial data. Unlike the genetic components, defined by the kinship parameters, the shared environment stems from complex, latent processes that vary in nature and impact across traits. This concept is intricate, encompassing latent variables and ambiguous interpretations that differ across scientific disciplines. A common approach assumes a 100% correlation for the shared environment among all family members. However, this model has inherent limitations and may fail to capture the dynamics of the conditions that constitute the shared environment within and across traits. This study explores aspects of the shared environment and adopts a more general approach to modeling its correlation structure. We introduce models that represent different dynamic structures, enabling alternative interpretations of shared environmental influence on the transmission of phenotypic traits. A more realistic correlation structure for the shared environment will result in more accurate and precise heritability estimates for a given trait, as well as deeper understanding of its etiology. We demonstrate the performance of our proposed models through simulations and application to data on body mass index (BMI) and systolic blood pressure (SBP) from Norwegian health surveys linked to family data.
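The role of the shared-environment correlation can be illustrated with the standard variance-components decomposition, in which the trait covariance between relatives i and j is 2 x kinship x additive genetic variance, plus shared-environment correlation x shared-environment variance, plus residual variance on the diagonal. This is a generic sketch under that assumption, not the specific models proposed in the paper, and all names and numeric values are hypothetical.

```python
def family_covariance(kinship, shared_env_corr, var_a, var_c, var_e):
    """Trait covariance matrix for one family under a variance-components model.

    kinship         : matrix of kinship coefficients (0.5 on the diagonal)
    shared_env_corr : matrix of shared-environment correlations (1 on the diagonal)
    var_a, var_c, var_e : additive genetic, shared-environment, residual variances
    """
    n = len(kinship)
    return [
        [
            2.0 * kinship[i][j] * var_a
            + shared_env_corr[i][j] * var_c
            + (var_e if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


def heritability(var_a, var_c, var_e):
    """Narrow-sense heritability: additive genetic share of total variance."""
    return var_a / (var_a + var_c + var_e)


# Two full siblings: kinship coefficient 0.25 between them.
SIBLING_KINSHIP = [[0.5, 0.25], [0.25, 0.5]]
# Conventional assumption: shared environment is perfectly correlated.
FULL_SHARING = [[1.0, 1.0], [1.0, 1.0]]
# A relaxed alternative with an assumed sibling correlation of 0.5.
PARTIAL_SHARING = [[1.0, 0.5], [0.5, 1.0]]
```

Under the same variance components, relaxing the shared-environment correlation lowers the implied sibling covariance, which is why fitting the wrong correlation structure can shift variance between the genetic and environmental components.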
Damp housing conditions as a determinant of psychological distress: a longitudinal analysis of the British Household Panel Survey
Limited evidence exists regarding whether damp housing contributes to psychological distress. This study aimed to quantify the relationship between damp housing exposure and psychological distress. Data from the British Household Panel Survey (1996-2008) were used to assess the effect of damp housing on psychological distress in British households (n = 9,189 at baseline). Indoor dampness exposure was measured using multiple indicators (condensation, leaky roof, rot, and damp walls/floors) and a measure of severity that quantified the number of exposures. Psychological distress was measured using a binary variable derived from the General Health Questionnaire. Multivariate fixed effects logistic regression models analysed the hypothesised associations. Exposure to damp housing was associated with increased odds of psychological distress (OR = 1.09, 95% CI: 1.05, 1.14, P<0.01). Condensation was the strongest predictor (OR = 1.09, 95% CI: 1.03, 1.13, P<0.01). With each additional dampness indicator, odds of psychological distress increased by 4% (OR = 1.04, 95% CI: 1.02, 1.07, P<0.01). Among combinations of dampness indicators, the strongest association was for condensation and rot in windows/floors (OR = 1.25, 95% CI: 1.11, 1.40, P<0.01). These findings suggest damp housing exposure may increase the risk of psychological distress. Further research should investigate underlying mechanisms.
Childhood adversity and spontaneous abortion in a North American preconception cohort study
Childhood adversity has been associated with adverse adult health outcomes. We investigated its association with spontaneous abortion (SAB) risk and the potential buffering effects of social support and integration (SSI). This analysis included 6100 participants from Pregnancy Study Online, a North American preconception cohort study of females attempting spontaneous conception (2013-2024). We assessed childhood adversity via the Adverse Childhood Experiences (ACE) scale and Brief Trauma Questionnaire (BTQ), lifetime SSI via the Berkman-Syme Social Network Index, and pregnancy outcomes via follow-up questionnaires. We used Cox proportional hazards regression models to estimate hazard ratios (HR) and 95% confidence intervals (CI), adjusted for potential confounders. Neither the ACE score nor individual ACE domains were appreciably associated with SAB risk. However, participants who reported childhood physical (HR = 1.11, 95% CI: 0.92-1.35), sexual (HR = 1.12, 95% CI: 0.96-1.30), or both abuse types (HR = 1.09, 95% CI: 0.90-1.32) on the BTQ had slightly increased SAB risk compared with those who reported no abuse. Associations were stronger among participants who reported lower childhood SSI (physical and sexual abuse vs. no abuse: HR = 1.76, 95% CI: 1.15-2.68). These findings indicate that BTQ-ascertained physical and sexual abuse may be associated with SAB risk among those with lower childhood SSI.
Constructing targeted minimum loss/maximum likelihood estimators: a simple illustration to build intuition
Machine learning is increasingly used to estimate nuisance functions in causal inference. The efficient influence function (EIF) offers a principled way to construct estimators that can incorporate machine learning with valid inference (e.g., valid confidence intervals). In this Tutorial, we illustrate how to construct targeted maximum likelihood/minimum loss estimators (TMLE) from the EIF, a topic that is well covered in the statistical literature but remains less accessible to applied researchers. A companion paper, Renson et al. 2025 (AJE, kwaf169), provides a thorough but approachable description of the EIF and its derivation for a statistical estimand.
