On the Connections Among Three Transfer Learning Paradigms
We examine the solution paths of three transfer learning estimators within the context of linear models. Our analysis identifies a general solution path that characterizes how transfer learning estimators interpolate between the target and source estimators. This involves a change of basis and entry-wise weighting functions, with existing transfer learning methods identified as special cases. The proposed framework reveals connections and equivalences among transfer learning approaches, offering valuable insights for designing future estimators with improved control over the learning process. It extends beyond traditional penalized regression and gradient descent techniques and holds the potential for generalization to nonlinear cases. Extensive simulations validate the theoretically derived solution paths, and their practical utility is demonstrated by the improvement in risk prediction for end-stage renal disease in Hispanic populations.
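The simplest concrete instance of such a solution path is ridge regression shrunk toward a source estimate: as the penalty weight moves from zero to infinity, the fit interpolates from the target least-squares estimator to the source estimator. The sketch below is a generic special case for illustration only (the function and variable names are ours), not the paper's general framework with change of basis and entry-wise weighting.

```python
import numpy as np

def transfer_ridge(X, y, beta_source, lam):
    """Ridge regression shrunk toward a source estimate: solves
    min ||y - X b||^2 + lam * ||b - beta_source||^2.
    As lam -> 0 this recovers the target OLS fit; as lam -> inf it
    returns beta_source, tracing one simple transfer solution path."""
    p = X.shape[1]
    A = X.T @ X + lam * np.eye(p)
    return np.linalg.solve(A, X.T @ y + lam * beta_source)
```

Varying `lam` over a grid traces the full path between the two endpoint estimators.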
What is it that you say you do here? Advocating for the critical role of data scientists in research infrastructure
Clinical and academic research continues to become more complex as our knowledge and technology advance. A substantial and growing number of specialists in biostatistics, data science, and library sciences are needed to support these research systems and promote high-caliber research. However, that support is often marginalized as optional rather than a fundamental component of research infrastructure. By building research infrastructure, an institution harnesses access to tools and support/service centers that host skilled experts who approach research with best practices in mind and domain-specific knowledge at hand. We outline the potential roles of data scientists and statisticians in research infrastructure and recommend guidelines for advocating for the institutional resources needed to support these roles in a sustainable and efficient manner for the long-term success of the institution. We provide these guidelines in terms of resource efficiency, monetary efficiency, and long-term sustainability. We hope this work contributes to, and provides shared language for, a conversation on a broader framework beyond metrics that can be used to advocate for needed resources.
Methods for building a staff workforce of quantitative scientists in academic health care
Collaborative quantitative scientists, including biostatisticians, epidemiologists, bio-informaticists, and data-related professionals, play vital roles in research, from study design to data analysis and dissemination. It is imperative that academic health care centers (AHCs) establish an environment that provides opportunities for the quantitative scientists who are hired as staff to develop and advance their careers. With the rapid growth of clinical and translational research, AHCs are charged with establishing organizational methods, training tools, best practices, and guidelines to accelerate and support hiring, training, and retaining this staff workforce. This paper describes three essential elements for building and maintaining a successful unit of collaborative staff quantitative scientists in academic health care centers: (1) organizational infrastructure and management, (2) recruitment, and (3) career development and retention. Specific strategies are provided as examples of how AHCs can excel in these areas.
Accelerating Resident Research within Quantitative Collaboration Units in Academic Healthcare
With increased access to biomedical and electronic health records data and the complexity of research questions, individuals in residency programs who aim to conduct research require specialized educational programs and biostatistics support. Biostatistics collaboration units in academic health centers often work with residents to conduct data-intensive research. These units face numerous challenges related to providing training in statistical literacy and collaborating on resident-led research within very restricted timelines. Since 2019, the Duke Biostatistics, Epidemiology, and Research Design (BERD) Methods Core has supported over 247 resident-led projects by developing tools and resources to address these challenges. This manuscript presents novel processes and training materials that other institutions can use to help biostatistics collaboration units effectively support resident training programs. We provide a framework to support the development of collaborative teams, along with specialized training materials for residents who collaborate with these teams.
Multivariate differential association analysis
Identifying how dependence relationships vary across different conditions plays a significant role in many scientific investigations. For example, when comparing biological systems, it is important to see whether relationships between genomic features differ between cases and controls. In this paper, we seek to evaluate whether relationships between two sets of variables differ across two conditions. We propose a new kernel-based test to capture such differential dependence: the test determines whether two measures that detect dependence relationships are similar or not under the two conditions. We derive the asymptotic permutation null distribution of the test statistic and show that it works well in finite samples, making the test computationally efficient and significantly enhancing its usability for analyzing large datasets. We demonstrate through numerical studies that our proposed test has high power for detecting differential linear and non-linear relationships. The proposed method is implemented in the R package kerDAA.
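The permutation logic behind such a test can be sketched generically. The snippet below uses the absolute difference of Pearson correlations as a simple stand-in statistic for the paper's kernel-based dependence measure (the function name and statistic choice are ours, for illustration only), and shuffles condition labels to obtain a permutation p-value.

```python
import numpy as np

def perm_test_diff_dependence(x1, y1, x2, y2, n_perm=2000, seed=0):
    """Permutation test for whether the (x, y) dependence differs between
    two conditions. Uses |r_1 - r_2| (Pearson) as an illustrative stand-in
    for a kernel-based dependence measure; returns a permutation p-value."""
    rng = np.random.default_rng(seed)

    def r(x, y):
        return np.corrcoef(x, y)[0, 1]

    obs = abs(r(x1, y1) - r(x2, y2))
    x = np.concatenate([x1, x2])
    y = np.concatenate([y1, y2])
    n1 = len(x1)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(x))      # shuffle condition labels,
        a, b = idx[:n1], idx[n1:]          # keeping (x, y) pairs intact
        if abs(r(x[a], y[a]) - r(x[b], y[b])) >= obs:
            count += 1
    return (count + 1) / (n_perm + 1)      # add-one permutation p-value
```

Note that pairs are permuted jointly, so within-pair dependence is preserved and only the condition assignment is randomized.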
A guide to successful management of collaborative partnerships in quantitative research: An illustration of the science of team science
Data-intensive research continues to expand with the goal of improving healthcare delivery, clinical decision-making, and patient outcomes. Quantitative scientists, such as biostatisticians, epidemiologists, and informaticists, are tasked with turning data into health knowledge. In academic health centres, quantitative scientists are critical to the missions of biomedical discovery and improvement of health. Many academic health centres have developed centralized Quantitative Science Units which foster the dual goals of professional development of quantitative scientists and production of high-quality, reproducible domain research. Such units then develop teams of quantitative scientists who can collaborate with researchers. However, existing literature does not provide guidance on how such teams are formed or how to manage and sustain them. Leaders of Quantitative Science Units across six institutions formed a working group to examine common practices and tools that can serve as best practices for Quantitative Science Units that wish to achieve these dual goals through building long-term partnerships with researchers. The results of this working group are presented to provide tools and guidance for Quantitative Science Units challenged with developing, managing, and evaluating Quantitative Science Teams. This guidance aims to help Quantitative Science Units effectively participate in and enhance the research that is conducted throughout the academic health centre, shaping their resources to fit evolving research needs.
Regularized Buckley-James method for right-censored outcomes with block-missing multimodal covariates
High-dimensional data with censored outcomes of interest are prevalent in medical research. To analyze such data, the regularized Buckley-James estimator has been successfully applied to build accurate predictive models and conduct variable selection. In this paper, we consider the problem of parameter estimation and variable selection for the semiparametric accelerated failure time model for high-dimensional block-missing multimodal neuroimaging data with censored outcomes. We propose a penalized Buckley-James method that can simultaneously handle block-wise missing covariates and censored outcomes while performing variable selection. The proposed method is evaluated in simulations and applied to a multimodal neuroimaging dataset, where it obtains meaningful results.
The development of a mobile app-focused deduplication strategy for the Apple Heart Study that informs recommendations for future digital trials
An app-based clinical trial enrolment process can contribute to duplicated records, carrying data management implications. Our objective was to identify duplicated records in real time in the Apple Heart Study (AHS). We leveraged personal identifiable information (PII) to develop a dissimilarity score (DS) using the Damerau-Levenshtein distance. For computational efficiency, we focused on four types of records at the highest risk of duplication. We used the receiver operating characteristic (ROC) curve and resampling methods to derive and validate a decision rule to classify duplicated records. We identified 16,398 (4%) duplicated participants, resulting in 419,297 unique participants out of a total of 438,435 possible. Our decision rule yielded a high positive predictive value (96%) with negligible impact on the trial's original findings. Our findings provide principled solutions for future digital trials. When establishing deduplication procedures for digital trials, we recommend collecting device identifiers in addition to participant identifiers; collecting and ensuring secure access to PII; conducting a pilot study to identify reasons for duplicated records; establishing an initial deduplication algorithm that can be refined; creating a data quality plan that informs refinement; and embedding the initial deduplication algorithm in the enrolment platform to ensure unique enrolment and linkage to previous records.
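The core computation, a Damerau-Levenshtein distance aggregated into a per-pair dissimilarity score, can be sketched as follows. This is a minimal illustration, not the AHS pipeline: it uses the restricted (optimal string alignment) variant of the distance, and the field names are hypothetical placeholders for PII fields.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment (restricted Damerau-Levenshtein) distance:
    edits are insertion, deletion, substitution, and adjacent transposition."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

def dissimilarity_score(rec1: dict, rec2: dict,
                        fields=("first", "last", "dob", "email")) -> float:
    """Average normalized edit distance over PII fields (0 = identical).
    Field names here are illustrative, not the AHS schema."""
    scores = []
    for f in fields:
        a = str(rec1.get(f, "")).lower()
        b = str(rec2.get(f, "")).lower()
        denom = max(len(a), len(b)) or 1
        scores.append(osa_distance(a, b) / denom)
    return sum(scores) / len(scores)
```

A classification threshold on this score would then be chosen from an ROC analysis, as in the study.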
A comprehensive survey of collaborative biostatistics units in academic health centers
The organizational structures of collaborative biostatistics units in academic health centers (AHCs) in the United States and their important contributions to research are an evolving and active area of discussion and inquiry. Collaborative biostatistics units may serve as a centralized resource to investigators across various disciplines, as shared infrastructure for investigators within a discipline (e.g., cancer), or as a combination of both. The characteristics of such units vary greatly, and there has been no comprehensive review of their organizational structures described in the literature to date. This manuscript summarizes the current infrastructure of such units using responses from 129 leaders. Most leaders were over 45 years old, held doctoral degrees, and were on a 12-month appointment. Over half were tenured or on a tenure track and held primary appointments in a school of medicine. The career advancement metrics rated most important included being funded as a co-investigator on NIH grants and being either first or second author on peer-reviewed publications. Team composition was diverse in terms of expertise and training, and funding sources were typically hybrid. These results provide a benchmark for collaboration models and evaluation and may be used by institutional administrators as they build, evaluate, or restructure current collaborative quantitative support infrastructure.
A modified SEIR model with a jump in the transmission parameter applied to COVID-19 data on Wuhan
In December 2019, Wuhan, the capital of Hubei Province, was struck by an outbreak of COVID-19. Numerous studies have been conducted to fit COVID-19 data and make statistical inferences. In applications, functions of the parameters in the model are usually used to assess severity of the outbreak. Because of the strategies applied during the struggle against the pandemic, the trend of the parameters changes abruptly. However, time-varying parameters with a jump have received scant attention in the literature. In this study, a modified SEIR model is proposed to fit the actual situation of the COVID-19 epidemic. In the proposed model, the dynamic propagation system is modified because of the high infectivity during incubation, and a time-varying parametric strategy is suggested to account for the utility of the intervention. A corresponding model selection algorithm based on the information criterion is also suggested to detect the jump in the transmission parameter. A real data analysis based on the COVID-19 epidemic in Wuhan and a simulation study demonstrate the plausibility and validity of the proposed method.
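The idea of a transmission parameter that changes abruptly at an intervention time can be illustrated with a standard discrete-time SEIR skeleton with a piecewise-constant transmission rate. The sketch below is ours: it omits the paper's modification for infectivity during incubation, and all parameter values (rates, compartment sizes, jump day) are purely illustrative.

```python
import numpy as np

def seir_with_jump(S0, E0, I0, R0, beta_before, beta_after, jump_day,
                   sigma=1/5.2, gamma=1/10, days=120):
    """Discrete-time SEIR where the transmission rate beta jumps at
    `jump_day`, mimicking an abrupt intervention effect.
    sigma = 1/incubation period, gamma = 1/infectious period."""
    N = S0 + E0 + I0 + R0
    S, E, I, R = [S0], [E0], [I0], [R0]
    for t in range(days):
        beta = beta_before if t < jump_day else beta_after
        new_exposed = beta * S[-1] * I[-1] / N   # S -> E
        new_infectious = sigma * E[-1]           # E -> I
        new_removed = gamma * I[-1]              # I -> R
        S.append(S[-1] - new_exposed)
        E.append(E[-1] + new_exposed - new_infectious)
        I.append(I[-1] + new_infectious - new_removed)
        R.append(R[-1] + new_removed)
    return np.array(S), np.array(E), np.array(I), np.array(R)
```

In the paper, the jump day is not fixed in advance but detected with an information-criterion-based model selection algorithm.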
A hierarchical meta-analysis for settings involving multiple outcomes across multiple cohorts
Evidence from animal models and epidemiological studies has linked prenatal alcohol exposure (PAE) to a broad range of long-term cognitive and behavioural deficits. However, there is a paucity of evidence regarding the nature and levels of PAE associated with increased risk of clinically significant cognitive deficits. To derive robust and efficient estimates of the effects of PAE on cognitive function, we have developed a hierarchical meta-analysis approach to synthesize information regarding the effects of PAE on cognition, integrating data on multiple outcomes from six U.S. longitudinal cohort studies. A key assumption of standard methods of meta-analysis, that effect sizes are independent, is violated when multiple intercorrelated outcomes are synthesized across studies. Our approach involves estimating the dose-response coefficients for each outcome and then pooling these correlated dose-response coefficients to obtain an estimated "global" effect of exposure on cognition. In the first stage, we use individual participant data to derive estimates of the effects of PAE by fitting regression models that adjust for potential confounding variables using propensity scores. The correlation matrix characterizing the dependence between the outcome-specific dose-response coefficients estimated within each cohort is then estimated, while accommodating incomplete information on some outcomes. We also compare inferences based on the proposed approach to inferences based on a full multivariate analysis.
A Unified Approach for Outliers and Influential Data Detection - The Value of Information in Retrospect
Identifying influential and outlying data is important as it would guide the effective collection of future data and the proper use of existing information. We develop a unified approach for outlier detection and influence analysis. Our proposed method is grounded in the intuitive value of information concepts and has a distinct advantage in interpretability and flexibility when compared to existing methods: it decomposes the data influence into the leverage effect (expected to be influential) and the outlying effect (surprisingly more influential than expected); and it applies to general decision problems, including estimation, prediction, and hypothesis testing. We study the theoretical properties of three value of information quantities, establish the relationship between the proposed measures and classic measures in the linear regression setting, and provide real data analysis examples of how to apply the new value of information approach in the cases of linear regression, generalized linear mixed model, and hypothesis testing.
Count-valued time series models for COVID-19 daily death dynamics
We propose a generalized non-linear state-space model for count-valued time series of COVID-19 fatalities. To capture the dynamic changes in daily COVID-19 death counts, we specify a latent state process that involves second-order differencing and an AR(1)-ARCH(1) model. These modelling choices are motivated by the application and validated by model assessment. We consider and fit a progression of Bayesian hierarchical models under this general framework. Using COVID-19 daily death counts from New York City's five boroughs, we evaluate and compare the considered models through predictive model assessment. Our findings justify the elements included in the proposed model. The proposed model is further applied to time series of COVID-19 deaths from the four most populous counties in Texas. These model fits illuminate dynamics associated with multiple dynamic phases and show the applicability of the framework to localities beyond New York City.
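The latent-state structure, second-order differencing combined with AR(1)-ARCH(1) innovations, can be illustrated with a forward simulation. The sketch below generates counts from this kind of latent process; it is an illustrative simulation only (our function name, parameter values, and a Poisson observation layer), not the fitted Bayesian hierarchical model of the paper.

```python
import numpy as np

def simulate_counts(T=100, phi=0.5, a0=0.001, a1=0.3, mu0=3.0, seed=1):
    """Simulate a count series whose latent log-intensity has
    AR(1)-ARCH(1) innovations on its second differences.
    An illustrative sketch; all parameter values are arbitrary."""
    rng = np.random.default_rng(seed)
    mu = [mu0, mu0]                # latent log-intensity
    w_prev, eps_prev = 0.0, 0.0
    y = []
    for t in range(T):
        h = a0 + a1 * eps_prev**2           # ARCH(1) conditional variance
        eps = rng.normal(0.0, np.sqrt(h))
        w = phi * w_prev + eps              # AR(1) on the second differences
        mu.append(w + 2 * mu[-1] - mu[-2])  # invert the double differencing
        # cap the log-intensity purely to guard against numerical overflow
        y.append(rng.poisson(np.exp(min(mu[-1], 20.0))))
        w_prev, eps_prev = w, eps
    return np.array(y)
```

The double differencing lets the latent trend change slope smoothly, while the ARCH term allows bursts of volatility like those seen in daily death counts.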
A mutual information criterion with applications to canonical correlation analysis and graphical models
This paper derives a criterion for deciding conditional independence that is consistent with small-sample corrections of Akaike's information criterion but is easier to apply to such problems as selecting variables in canonical correlation analysis and selecting graphical models. The criterion reduces to mutual information when the assumed distribution equals the true distribution; hence, it is called mutual information criterion (MIC). Although small-sample Kullback-Leibler criteria for these selection problems have been proposed previously, some of which are not widely known, MIC is strikingly more direct to derive and apply.
Stochastic actor-oriented modelling of the impact of COVID-19 on financial network evolution
The coronavirus disease 2019 (COVID-19) pandemic has led to tremendous loss of human life and has had severe social and economic impacts worldwide. The spread of the disease has also caused dramatic uncertainty in financial markets, especially in the early stages of the pandemic. In this paper, we adopt the stochastic actor-oriented model (SAOM) to model dynamic/longitudinal financial networks with covariates constructed from the network statistics of COVID-19 dynamic pandemic networks. Our findings provide evidence that the transmission risk of COVID-19, measured in transformed pandemic risk scores, is a main explanatory factor of financial network connectedness from March to May 2020. The pandemic statistics and transformed pandemic risk scores can give early signs of the intense connectedness of the financial markets in mid-March 2020. The SAOM approach can thus be used to predict possible financial contagion using pandemic network statistics and transformed pandemic risk scores of COVID-19 and other pandemics.
Bayesian nonparametric multiway regression for clustered binomial data
We introduce a Bayesian nonparametric regression model for data with multiway (tensor) structure, motivated by an application to periodontal disease (PD) data. Our outcome is the number of diseased sites measured over four different tooth types for each subject, with subject-specific covariates available as predictors. The outcomes are not well characterized by simple parametric models, so we use a nonparametric approach with a binomial likelihood wherein the latent probabilities are drawn from a mixture with an arbitrary number of components, analogous to a Dirichlet process. We use a flexible probit stick-breaking formulation for the component weights that allows for covariate dependence and clustering structure in the outcomes. The parameter space for this model is large and multiway. We reduce its effective dimensionality, and account for the multiway structure, via low-rank assumptions. We illustrate how this can improve performance and simplify interpretation while still providing sufficient flexibility. We describe a general and efficient Gibbs sampling algorithm for posterior computation. The resulting fit to the PD data outperforms competitors and is interpretable and well calibrated. An interactive visual of the predictive model is available at the website (https://ericfrazerlock.com/toothdata/ToothDisplay.html), and the code is available at the GitHub (https://github.com/lockEF/NonparametricMultiway).
A spatiotemporal case-crossover model of asthma exacerbation in the City of Houston
Case-crossover design is a popular construction for analyzing the impact of a transient effect, such as ambient pollution levels, on an acute outcome, such as an asthma exacerbation. Case-crossover design avoids the need to model individual, time-varying risk factors for cases by using cases as their own 'controls', chosen to be time periods for which individual risk factors can be assumed constant and need not be modelled. Many studies have examined the complex effects of the control period structure on model performance, but these discussions were simplified when case-crossover design was shown to be equivalent to various specifications of Poisson regression when exposure is considered constant across study participants. While reasonable for some applications, there are cases where such an assumption does not apply due to spatial variability in exposure, which may affect parameter estimation. This work presents a spatiotemporal model which combines a temporal case-crossover structure with a geometrically aware spatial random effect based on the Hausdorff distance. The model construction incorporates residual spatial structure in cases where the constant-exposure assumption is not reasonable and where spatial regions are irregular.
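The Hausdorff distance underlying such a geometrically aware random effect measures how far apart two regions are in the worst case: the largest distance from a point in one set to its nearest point in the other. A minimal sketch for finite point sets (e.g., sampled region boundaries) is shown below; for real applications, `scipy.spatial.distance.directed_hausdorff` offers an efficient alternative.

```python
import numpy as np

def hausdorff(A: np.ndarray, B: np.ndarray) -> float:
    """Hausdorff distance between two finite point sets,
    given as (n, 2) and (m, 2) coordinate arrays."""
    # pairwise Euclidean distances between every point of A and of B
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    # max over each set of the distance to the nearest point in the other
    return max(D.min(axis=1).max(), D.min(axis=0).max())
```

Because it depends only on the geometry of the regions, this distance can be plugged into a spatial covariance function even when the regions are irregular.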
Robust inference for non-linear regression models from the Tsallis score: Application to coronavirus disease 2019 contagion in Italy
We discuss an approach for robust fitting of non-linear regression models, in both frequentist and Bayesian frameworks, which can be employed to model and predict the contagion dynamics of coronavirus disease 2019 (COVID-19) in Italy. The focus is on the analysis of epidemic data using robust dose-response curves, but the methodology applies to arbitrary non-linear regression models.
Developing partnerships for academic data science consulting and collaboration units
Data science consulting and collaboration units (DSUs) are core infrastructure for research at universities. Activities span data management, study design, data analysis, data visualization, predictive modelling, preparing reports, manuscript writing and advising on statistical methods and may include an experiential or teaching component. Partnerships are needed for a thriving DSU as an active part of the larger university network. Guidance for identifying, developing and managing successful partnerships for DSUs can be summarized in six rules: (1) align with institutional strategic plans, (2) cultivate partnerships that fit your mission, (3) ensure sustainability and prepare for growth, (4) define clear expectations in a partnership agreement, (5) communicate and (6) expect the unexpected. While these rules are not exhaustive, they are derived from experiences in a diverse set of DSUs, which vary by administrative home, mission, staffing and funding model. As examples in this paper illustrate, these rules can be adapted to different organizational models for DSUs. Clear expectations in partnership agreements are essential for high quality and consistent collaborations and address core activities, duration, staffing, cost and evaluation. A DSU is an organizational asset that should involve thoughtful investment if the institution is to gain real value.
Accounting for extra-binomial variability with differentially expressed genetic pathway data: a collaborative bioinformatic study
We describe a collaborative project involving faculty and students in a university bioinformatics/biostatistics center. The project focuses on identification of differentially expressed gene sets ("pathways") in subjects expressing a disease state, medical intervention, or other distinguishable condition. The key feature of the endeavor is the data structure presented to the team: a single cohort of subjects with two samples taken from each subject - one for each of two differing conditions without replication. This particular structure leads to essentially a cohort of contingency tables, where each table compares the differential gene state with the pathway condition. Recognizing that correlations both within and between pathway responses can disrupt standard table analytics, we develop methods for analyzing this data structure in the presence of complicated intra-table correlations. These provide some convenient approaches for this problem, using design effect adjustments from sample survey theory and manipulations of the summary table counts. Monte Carlo simulations show that the methods operate extremely well, validating their use in practice. In the end, the collaborative connections among the team members led to solutions no one of us would have envisioned separately.
Deep learning models to predict primary open-angle glaucoma
Glaucoma is a major cause of blindness and vision impairment worldwide, and visual field (VF) tests are essential for monitoring conversion to glaucoma. While previous studies have primarily focused on using VF data at a single time point for glaucoma prediction, there has been limited exploration of longitudinal trajectories. Additionally, many deep learning techniques treat time-to-glaucoma prediction as a binary classification problem (glaucoma yes/no), resulting in the misclassification of some censored subjects into the non-glaucoma category and decreased power. To tackle these challenges, we propose and implement several deep-learning approaches that naturally incorporate temporal and spatial information from longitudinal VF data to predict time-to-glaucoma. When evaluated on the Ocular Hypertension Treatment Study (OHTS) dataset, our proposed convolutional neural network (CNN)-long short-term memory (LSTM) model emerged as the top-performing model among all those examined. The implementation code can be found online (https://github.com/rivenzhou/VF_prediction).
