spsurvey: Spatial Sampling Design and Analysis in R
spsurvey is an R package for design-based statistical inference, with a focus on spatial data. It provides the generalized random-tessellation stratified (GRTS) algorithm to select spatially balanced samples via the grts() function. The grts() function flexibly accommodates several sampling design features, including stratification, varying inclusion probabilities, legacy (or historical) sites, minimum distances between sites, and two options for replacement sites. spsurvey also provides a suite of data analysis options, including categorical variable analysis (cat_analysis()), continuous variable analysis (cont_analysis()), relative risk analysis (relrisk_analysis()), attributable risk analysis (attrisk_analysis()), difference in risk analysis (diffrisk_analysis()), change analysis (change_analysis()), and trend analysis (trend_analysis()). In this manuscript, we first provide background for the GRTS algorithm and the analysis approaches and then show how to implement them in spsurvey. We find that the spatially balanced GRTS algorithm yields more precise parameter estimates than simple random sampling, which ignores spatial information.
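As a brief sketch of the selection workflow (using the NE_Lakes data shipped with recent spsurvey releases; argument names follow the version 5 interface and should be checked against the installed version):

    library(spsurvey)
    set.seed(51)

    # Equal-probability spatially balanced sample of 50 sites, with 10
    # reverse hierarchically ordered replacement sites
    eqprob <- grts(NE_Lakes, n_base = 50, n_over = 10)

    # Stratified design with stratum-specific sample sizes
    strat <- grts(NE_Lakes, n_base = c(low = 25, high = 25),
                  stratum_var = "ELEV_CAT")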
Regularized Ordinal Regression and the ordinalNet R Package
Regularization techniques such as the lasso (Tibshirani 1996) and elastic net (Zou and Hastie 2005) can be used to improve regression model coefficient estimation and prediction accuracy, as well as to perform variable selection. Ordinal regression models are widely used in applications where the use of regularization could be beneficial; however, these models are not included in many popular software packages for regularized regression. We propose a coordinate descent algorithm to fit a broad class of ordinal regression models with an elastic net penalty. Furthermore, we demonstrate that each model in this class generalizes to a more flexible form that can be used to model either ordered or unordered categorical response data. We call this the elementwise link multinomial-ordinal (ELMO) class, and it includes widely used models such as multinomial logistic regression (which also has an ordinal form) and ordinal logistic regression (which also has an unordered multinomial form). We introduce an elastic net penalty class that applies to either model form, and additionally, this penalty can be used to shrink a non-ordinal model toward its ordinal counterpart. Finally, we introduce the R package ordinalNet, which implements the algorithm for this model class.
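A minimal sketch of the fitting interface, assuming the ordinalNet(x, y, family, link, alpha) signature from the package documentation:

    library(ordinalNet)

    # Elastic net penalized cumulative logit model; alpha = 0.5 mixes the
    # lasso and ridge penalties. x is a numeric matrix, y an ordered factor.
    set.seed(1)
    x <- matrix(rnorm(200 * 5), 200, 5)
    y <- cut(x[, 1] + rnorm(200), breaks = 3,
             labels = c("low", "mid", "high"), ordered_result = TRUE)
    fit <- ordinalNet(x, y, family = "cumulative", link = "logit", alpha = 0.5)
    summary(fit)   # fit statistics along the lambda path
    coef(fit)      # coefficients at the selected lambda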
SeqNet: An R Package for Generating Gene-Gene Networks and Simulating RNA-Seq Data
Gene expression data provide an abundant resource for inferring connections in gene regulatory networks. While methodologies developed for this task have shown success, a challenge remains in comparing the performance among methods: gold-standard datasets are scarce and limited in use, and while tools for simulating expression data are available, they are not designed to resemble the data obtained from RNA-seq experiments. SeqNet is an R package that provides tools for generating a rich variety of gene network structures and simulating RNA-seq data from them, producing RNA-seq data suitable for benchmarking and assessing gene network inference methods. The package is available on CRAN and on GitHub at https://github.com/tgrimes/SeqNet.
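A sketch of the simulation workflow, assuming the random_network(), gen_partial_correlations(), and gen_rnaseq() functions described in the package documentation:

    library(SeqNet)

    set.seed(1)
    nw  <- random_network(p = 100)            # network structure on 100 genes
    nw  <- gen_partial_correlations(nw)       # add edge weights to the model
    sim <- gen_rnaseq(n = 50, network = nw)   # 50 simulated RNA-seq profiles
    dim(sim$x)                                # samples-by-genes count matrix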
FamEvent: An R Package for Generating and Modeling Time-to-Event Data in Family Designs
FamEvent is a comprehensive R package for simulating and modelling age-at-disease onset in families carrying a rare gene mutation. The package can simulate complex family data for variable time-to-event outcomes under three common family study designs (population, high-risk clinic and multi-stage) with various levels of missing genetic information among family members. Residual familial correlation can be induced through the inclusion of a frailty term or a second gene. Disease-gene carrier probabilities are evaluated assuming Mendelian transmission or empirically from the data. When genetic information on the disease gene is missing, an Expectation-Maximization algorithm is employed to calculate the carrier probabilities. Penetrance model functions with ascertainment correction adapted to the sampling design provide age-specific cumulative disease risks by sex, mutation status, and other covariates for simulated data as well as real data analysis. Robust standard errors and 95% confidence intervals are available for these estimates. Plots of pedigrees and penetrance functions based on the fitted model provide graphical displays to evaluate and summarize the models.
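A sketch of the simulation interface, assuming the simfam() argument names from the package documentation (these may differ across versions):

    library(FamEvent)

    # 100 families under a population-based design with carrier probands
    # ("pop+"), with a gamma frailty inducing residual familial correlation
    set.seed(4321)
    fam <- simfam(N.fam = 100, design = "pop+", variation = "frailty",
                  base.dist = "Weibull", frailty.dist = "gamma",
                  base.parms = c(0.016, 3), vbeta = c(-1.13, 2.35),
                  depend = 1, allelefreq = 0.02)
    summary(fam)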
MultiBUGS: A Parallel Implementation of the BUGS Modelling Framework for Faster Bayesian Inference
MultiBUGS is a new version of the general-purpose Bayesian modelling software BUGS that implements a generic algorithm for parallelising Markov chain Monte Carlo (MCMC) algorithms to speed up posterior inference of Bayesian models. The algorithm parallelises evaluation of the product-form likelihoods formed when a parameter has many children in the directed acyclic graph (DAG) representation, and parallelises sampling of conditionally independent sets of parameters. A heuristic algorithm is used to decide which approach to use for each parameter and to apportion computation across computational cores. This enables MultiBUGS to automatically parallelise the broad range of statistical models that can be fitted using BUGS-language software, making the dramatic speed-ups of modern multi-core computing accessible to applied statisticians, without requiring any experience of parallel programming. We demonstrate the use of MultiBUGS on simulated data designed to mimic a hierarchical e-health linked-data study of methadone prescriptions including 425,112 observations and 20,426 random effects. Posterior inference for the e-health model takes several hours in existing software, but MultiBUGS can perform inference in only 28 minutes using 48 computational cores.
idem: An R Package for Inferences in Clinical Trials with Death and Missingness
In randomized controlled trials of seriously ill patients, death is common and often defined as the primary endpoint. Increasingly, non-mortality outcomes such as functional outcomes are co-primary or secondary endpoints. Functional outcomes are not defined for patients who die, referred to as "truncation due to death", and among survivors, functional outcomes are often unobserved due to missed clinic visits or loss to follow-up. It is well known that if the functional outcomes "truncated due to death" or missing are handled inappropriately, treatment effect estimation can be biased. In this paper, we describe the idem package, which implements a procedure for comparing treatments that is based on a composite endpoint of mortality and the functional outcome among survivors. Among survivors, the procedure incorporates a missing data imputation procedure with a sensitivity analysis strategy. A web-based graphical user interface is provided in the package to facilitate users conducting the proposed analysis in an interactive and user-friendly manner. We demonstrate the package using data from a recent trial of sedation interruption among mechanically ventilated patients.
The Calculus of M-Estimation in R with geex
M-estimation, or estimating equation, methods are widely applicable for point estimation and asymptotic inference. In this paper, we present the R package geex, which can find roots and compute the empirical sandwich variance estimator for any set of user-specified, unbiased estimating equations. Examples from the M-estimation primer by Stefanski and Boos (2002) demonstrate use of the software. The package also includes a framework for finite sample, heteroscedastic, and autocorrelation variance corrections, and a website with an extensive collection of tutorials.
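For instance, the familiar mean-and-variance example from the primer translates directly into an estFUN (a minimal sketch):

    library(geex)

    # Stacked estimating equations psi(Y, theta) = (Y - theta1,
    # (Y - theta1)^2 - theta2) identify the mean and variance of Y.
    mean_var_estfun <- function(data) {
      Y <- data$Y
      function(theta) {
        c(Y - theta[1],
          (Y - theta[1])^2 - theta[2])
      }
    }

    set.seed(1)
    results <- m_estimate(
      estFUN = mean_var_estfun,
      data   = data.frame(Y = rnorm(100, mean = 2)),
      root_control = setup_root_control(start = c(1, 1))
    )
    coef(results)  # roots of the estimating equations
    vcov(results)  # empirical sandwich variance estimator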
Image Segmentation, Registration and Characterization in R with SimpleITK
Many types of medical and scientific experiments acquire raw data in the form of images. Various forms of image processing and image analysis are used to transform the raw image data into quantitative measures that are the basis of subsequent statistical analysis. In this article we describe the R package SimpleITK, a simplified interface to the Insight Segmentation and Registration Toolkit (ITK). ITK is an open source C++ toolkit that has been actively developed over the past 18 years and is widely used by the medical image analysis community. SimpleITK provides packages for many interpreter environments, including R. Currently, it includes several hundred classes for image analysis, including a wide range of image input and output, filtering operations, and higher level components for segmentation and registration. Using SimpleITK, development of complex combinations of image and statistical analysis procedures is feasible. This article includes several examples of computational image analysis tasks implemented using SimpleITK, including spherical marker localization, multi-modal image registration, segmentation evaluation, and cell image analysis.
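A small sketch of the procedural interface from R (the filename is a placeholder; function names follow SimpleITK's procedural API):

    library(SimpleITK)

    img      <- ReadImage("cells.png")                # placeholder filename
    smoothed <- SmoothingRecursiveGaussian(img, 2.0)  # Gaussian smoothing
    seg      <- OtsuThreshold(smoothed)               # automatic thresholding
    mask     <- as.array(seg)                         # SimpleITK image -> R array
    table(mask)                                       # foreground/background counts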
clustvarsel: A Package Implementing Variable Selection for Gaussian Model-Based Clustering in R
Finite mixture modeling provides a framework for cluster analysis based on parsimonious Gaussian mixture models. Variable or feature selection is of particular importance in situations where only a subset of the available variables provide clustering information. This enables the selection of a more parsimonious model, yielding more efficient estimates, a clearer interpretation and, often, improved clustering partitions. This paper describes the R package clustvarsel, which performs subset selection for model-based clustering. An improved version of the Raftery and Dean (2006) methodology is implemented in the new release of the package to find the (locally) optimal subset of variables with group/cluster information in a dataset. Search over the solution space is performed using either a stepwise greedy search or a headlong algorithm. Adjustments for speeding up these algorithms are discussed, as well as a parallel implementation of the stepwise search. Usage of the package is presented through the discussion of several data examples.
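A minimal sketch of a greedy search on a familiar dataset (the subset and model components reflect the documented return value):

    library(clustvarsel)

    # Greedy stepwise search for the variable subset carrying cluster
    # information, fitting Gaussian mixtures with 1 to 5 components
    out <- clustvarsel(iris[, 1:4], G = 1:5)
    out$subset          # selected clustering variables
    summary(out$model)  # mclust model on the selected subset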
Optimum Allocation for Adaptive Multi-Wave Sampling in R: The R Package optimall
The R package optimall offers a collection of functions that efficiently streamline the design process of sampling in surveys ranging from simple to complex. The package's main functions allow users to interactively define and adjust strata cut points based on values or quantiles of auxiliary covariates, adaptively calculate the optimum number of samples to allocate to each stratum using Neyman or Wright allocation, and select specific units to sample based on a stratified sampling design. Using real-life epidemiological study examples, we demonstrate how optimall facilitates an efficient workflow for the design and implementation of surveys in R. Although tailored towards multi-wave sampling under two- or three-phase designs, the package may be useful for any sampling survey.
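A sketch of Neyman allocation with optimum_allocation(); the argument names follow the package documentation and should be verified against the installed version:

    library(optimall)

    set.seed(1)
    df <- data.frame(
      strata = rep(c("a", "b", "c"), each = 100),
      y      = c(rnorm(100, sd = 1), rnorm(100, sd = 2), rnorm(100, sd = 4))
    )
    # Strata with larger variance in y receive proportionally more samples
    optimum_allocation(data = df, strata = "strata", y = "y",
                       nsample = 100, method = "Neyman")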
Near-Far Matching in R: The nearfar Package
Estimating the causal treatment effect of an intervention using observational data is difficult due to unmeasured confounders. Many analysts use instrumental variables (IVs) to introduce a randomizing element to observational data analysis, potentially reducing bias created by unobserved confounders. Several persistent problems have limited IV analyses, particularly the prevalence of "weak" IVs, instrumental variables that do not effectively randomize individuals to the intervention or control group (leading to biased and unstable treatment effect estimates); IV-based estimates are also highly model dependent, require parametric adjustment for measured confounders, and often have high mean squared errors in the estimated causal effects. To overcome these problems, the study design method of "near-far matching" has been devised, which "filters" data from a cohort by simultaneously matching individuals within the cohort to be "near" (similar) on measured confounders and "far" (different) on levels of an IV. To facilitate the application of near-far matching to analytical problems, we introduce the R package nearfar and illustrate its application to both a classical example and a simulated dataset. We illustrate how the package can be used to "strengthen" a weak IV by adjusting the "near-ness" and "far-ness" of a match, reduce model dependency, enable nonparametric adjustment for measured confounders, and lower mean squared error in estimated causal effects. We additionally illustrate how to utilize the package when analyzing either continuous or binary treatments, how to prioritize variables in the match, and how to calculate statistics of IV strength with or without adjustment for measured confounders.
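A sketch with synthetic data; opt_nearfar() and its arguments follow the package documentation and should be verified with ?opt_nearfar:

    library(nearfar)

    set.seed(1)
    n <- 200
    dat <- data.frame(age = rnorm(n, 50, 10),
                      sex = rbinom(n, 1, 0.5),
                      iv  = rnorm(n),       # instrument
                      trt = rnorm(n))       # continuous treatment dose
    # Match pairs to be "near" on confounders and "far" on the instrument
    m <- opt_nearfar(dta = dat, trt = "trt", covs = c("age", "sex"),
                     iv = "iv", trt.type = "cont")
    m   # matched-pair structure and measures of IV strength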
General Semiparametric Shared Frailty Model: Estimation and Simulation with frailtySurv
The R package frailtySurv for simulating and fitting semi-parametric shared frailty models is introduced. frailtySurv implements semi-parametric consistent estimators for a variety of frailty distributions, including gamma, log-normal, inverse Gaussian and power variance function, and provides consistent estimators of the standard errors of the parameter estimates. The parameter estimates are asymptotically normally distributed, and therefore statistical inference based on the results of this package, such as hypothesis testing and confidence intervals, can be performed using the normal distribution. Extensive simulations demonstrate the flexibility and correct implementation of the estimators. Two case studies performed with publicly available datasets demonstrate applicability of the package. In the Diabetic Retinopathy Study, the onset of blindness is clustered by patient, and in a large hard drive failure dataset, failure times are thought to be clustered by the hard drive manufacturer and model.
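A sketch of the simulate-then-fit workflow, assuming genfrail()'s documented arguments (including a target censoring rate) and its default covariate naming:

    library(frailtySurv)

    set.seed(2015)
    # Simulate 300 clusters of size 2 with gamma frailty (theta = 2)
    dat <- genfrail(N = 300, K = 2, frailty = "gamma", theta = 2,
                    censor.rate = 0.3)
    fit <- fitfrail(Surv(time, status) ~ Z1 + cluster(family),
                    dat, frailty = "gamma")
    fit          # regression coefficients and frailty parameter
    vcov(fit)    # consistent variance estimates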
LocalControl: An R Package for Comparative Safety and Effectiveness Research
The R package LocalControl implements novel approaches to address biases and confounding when comparing treatments or exposures in observational studies of outcomes. While designed and appropriate for use in comparative safety and effectiveness research involving medicine and the life sciences, the package can be used in other situations involving outcomes with multiple confounders. LocalControl is an open-source tool for researchers whose aim is to generate high quality evidence using observational data. The package implements a family of methods for non-parametric bias correction when comparing treatments in observational studies, including survival analysis settings, where competing risks and/or censoring may be present. The approach extends to bias-corrected personalized predictions of treatment outcome differences, and analysis of heterogeneity of treatment effect sizes across patient subgroups.
NeuralNetTools: Visualization and Analysis Tools for Neural Networks
Supervised neural networks have been applied as a machine learning technique to identify and predict emergent patterns among multiple variables. A common criticism of these methods is the inability to characterize relationships among variables from a fitted model. Although several techniques have been proposed to "illuminate the black box", they have not been made available in an open-source programming environment. This article describes the NeuralNetTools package, which can be used for the interpretation of supervised neural network models created in R. Functions in the package can be used to visualize a model using a neural network interpretation diagram, evaluate variable importance by disaggregating the model weights, and perform a sensitivity analysis of the response variables to changes in the input variables. Methods are provided for objects from many of the common neural network packages in R, including caret, neuralnet, nnet, and RSNNS. The article provides a brief overview of the theoretical foundation of neural networks, a description of the package structure and functions, and an applied example to provide a context for model development with NeuralNetTools. Overall, the package provides a toolset for neural networks that complements existing quantitative techniques for data-intensive exploration.
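As a brief sketch using the example data shipped in the package:

    library(NeuralNetTools)
    library(nnet)

    # Fit a small single-hidden-layer network on the packaged neuraldat data
    set.seed(123)
    mod <- nnet(Y1 ~ X1 + X2 + X3, data = neuraldat, size = 5,
                linout = TRUE, trace = FALSE)

    plotnet(mod)     # neural interpretation diagram
    garson(mod)      # variable importance by disaggregating model weights
    olden(mod)       # importance retaining the sign of weight products
    lekprofile(mod)  # sensitivity of the response to each input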
Application of Equal Local Levels to Improve Q-Q Plot Testing Bands with R Package qqconf
Quantile-Quantile (Q-Q) plots are often difficult to interpret because it is unclear how large the deviation from the theoretical distribution must be to indicate a lack of fit. Most Q-Q plots could benefit from the addition of meaningful global testing bands, but the use of such bands unfortunately remains rare because of the drawbacks of current approaches and packages. These drawbacks include incorrect global Type I error rate, lack of power to detect deviations in the tails of the distribution, relatively slow computation for large data sets, and limited applicability. To solve these problems, we apply the equal local levels global testing method, which we have implemented in the R package qqconf, a versatile tool to create Q-Q plots and probability-probability (P-P) plots in a wide variety of settings, with simultaneous testing bands rapidly created using recently-developed algorithms. qqconf can easily be used to add global testing bands to Q-Q plots made by other packages. In addition to being quick to compute, these bands have a variety of desirable properties, including accurate global levels, equal sensitivity to deviations in all parts of the null distribution (including the tails), and applicability to a range of null distributions. We illustrate the use of qqconf in several applications: assessing normality of residuals from regression, assessing accuracy of p-values, and use of Q-Q plots in genome-wide association studies.
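A minimal sketch, assuming the qq_conf_plot() interface from the package documentation:

    library(qqconf)

    # Q-Q plot of regression residuals against the normal distribution,
    # with equal local levels testing bands at a global level of 0.05
    fit <- lm(dist ~ speed, data = cars)
    qq_conf_plot(obs = residuals(fit), distribution = qnorm)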
PResiduals: An R Package for Residual Analysis Using Probability-Scale Residuals
We present the R package PResiduals for residual analysis using the probability-scale residual. This residual is well defined for a wide variety of outcome types and models, including some settings where other popular residuals are not applicable. It can be used for model diagnostics, tests of conditional associations, and covariate adjustment for Spearman's rank correlation. These tests and measures of conditional association are applicable to any orderable variable. They use order information but do not require assigning scores to ordered categorical variables or transforming continuous variables, and therefore can achieve a good balance between robustness and efficiency. We illustrate the usage of the package with a publicly available dataset.
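A sketch of both uses, assuming the presid() and partial_Spearman() interfaces from the package documentation:

    library(PResiduals)
    library(MASS)

    # Probability-scale residuals from an ordinal regression model
    fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
    head(presid(fit))

    # Covariate-adjusted Spearman correlation between x and y given z
    set.seed(1)
    df <- data.frame(z = rnorm(200))
    df$x <- df$z + rnorm(200)
    df$y <- df$z + rnorm(200)
    partial_Spearman(x | y ~ z, data = df)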
Elastic Net Regularization Paths for All Generalized Linear Models
The lasso and elastic net are popular regularized regression models for supervised learning. Friedman, Hastie, and Tibshirani (2010) introduced a computationally efficient algorithm for computing the elastic net regularization path for ordinary least squares regression, logistic regression and multinomial logistic regression, while Simon, Friedman, Hastie, and Tibshirani (2011) extended this work to Cox models for right-censored data. We further extend the reach of elastic net regularized regression to all generalized linear model families, Cox models with (start, stop] data and strata, and a simplified version of the relaxed lasso. We also discuss convenient utility functions for measuring the performance of these fitted models.
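Since version 4.0, glmnet() accepts any stats family object, so the same path machinery covers all GLM families; for example:

    library(glmnet)

    set.seed(1)
    x <- matrix(rnorm(500 * 10), 500, 10)
    y <- rpois(500, exp(x[, 1] / 2))

    fit   <- glmnet(x, y, family = poisson())     # Poisson elastic net path
    cvfit <- cv.glmnet(x, y, family = poisson())  # cross-validated deviance
    plot(cvfit)

    rfit <- glmnet(x, y, family = poisson(), relax = TRUE)  # relaxed lasso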
Probabilistic Estimation and Projection of the Annual Total Fertility Rate Accounting for Past Uncertainty: A Major Update of the bayesTFR R Package
The bayesTFR package for R provides a set of functions to produce probabilistic projections of the total fertility rate (TFR) for all countries, and is widely used, including as part of the basis for the UN's official population projections for all countries. Liu and Raftery (2020) extended the theoretical model by adding a layer that accounts for past TFR estimation uncertainty. A major update of bayesTFR implements this new extension. Moreover, a new feature producing annual TFR estimates and projections extends the existing functionality of estimating and projecting over five-year time periods. An additional autoregressive component has been developed to account for the larger autocorrelation in the annual version of the model. This article summarizes the updated model, describes the basic steps to generate probabilistic estimates and projections under different settings, compares performance, and provides instructions on how to summarize, visualize and diagnose the model results.
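A sketch of the updated workflow; iteration counts are kept tiny for illustration, and the argument names (annual, uncertainty, use.tfr3) follow the updated package and should be checked against the installed version:

    library(bayesTFR)

    sim.dir <- tempfile("bayesTFR")
    # Estimate with past-TFR uncertainty on annual data (toy run lengths)
    mc <- run.tfr.mcmc(output.dir = sim.dir, nr.chains = 1, iter = 50,
                       annual = TRUE, uncertainty = TRUE)
    # Probabilistic projections from the stored simulation
    pred <- tfr.predict(sim.dir = sim.dir, burnin = 10, use.tfr3 = FALSE)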
BoXHED2.0: Scalable Boosting of Dynamic Survival Analysis
Modern applications of survival analysis increasingly involve time-dependent covariates. The Python package BoXHED2.0 (Boosted eXact Hazard Estimator with Dynamic covariates) is a tree-boosted hazard estimator that is fully nonparametric and applicable to survival settings far more general than right-censoring, including recurring events and competing risks. BoXHED2.0 is also scalable to the point of being on the same order of speed as parametric boosted survival models, in part because its core is written in C++ and it also supports the use of GPUs and multicore CPUs. BoXHED2.0 is available from PyPI and also from www.github.com/BoXHED.
Regression Modeling for Recurrent Events Possibly with an Informative Terminal Event Using R Package reReg
Recurrent event analyses have found a wide range of applications in biomedicine, public health, and engineering, among others, where study subjects may experience a sequence of events of interest during follow-up. The R package reReg offers a comprehensive collection of practical and easy-to-use tools for regression analysis of recurrent events, possibly with the presence of an informative terminal event. The regression framework is a general scale-change model which encompasses the popular Cox-type model, the accelerated rate model, and the accelerated mean model as special cases. Informative censoring is accommodated through a subject-specific frailty without any need for parametric specification. Different regression models are allowed for the recurrent event process and the terminal event. Also included are visualization and simulation tools.
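A sketch of the interface, assuming the simGSC() simulator and its default column names as documented:

    library(reReg)

    set.seed(1)
    simDat <- simGSC(n = 200)   # simulate from the general scale-change model
    # Cox-type rate model; B bootstrap replicates for standard errors
    fit <- reReg(Recur(t.stop, id, event, status) ~ x1 + x2,
                 data = simDat, model = "cox", B = 50)
    summary(fit)
    plotEvents(Recur(t.stop, id, event, status) ~ 1, data = simDat)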
BayesCTDesign: An R Package for Bayesian Trial Design Using Historical Control Data
This article introduces the R (R Core Team 2019) package BayesCTDesign for two-arm randomized Bayesian trial design using historical control data when available, and simple two-arm randomized Bayesian trial design when historical control data are not available. The package, which is available on CRAN, has two simulation functions, historic_sim() and simple_sim(), for studying trial characteristics under user-defined scenarios, and two methods, print() and plot(), for displaying summaries of the simulated trial characteristics. The package works with two-arm trials with equal sample sizes per arm. The package allows a user to study Gaussian, Poisson, Bernoulli, Weibull, Lognormal, and Piecewise Exponential (pwe) outcomes. Power for two-sided hypothesis tests at a user-defined alpha is estimated via simulation, using a test within each simulation replication that compares a 95% credible interval for the outcome-specific treatment effect measure to the null case value. If the 95% credible interval excludes the null case value, the null hypothesis is rejected; otherwise, the null hypothesis is accepted. In the article, the idea of including historical control data in a Bayesian analysis is reviewed, the estimation process of BayesCTDesign is explained, and the user interface is described. Finally, the package is illustrated via several examples.
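The decision rule can be illustrated in a few lines of base R (a stand-alone illustration of the rule, not BayesCTDesign's implementation):

    # Reject the null when the 95% credible interval for the treatment
    # effect excludes the null case value
    set.seed(1)
    null_value      <- 0
    posterior_draws <- rnorm(4000, mean = 0.4, sd = 0.2)  # stand-in posterior
    ci <- quantile(posterior_draws, c(0.025, 0.975))
    reject <- ci[1] > null_value || ci[2] < null_value
    reject  # TRUE: the interval excludes the null value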
