Statistical Methods and Applications

Hypothesis Tests of Indirect Effects for Multiple Mediators
Kidd J, Howard AG, Highland HM, Gordon-Larsen P, Bancks MP, Carnethon M and Lin DY
Mediation analysis seeks to determine whether an independent variable affects a response directly or whether it does so indirectly, by way of a mediator or mediators. Scenarios that assume a single mediation are often overly simplistic, and analyses that include multiple mediators are becoming more common, particularly with the incorporation of high-dimensional data. Surprisingly, however, little attention has been given to multiple mediator and interaction effects. In this article, we propose new methods for testing the null hypothesis of no indirect effect with multiple mediators and interaction effects. We allow the estimators of the path effects to be possibly correlated; we also consider the practice of using confidence intervals to determine whether a mediation effect is zero. We compare the performance of our proposed method with existing methods through extensive simulation studies. Finally, we provide an application to data from the Coronary Artery Risk Development in Young Adults (CARDIA) study.
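As background on the quantity being tested, here is a minimal sketch of the classical single-mediator case. It is not the authors' proposed method. It assumes linear models M = i1 + aX + e1 and Y = i2 + bM + cX + e2, estimates the indirect effect as the product a*b by ordinary least squares, and builds a percentile bootstrap interval of the kind the abstract alludes to. The function names and the simulated data are illustrative assumptions.

```python
import random

def indirect_effect(x, m, y):
    """Product-of-coefficients estimate a*b for one mediator (OLS with intercepts)."""
    n = len(x)
    x_bar, m_bar, y_bar = sum(x) / n, sum(m) / n, sum(y) / n
    dx = [v - x_bar for v in x]
    dm = [v - m_bar for v in m]
    dy = [v - y_bar for v in y]
    sxx = sum(u * u for u in dx)
    smm = sum(u * u for u in dm)
    sxm = sum(u * v for u, v in zip(dx, dm))
    sxy = sum(u * v for u, v in zip(dx, dy))
    smy = sum(u * v for u, v in zip(dm, dy))
    a = sxm / sxx                                           # slope of M on X
    b = (smy * sxx - sxm * sxy) / (smm * sxx - sxm * sxm)   # slope of Y on M, given X
    return a * b

def bootstrap_ci(x, m, y, n_boot=500, alpha=0.05, seed=1):
    """Percentile bootstrap interval for the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]          # resample cases
        estimates.append(indirect_effect([x[i] for i in idx],
                                         [m[i] for i in idx],
                                         [y[i] for i in idx]))
    estimates.sort()
    return estimates[int(alpha / 2 * n_boot)], estimates[int((1 - alpha / 2) * n_boot) - 1]

# Simulated data (assumption): true a = 1 and b = 1, so the true indirect effect is 1.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(300)]
m = [xi + rng.gauss(0, 1) for xi in x]
y = [mi + 0.5 * xi + rng.gauss(0, 1) for xi, mi in zip(x, m)]
lo, hi = bootstrap_ci(x, m, y)
print("95% percentile interval for a*b:", (lo, hi))
```

An interval that excludes zero is commonly read as evidence of mediation; the abstract indicates that the paper examines this practice and extends testing to multiple mediators, interactions and correlated path estimators, none of which this sketch covers.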
Sequential adaptive strategies for sampling rare clustered populations
Mecatti F, Sismanidis C, Furfaro E and Conti PL
A new class of sampling strategies is proposed that can be applied to population-based surveys targeting a rare trait that is unevenly spread over an area of interest. Our proposal is characterised by the ability to tailor the data collection to specific features and challenges of the survey at hand. It is based on integrating an adaptive component into a sequential selection, which aims both to intensify the detection of positive cases by exploiting the spatial clustering and to provide a flexible framework for managing logistics and budget constraints. A class of estimators is also proposed to account for the selection bias; these are proved to be unbiased for the population mean (prevalence) as well as consistent and asymptotically normally distributed. Unbiased variance estimation is also provided. A ready-to-implement weighting system is developed for estimation purposes. Two special strategies included in the proposed class are presented; they are based on Poisson sampling and are proved to be more efficient. The selection of primary sampling units is also illustrated for tuberculosis prevalence surveys, which are recommended in many countries and supported by the World Health Organisation, as an emblematic example of the need for an improved sampling design. Simulation results are given for the tuberculosis application to illustrate the strengths and weaknesses of the proposed sequential adaptive sampling strategies with respect to the traditional cross-sectional non-informative sampling currently suggested by World Health Organisation guidelines.
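For readers unfamiliar with Poisson sampling, the following is a minimal sketch of its plain, non-adaptive and non-sequential form: each unit enters the sample independently with its own inclusion probability, and the Horvitz-Thompson estimator weights each sampled value by the inverse of that probability. This is textbook material rather than the authors' design; the toy population, the probabilities and the function names are assumptions.

```python
import random

def poisson_sample(probs, rng):
    """Include unit i independently with probability probs[i]."""
    return [i for i, p in enumerate(probs) if rng.random() < p]

def ht_mean(values, probs, sample):
    """Horvitz-Thompson estimate of the population mean (inverse-probability weights)."""
    return sum(values[i] / probs[i] for i in sample) / len(values)

# Toy population (assumption): 50 units, 5 of which carry the rare trait.
values = [1 if i % 10 == 0 else 0 for i in range(50)]
# Hypothetical design that samples positive units more intensively.
probs = [0.8 if v else 0.3 for v in values]

rng = random.Random(42)
true_mean = sum(values) / len(values)
# Averaging over repeated samples illustrates the unbiasedness of the estimator.
n_rep = 2000
avg_estimate = sum(ht_mean(values, probs, poisson_sample(probs, rng))
                   for _ in range(n_rep)) / n_rep
print("true prevalence:", true_mean, "average HT estimate:", avg_estimate)
```

The inverse-probability weights are what remove the selection bias introduced by oversampling units more likely to carry the trait.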
Correction to: Using sentiment analysis to evaluate the impact of the COVID-19 outbreak on Italy's country reputation and stock market performance
Zammarchi G, Mola F and Conversano C
[This corrects the article DOI: 10.1007/s10260-023-00690-5.]
Using poverty maps to improve the design of household surveys: the evidence from Tunisia
Betti G, Molini V and Pavelesku D
In this paper we propose a new method for improving the design effect of household surveys based on a two-stage design in which the first-stage clusters, or Primary Selection Units (PSUs), are stratified along administrative boundaries. Improvement of the design effect can result in more precise survey estimates (smaller standard errors and confidence intervals) or in a reduction of the necessary sample size, i.e. a reduction in the budget needed for a survey. The proposed method is based on the availability of previously constructed poverty maps, i.e. spatial descriptions of the distribution of per capita consumption expenditures, that are finely disaggregated in small geographic units, such as cities, municipalities, districts or other administrative partitions of a country that are directly linked to PSUs. Such information is then used to select PSUs with systematic sampling by introducing further implicit stratification in the survey design, so as to maximise the improvement of the design effect. Since per capita consumption expenditures estimated at PSU level from the poverty mapping are affected by (small) standard errors, in the paper we also perform a simulation study in order to take this additional variability into account.
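Selecting PSUs systematically from a list ordered by a poverty-map estimate (often called implicit stratification) can be sketched as follows. The sketch assumes equal-probability selection within a single explicit stratum; the paper's actual design may differ, for example by using selection proportional to size. The field names and the toy data are hypothetical.

```python
import random

def systematic_sample(psus, n, key, rng):
    """Sort PSUs by `key`, then take n units at a fixed interval from a random start."""
    ordered = sorted(psus, key=key)
    step = len(ordered) / n                  # sampling interval (may be fractional)
    start = rng.random() * step              # random start within the first interval
    last = len(ordered) - 1
    return [ordered[min(int(start + j * step), last)] for j in range(n)]

# Toy frame (assumption): 100 PSUs with a poverty-map estimate of per capita consumption.
rng = random.Random(7)
consumption = list(range(100))
rng.shuffle(consumption)
psus = [{"id": i, "consumption": c} for i, c in enumerate(consumption)]

sample = systematic_sample(psus, n=10, key=lambda p: p["consumption"], rng=rng)
print(sorted(p["consumption"] for p in sample))
```

Because the list is ordered by consumption, the fixed interval forces the sample to spread over the whole consumption distribution, which is the mechanism by which the ordering can reduce the design effect.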
Using sentiment analysis to evaluate the impact of the COVID-19 outbreak on Italy's country reputation and stock market performance
Zammarchi G, Mola F and Conversano C
During the recent Coronavirus disease 2019 (COVID-19) outbreak, the microblogging service Twitter has been widely used to share opinions and reactions to events. Italy was one of the first European countries to be severely affected by the outbreak and to establish lockdown and stay-at-home orders, potentially leading to country reputation damage. We use sentiment analysis to investigate changes in opinions about Italy reported on Twitter before and after the COVID-19 outbreak. Using different lexicon-based methods, we find a breakpoint corresponding to the date of the first established case of COVID-19 in Italy that causes a relevant change in the sentiment scores used as a proxy of the country's reputation. Next, we demonstrate that sentiment scores about Italy are associated with the values of the FTSE-MIB index, the main index of the Italian Stock Exchange, as they serve as early detection signals of changes in the values of the FTSE-MIB. Lastly, we evaluate whether different machine learning classifiers were able to determine the polarity of tweets posted before and after the outbreak with different levels of accuracy.
Optimal two-stage spatial sampling design for estimating critical parameters of SARS-CoV-2 epidemic: Efficiency versus feasibility
Alleva G, Arbia G, Falorsi PD, Nardelli V and Zuliani A
The COVID-19 pandemic presents an unprecedented clinical and healthcare challenge for the many medical researchers who are attempting to prevent its worldwide spread. It also presents a challenge for statisticians involved in designing appropriate sampling plans to estimate the crucial parameters of the pandemic. These plans are necessary for monitoring and surveillance of the phenomenon and for evaluating health policies. In this respect, we can use spatial information and aggregate data regarding the number of verified infections (either hospitalized or in compulsory quarantine) to improve the standard two-stage sampling design broadly adopted for studying human populations. We present an optimal spatial sampling design based on spatially balanced sampling techniques. We prove its relative performance analytically in comparison to other competing sampling plans, and we also study its properties through a series of Monte Carlo experiments. Considering the optimal theoretical properties of the proposed sampling plan and its feasibility, we discuss suboptimal designs that approximate optimality well and are more readily applicable.
Perceived climate change risk and global green activism among young people
D'Uggento AM, Piscitelli A, Ribecco N and Scepi G
In recent years, the increasing number of natural disasters has raised concerns about the sustainability of our planet's future. As young people comprise the generation that will suffer from the negative effects of climate change, they have become involved in a new climate activism that is also gaining interest in the public debate thanks to the Fridays for Future (FFF) movement. This paper analyses the results of a survey of 1,138 young people in a southern Italian region to explore their perceptions of the extent of environmental problems and their participation in protests of green movements such as the FFF. The statistical analysis uses an ordinal classification tree with an original impurity measure that considers both the ordinal nature of the response variable and the heterogeneity of its ordered categories. The results show that respondents are concerned about the threat of climate change and participate in the FFF to claim their right to a healthier planet and encourage people to adopt environmentally friendly practices in their lifestyles. Young people feel they are global citizens, connected through the Internet and social media, and show greater sensitivity to the planet's environmental problems, so they are willing to take effective action to demand sustainable policies from decision-makers. When planning public policies that will affect future generations, it is important for policymakers to know the demands and opinions of key stakeholders, especially young people, in order to plan the most appropriate measures, such as climate change mitigation.
Hierarchical Bayes small area estimation for county-level health prevalence to having a personal doctor
Andreea L E, Jianzhu L, Tom K and Machell T
The complexity of survey data and the availability of data from auxiliary sources motivate researchers to explore estimation methods that extend beyond traditional survey-based estimation. The U.S. Centers for Disease Control and Prevention's Behavioral Risk Factor Surveillance System (BRFSS) collects a wide range of health information, including whether respondents have a personal doctor. While the BRFSS focuses on state-level estimation, there is demand for county-level estimation of health indicators using BRFSS data. A hierarchical Bayes small area estimation model is developed to combine county-level BRFSS survey data with county-level data from auxiliary sources, while accounting for various sources of error and nested geographical levels. To mitigate extreme proportions and unstable survey variances, a transformation is applied to the survey data. Model-based county-level predictions are constructed for prevalence of having a personal doctor for all the counties in the U.S., including those where BRFSS survey data were not available. An evaluation study using only the counties with large BRFSS sample sizes to fit the model versus using all the counties with BRFSS data to fit the model is also presented.
The relative importance of ability, luck and motivation in team sports: a Bayesian model of performance in the English Rugby Premiership
Fioravanti F, Delbianco F and Tohmé F
Results in contact sports like rugby are mainly interpreted in terms of the ability and/or luck of teams. But this neglects the important role of the motivation of players, reflected in the effort exerted in the game. Here we present a Bayesian hierarchical model to infer the main features that explain score differences in rugby matches of the English Premiership Rugby 2020/2021 season. The main result is that, indeed, motivation (seen as a ratio between the number of tries and the scoring kick attempts) is highly relevant to explain outcomes in those matches.
A new measure for the attitude to mobility of Italian students and graduates: a topological data analysis approach
Vittorietti M, Giambalvo O, Genova VG and Aiello F
Students' and graduates' mobility is an interesting topic of discussion, especially for the Italian education system and universities. The main reasons for migration, and for the so-called brain drain, can be found in the socio-economic context and in the well-known North-South divide. Measuring mobility and understanding its dynamics over time and space are not trivial tasks. Most studies in the related literature focus on the determinants of this phenomenon; in this paper, instead, we combine tools from graph theory and Topological Data Analysis to propose a new measure for the attitude to mobility. Each mobility trajectory is represented by a graph, and the importance of the features constituting the graph is evaluated over time using persistence diagrams. The attitude to mobility of the students is then ranked by computing the distance between the individual persistence diagram and the theoretical persistence diagram of the stayer student. The new approach is used to evaluate the mobility of the students who enrolled in an Italian university in 2008. The relation between attitude to mobility and the main socio-demographic variables is investigated.
Quantile regression for count data: jittering versus regression coefficients modelling in the analysis of credits earned by university students after remote teaching
Carcaiso V and Grilli L
The extension of quantile regression to count data raises several issues. We compare the traditional approach, based on transforming the count variable using jittering, with a recently proposed approach in which the coefficients of quantile regression are modelled by parametric functions. We exploit both methods to analyse university students' data to evaluate the effect of emergency remote teaching due to COVID-19 on the number of credits earned by the students. The coefficients modelling approach performs a smoothing that is especially convenient in the tails of the distribution, preventing abrupt changes in the point estimates and increasing precision. Nonetheless, model selection is challenging because of the wide range of options and the limited availability of diagnostic tools. Thus the jittering approach remains fundamental to guide the choice of the parametric functions.
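The jittering idea can be illustrated in its simplest, covariate-free form. A count Y is smoothed by adding uniform noise on [0, 1), a quantile of the jittered variable Z = Y + U is computed and averaged over repeated jitters, and the result is mapped back to the count scale through ceil(Q_Z - 1). This is a hedged sketch of the general technique, not the regression models compared in the paper; the function name, the simple order-statistic quantile and the toy credit counts are assumptions.

```python
import math
import random

def jittered_count_quantile(counts, tau, n_jitter=200, seed=0):
    """Average-jittering estimate of the tau-th quantile of a count variable."""
    rng = random.Random(seed)
    n = len(counts)
    total = 0.0
    for _ in range(n_jitter):
        z = sorted(y + rng.random() for y in counts)   # Z = Y + U, with U ~ Uniform[0, 1)
        total += z[min(n - 1, int(tau * n))]           # simple order-statistic quantile of Z
    q_z = total / n_jitter                             # average over the jittered samples
    return q_z, math.ceil(q_z - 1)                     # map back to the count scale

# Hypothetical numbers of credits earned by ten students.
credits = [0, 0, 0, 0, 5, 5, 5, 5, 5, 5]
q_z_low, q_low = jittered_count_quantile(credits, tau=0.1)
q_z_high, q_high = jittered_count_quantile(credits, tau=0.9)
print("tau = 0.1:", q_low, " tau = 0.9:", q_high)
```

Averaging over many jitters reduces the extra noise that the artificial smoothing introduces; in the regression setting the same averaging is applied to the estimated coefficients.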
When does morbidity start? An analysis of changes in morbidity between 2013 and 2019 in Italy
Pastore A, Tonellato SF, Aliverti E and Campostrini S
Morbidity is one of the key aspects in assessing populations' well-being. In particular, chronic diseases negatively affect the quality of life in old age and increase the risk that the years added to lives are years of disability and illness. Novel analyses, interventions and policies are required to understand and potentially mitigate this issue. In this article, we investigate whether a compression of morbidity has taken place in Italy in recent years, in parallel with an increase in life expectancy. Our analysis relies on large repeated cross-sectional data from the national surveillance system PASSI, providing deep insights into the evolution of morbidity together with other socio-demographic variables. In addition, we investigate differences in morbidity across subgroups, focusing on disparities by gender, level of education and economic difficulties, and assessing the evolution of these differences across the period 2013-2019.
When did coronavirus arrive in Europe?
Cerqua A and Di Stefano R
The first cluster of coronavirus cases in Europe was officially detected on 21st February 2020 in Northern Italy, although recent evidence has shown sporadic first cases in Europe since the end of 2019. In this study, we test for the presence of coronavirus in Italy and, even more importantly, we assess whether the virus had already spread before 21st February. We use a counterfactual approach and certified daily data on the number of deaths (deaths from any cause, not only related to coronavirus) at the municipality level. Our estimates confirm that coronavirus began spreading in Northern Italy in mid-January.
Support provided by elderly in Italy: a hierarchical analysis of ego networks controlling for alter-overlapping
Pelle E, Zaccarin S, Furfaro E and Rivellini G
Providing support outside the household can be considered an actual sign of an active social life for the elderly. Adopting an ego-network perspective, we study the support Italian elders provide to kin or non-kin. More specifically, using Italian survey data, we build the ego-centered networks of social contacts that elders maintain and the ego-networks of support that elders provide to non-cohabitant kin or non-kin. Since ego-network data are inherently multilevel, we use Bayesian multilevel models to analyze variation in support ties, controlling for the characteristics of elders and their contacts. This modeling strategy makes it possible to deal with sparseness and alter-alter overlap in the ego support network data and to disentangle the effects related to the ego (the elder), the ego-alter dyad, the kind of support provided, as well as social contacts and contextual variables. The results suggest that the elderly in Italy who provide support outside their household, compared to all elders in the sample, are younger, healthier, more educated, and embedded in a more diversified ego-network of social contacts. The latter also conveys both the type and the recipient of the support, with the elderly who maintain few relationships with kin being more prone to provide aid to non-kin. Further, a "peer homophily" effect in directing elder support to non-kin is also found.
Online network monitoring
Malinovskaya A and Otto P
An important problem in network analysis is the online detection of anomalous behaviour. In this paper, we introduce a network surveillance method bringing together network modelling and statistical process control. Our approach is to apply multivariate control charts based on exponential smoothing and cumulative sums in order to monitor networks generated by temporal exponential random graph models (TERGM). The latter allows us to account for temporal dependence while simultaneously reducing the number of parameters to be monitored. The performance of the considered charts is evaluated by calculating the average run length and the conditional expected delay for both simulated and real data. To justify the decision of using the TERGM to describe network data, some measures of goodness of fit are inspected. We demonstrate the effectiveness of the proposed approach by an empirical application, monitoring daily flights in the United States to detect anomalous patterns.
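The two chart families named above can be sketched in univariate form. The paper applies multivariate versions to TERGM parameter estimates; here, as a simplifying assumption, a single hypothetical network statistic with known in-control mean and standard deviation is monitored. The parameter defaults (lam, L, k, h) are conventional textbook choices, not values taken from the paper.

```python
def ewma_alarm(xs, mu0, sigma, lam=0.2, L=3.0):
    """Index of the first EWMA alarm, or None if the chart never signals."""
    z = mu0
    for t, x in enumerate(xs, start=1):
        z = lam * x + (1 - lam) * z                      # exponentially weighted average
        var = sigma ** 2 * lam / (2 - lam) * (1 - (1 - lam) ** (2 * t))
        if abs(z - mu0) > L * var ** 0.5:                # time-varying control limit
            return t - 1
    return None

def cusum_alarm(xs, mu0, sigma, k=0.5, h=5.0):
    """Index of the first two-sided tabular CUSUM alarm, or None."""
    upper = lower = 0.0
    for t, x in enumerate(xs):
        u = (x - mu0) / sigma                            # standardised observation
        upper = max(0.0, upper + u - k)                  # accumulates upward shifts
        lower = max(0.0, lower - u - k)                  # accumulates downward shifts
        if upper > h or lower > h:
            return t
    return None

# Hypothetical stream: in control for 20 steps, then a mean shift of two standard deviations.
stream = [0.0] * 20 + [2.0] * 20
ewma_index = ewma_alarm(stream, mu0=0.0, sigma=1.0)
cusum_index = cusum_alarm(stream, mu0=0.0, sigma=1.0)
print("EWMA alarm at index", ewma_index, "and CUSUM alarm at index", cusum_index)
```

The gap between the change point and the alarm index corresponds to the detection delay, which the abstract evaluates through the average run length and the conditional expected delay.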
Semiautomatic robust regression clustering of international trade data
Torti F, Riani M and Morelli G
The purpose of this paper is to show, in regression clustering, how to choose the most relevant solutions, analyze their stability, and provide information about the best combinations of the optimal number of groups, the restriction factor among the error variances across groups and the level of trimming. The procedure is based on two steps. First, we generalize the information criteria of constrained robust multivariate clustering to the case of clustering weighted models. Unlike traditional approaches, which choose the best solution by minimizing an information criterion (e.g. BIC), we concentrate on the so-called optimal stable solutions. In the second step, using the monitoring approach, we select the best value of the trimming factor. Finally, we validate the solution using a confirmatory forward search approach. A motivating example based on a novel dataset concerning the European Union trade of face masks shows the limitations of the existing procedures. The suggested approach is initially applied to a set of well-known datasets in the literature on robust regression clustering. Then, we focus on a set of international trade datasets and provide a novel informative way of updating the subset in the random start approach. The Supplementary material, in the spirit of the Special Issue, deepens the analysis of trade data and compares the suggested approach with the existing ones available in the literature.
Bayesian dynamic network actor models with application to South Korean COVID-19 patient movement data
Arrizza AM and Caimo A
Motivated by the ongoing COVID-19 pandemic, this article introduces Bayesian dynamic network actor models for the analysis of infected individuals' movements in South Korea during the first three months of 2020. The relational event data modelling framework makes use of network statistics capturing the structure of movement events from and to several of the country's municipalities. The fully probabilistic Bayesian approach makes it possible to quantify the uncertainty associated with the relational tendencies explaining where and when movement events are established and where they are directed. The observed patterns of patient movements at an early stage of the pandemic can provide interesting insights into the spread of the disease in South Korea.
Special issue on statistical analysis of networks: Preface by the guest editors
Schweinberger M, Stingo FC and Vitale MP
The special issue on Statistical Analysis of Networks aspires to convey the breadth and depth of statistical learning with networks, ranging from networks that are observed to networks that are unobserved and learned from data. It includes ten select papers with methodological and theoretical advances, and demonstrates the usefulness of the network paradigm by applications to current problems.
Weighted stochastic block model
Ng TLJ and Murphy TB
We propose a weighted stochastic block model (WSBM) which extends the stochastic block model to the important case in which edges are weighted. We address the parameter estimation of the WSBM by use of maximum likelihood and variational approaches, and establish the consistency of these estimators. The problem of choosing the number of classes in a WSBM is addressed. The proposed model is applied to simulated data and an illustrative data set.
A COVINDEX based on a GAM beta regression model with an application to the COVID-19 pandemic in Italy
Scrucca L
Changes in COVID-19 disease transmission over time are a key indicator of epidemic growth. Near real-time monitoring of the pandemic growth is crucial for policy makers and public health officials who need to make informed decisions about whether to enforce lockdowns or allow certain activities. The effective reproduction number is the standard index used in many countries for this purpose. However, it is known that, due to the delays between infection and case registration, its use for decision making is somewhat limited. In this paper a near real-time COVINDEX is proposed for monitoring the evolution of the pandemic. The index is computed from predictions obtained from a GAM beta regression modelling the test positive rate as a function of time. The proposal is illustrated using data on the COVID-19 pandemic in Italy and compared with the effective reproduction number. A simple chart is also proposed for monitoring local and national outbreaks by policy makers and public health officials.
Bayesian graphical models for modern biological applications
Ni Y, Baladandayuthapani V, Vannucci M and Stingo FC
Graphical models are powerful tools that are regularly used to investigate complex dependence structures in high-throughput biomedical datasets. They allow for a holistic, systems-level view of the various biological processes, and for intuitive and rigorous understanding and interpretation. In the context of large networks, Bayesian approaches are particularly suitable because they encourage sparsity of the graphs, incorporate prior information, and, most importantly, account for uncertainty in the graph structure. These features are particularly important in applications with limited sample size, including genomics and imaging studies. In this paper, we review several recently developed techniques for the analysis of large networks under non-standard settings, including but not limited to multiple graphs for data observed from multiple related subgroups, graphical regression approaches used for the analysis of networks that change with covariates, and other complex sampling and structural settings. We also illustrate the practical utility of some of these methods using examples in cancer genomics and neuroimaging.