Statistical Inferences for Missing Response Problems Based on Modified Empirical Likelihood
In this paper, we advance the application of empirical likelihood (EL) to missing response problems. Inspired by remedies for the shortcomings of EL in parameter hypothesis testing, we modify the EL approach used for statistical inference on the mean response when the response is subject to missingness. We propose consistent mean estimators and associated confidence intervals. We extend the approach to estimate the average treatment effect in causal inference settings. We detail the analogous estimators for the average treatment effect, prove their consistency, and illustrate their use by estimating the average effect of smoking on renal function in patients with atherosclerotic renal-artery stenosis and elevated blood pressure, chronic kidney disease, or both. Our proposed estimators outperform the historical mean estimators under missing responses and in causal inference settings in terms of average simulated relative RMSE and coverage probability.
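The paper's specific EL modification is not spelled out in the abstract; as a rough, hypothetical illustration of the empirical-likelihood machinery it builds on, the Python sketch below profiles the standard EL log-likelihood ratio for the mean of the observed (complete-case) responses, with the Lagrange multiplier obtained by root-finding. Function names and the bracketing strategy are ours, not the authors'.

```python
import numpy as np
from scipy.optimize import brentq

def el_log_ratio(y, mu):
    """Standard empirical log-likelihood ratio statistic -2 log R(mu) for the mean,
    computed from the observed responses y (illustrative sketch only)."""
    d = y - mu
    if d.min() >= 0 or d.max() <= 0:        # mu must lie strictly inside the data range
        return np.inf
    n = len(y)
    score = lambda lam: np.sum(d / (1.0 + lam * d))   # derivative of the dual problem
    lo = (1.0 / n - 1.0) / d.max()          # endpoints keep all 1 + lam * d_i > 0
    hi = (1.0 / n - 1.0) / d.min()
    lam = brentq(score, lo, hi)             # assumes a sign change, true for typical samples
    return 2.0 * np.sum(np.log1p(lam * d))
```

A confidence interval for the mean response is then obtained by inverting the statistic, i.e., collecting all mu with el_log_ratio(y, mu) below the appropriate chi-squared quantile; the modified EL of the paper alters this basic recipe.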
Finite mixtures of mean-parameterized Conway-Maxwell-Poisson models
For modeling count data, the Conway-Maxwell-Poisson (CMP) distribution is a popular generalization of the Poisson distribution due to its ability to characterize data over- or under-dispersion. While the classic parameterization of the CMP has been well studied, its main drawback is that it does not directly model the mean of the counts. This is mitigated by using a mean-parameterized version of the CMP distribution. In this work, we are concerned with the setting where count data may be comprised of subpopulations, each possibly having varying degrees of data dispersion. Thus, we propose a finite mixture of mean-parameterized CMP distributions. An EM algorithm is constructed to perform maximum likelihood estimation of the model, while bootstrapping is employed to obtain estimated standard errors. A simulation study demonstrates the flexibility of the proposed mixture model relative to mixtures of Poissons and mixtures of negative binomials. An analysis of dog mortality data is presented.
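As a hypothetical sketch of the ingredients involved (not the authors' code), the Python snippet below evaluates a mean-parameterized CMP log-pmf by solving for the rate that matches a target mean, and forms the E-step responsibilities of a mixture; the truncation level and bracketing are illustrative assumptions suitable for moderate means.

```python
import numpy as np
from scipy.optimize import brentq

def _log_weights(lam, nu, kmax):
    ks = np.arange(kmax + 1)
    log_fact = np.cumsum(np.log(np.maximum(ks, 1)))    # log(k!)
    return ks, ks * np.log(lam) - nu * log_fact

def cmp_logpmf(x, mu, nu, kmax=500):
    """Mean-parameterized CMP log-pmf: find the rate lambda whose mean equals mu."""
    def mean_gap(lam):
        ks, logw = _log_weights(lam, nu, kmax)
        w = np.exp(logw - logw.max())
        return (ks * w).sum() / w.sum() - mu
    lam = brentq(mean_gap, 1e-10, 1e6)                  # the mean is increasing in lambda
    ks, logw = _log_weights(lam, nu, kmax)
    logZ = np.logaddexp.reduce(logw)
    return logw[np.asarray(x, dtype=int)] - logZ        # assumes all counts <= kmax

def e_step(x, pis, mus, nus):
    """Posterior component probabilities (responsibilities) in the EM algorithm."""
    logp = np.column_stack([np.log(p) + cmp_logpmf(x, m, n)
                            for p, m, n in zip(pis, mus, nus)])
    logp -= logp.max(axis=1, keepdims=True)
    r = np.exp(logp)
    return r / r.sum(axis=1, keepdims=True)
```

The M-step would then update the mixing proportions from the responsibilities and numerically maximize the weighted log-likelihood over each component's mean and dispersion, with bootstrap resampling of the data used for standard errors.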
A method of correction for heaping error in the variables using validation data
When self-reported data are used in statistical analysis to estimate the mean and variance, as well as regression parameters, the estimates tend, in many cases, to be biased. This is because interviewees have a tendency to heap their answers at certain values. The aim of the paper is to examine the bias-inducing effect of the heaping error in self-reported data and to study the effect of the heaping error on the mean and variance of a distribution as well as on the regression parameters. As a result, a new method is introduced to correct the bias due to the heaping error using validation data. Using publicly available data and simulation studies, we show that the newly developed method is practical and can easily be applied to correct the bias in the estimated mean and variance, as well as in the estimated regression parameters computed from self-reported data. Hence, the method of correction presented in this paper allows researchers to draw accurate conclusions leading to the right decisions, e.g. regarding health care planning and delivery.
On the use of historical estimates
The use of historical, i.e., already existing, estimates in current studies is common in a wide variety of application areas. Nevertheless, despite their routine use, the uncertainty associated with historical estimates is rarely properly accounted for in the analysis. In this communication, we review common practices and then provide a mathematical formulation and a principled frequentist methodology for addressing the problem of drawing inferences in the presence of historical estimates. Three distinct variants are investigated in detail; the corresponding limiting distributions are found and compared. The design of future studies, given historical data, is also explored, and relations with a variety of other well-studied statistical problems are discussed.
Statistical analysis and first-passage-time applications of a lognormal diffusion process with multi-sigmoidal logistic mean
We consider a lognormal diffusion process having a multisigmoidal logistic mean, useful for modeling the evolution of a population that reaches its maximum growth level after several stages. Regarding statistical inference, two procedures for finding the maximum likelihood estimates of the unknown parameters are described. One is based on solving the system of critical points of the likelihood function, and the other on maximizing the likelihood function with the simulated annealing algorithm. A simulation study validating the described estimation strategies is also presented, together with a real application to epidemiological data. Special attention is also devoted to the first-passage-time problem of the considered diffusion process through a fixed boundary.
Multiple change point detection and validation in autoregressive time series data
It is quite common for the structure of a time series to change abruptly. Identifying these change points and describing the model structure in the segments between them is of interest. In this paper, time series data are modelled assuming each segment is an autoregressive time series with possibly different autoregressive parameters. This is achieved in two main steps. The first step uses a likelihood ratio scan based estimation technique to identify potential change points and segment the time series. Once these potential change points are identified, modified parametric spectral discrimination tests are used to validate the proposed segments. A numerical study is conducted to demonstrate the performance of the proposed method across various scenarios and to compare it against other contemporary techniques.
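As a minimal, hypothetical illustration of a likelihood ratio scan (a simplified stand-in for the paper's procedure), the sketch below compares a single zero-mean AR(1) fit on a sliding window against separate fits on its two halves; local maxima of the scan statistic flag candidate change points.

```python
import numpy as np

def ar1_loglik(x):
    """Gaussian log-likelihood of a zero-mean AR(1) segment fitted by conditional least squares."""
    y, z = x[1:], x[:-1]
    phi = np.dot(z, y) / np.dot(z, z)
    resid = y - phi * z
    sigma2 = resid @ resid / len(resid)
    return -0.5 * len(resid) * (np.log(2 * np.pi * sigma2) + 1.0)

def lr_scan(x, h):
    """Likelihood ratio scan statistic with window half-width h."""
    n = len(x)
    stat = np.full(n, np.nan)
    for t in range(h, n - h):
        left, right, full = x[t - h:t], x[t:t + h], x[t - h:t + h]
        stat[t] = 2.0 * (ar1_loglik(left) + ar1_loglik(right) - ar1_loglik(full))
    return stat
```

The validation step described in the abstract then applies spectral discrimination tests to adjacent candidate segments, which is not reproduced in this sketch.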
Using SeDuMi to find various optimal designs for regression models
We introduce a powerful yet seldom used numerical approach in statistics for solving a broad class of optimization problems where the search space is discretized. This optimization tool is widely used in engineering for solving semidefinite programming (SDP) problems and is called SeDuMi (self-dual minimization). We focus on optimal design problems and demonstrate how to formulate A-, A_s-, c-, I-, and L-optimal design problems as SDP problems and show how they can be effectively solved by SeDuMi in MATLAB. We also show the numerical approach is flexible by applying it to find optimal designs based on the weighted least squares estimator or when there are constraints on the weight distribution of the sought optimal design. For approximate designs, the optimality of the SDP-generated designs can be verified using the Kiefer-Wolfowitz equivalence theorem. SDP also finds optimal designs for nonlinear regression models commonly used in social and biomedical research. Several examples are presented for linear and nonlinear models.
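SeDuMi itself is a MATLAB solver; purely as a hypothetical illustration of casting an optimal design problem as a convex, semidefinite-representable program on a discretized design space, the Python/cvxpy sketch below computes an approximate A-optimal design for a quadratic model on [-1, 1]. The model, grid, and solver choice are our assumptions, not the paper's setup.

```python
import numpy as np
import cvxpy as cp

# Discretized design space and regression vectors f(x) = (1, x, x^2).
xs = np.linspace(-1.0, 1.0, 101)
F = np.column_stack([np.ones_like(xs), xs, xs**2])

w = cp.Variable(len(xs), nonneg=True)                    # design weights on the grid
M = F.T @ cp.diag(w) @ F                                 # information matrix of the design
objective = cp.Minimize(cp.matrix_frac(np.eye(3), M))    # trace(M^{-1}), i.e. A-optimality
problem = cp.Problem(objective, [cp.sum(w) == 1])
problem.solve(solver=cp.SCS)                             # SeDuMi plays the analogous role in MATLAB

support = xs[w.value > 1e-3]                             # approximate support points of the design
```

c-, I-, and L-optimality lead to similar semidefinite-representable objectives, and the Kiefer-Wolfowitz equivalence theorem can be checked on the grid to verify the returned design.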
Penalized and constrained LAD estimation in fixed and high dimension
Recently, a growing body of literature has shown that prior information and structure in many application fields can be formulated as constraints on regression coefficients. Following this work, we propose a penalized LAD estimator with linear constraints in this paper. Unlike the constrained lasso, our estimator performs well when heavy-tailed errors or outliers are present in the response. In theory, we show that the proposed estimator enjoys the oracle property with adjusted normal variance when the dimension of the estimated coefficients is fixed. When the dimension is much greater than the sample size, the proposed estimator attains a sharper error bound than that of the constrained lasso. It is worth noting that this result holds for a wide range of noise distributions, even the Cauchy distribution. Algorithmically, we not only consider a standard linear programming formulation to compute the proposed estimator in the fixed-dimension case, but also present a nested alternating direction method of multipliers (ADMM) for the high-dimensional case. Simulations and an application to real data confirm that the proposed estimator is an effective alternative when the constrained lasso is unreliable.
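For the fixed-dimension case, constrained LAD is a standard linear program; the sketch below (our illustration, without the penalty term) encodes the absolute residuals with nonnegative slack variables and solves the problem with scipy's HiGHS interface.

```python
import numpy as np
from scipy.optimize import linprog

def constrained_lad(X, y, C=None, d=None):
    """min_b ||y - X b||_1 subject to optional linear constraints C b <= d,
    solved as an LP with variables (b, u, v) where y - X b = u - v and u, v >= 0."""
    n, p = X.shape
    cost = np.concatenate([np.zeros(p), np.ones(2 * n)])       # objective: sum of u + v
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])               # X b + u - v = y
    A_ub, b_ub = None, None
    if C is not None:
        A_ub = np.hstack([C, np.zeros((C.shape[0], 2 * n))])   # constraints act on b only
        b_ub = d
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=bounds, method="highs")
    return res.x[:p]
```

An L1 penalty on the coefficients can be absorbed into the same LP by splitting b into positive and negative parts; the nested ADMM of the paper targets the high-dimensional case, where such LPs become expensive.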
Kronecker delta method for testing independence between two vectors in high-dimension
Conventional methods for testing independence between two Gaussian vectors require sample sizes greater than the number of variables in each vector. Therefore, adjustments are needed in the high-dimensional situation, where the sample size is smaller than the number of variables in at least one of the compared vectors. It is critical to emphasize that the methods available in the literature are unable to control the Type I error probability below the nominal level. This fact is evidenced through an intensive simulation study presented in this paper. To fill this gap, we introduce a valid randomized test based on the Kronecker delta covariance matrix estimator. As an empirical application, based on a sample of companies listed on the stock exchange of Brazil, we test the independence between returns of stocks from different sectors in the context of the COVID-19 pandemic.
Asymptotic analysis of reliability measures for an imperfect dichotomous test
To assess the reliability of a new dichotomous test and to capture the random variability of its results in the absence of a gold standard, two measures, the inconsistent acceptance probability (IAP) and the inconsistent rejection probability (IRP), were introduced in the literature. In this paper, we first analyze the limiting behavior of both measures as the number of test repetitions increases and derive the corresponding accuracy estimates and rates of convergence. To overcome possible limitations of IRP and IAP, we then introduce a one-parameter family of refined reliability measures. These measures characterize the consistency of the results of a dichotomous test in the absence of a gold standard as the threshold for a positive aggregate test result varies. As for IRP and IAP, we also derive the corresponding accuracy estimates and rates of convergence for these refined measures as the number of test repetitions increases.
Testing for equality of distributions using the concept of (niche) overlap
In this paper, we propose a new non-parametric test for equality of distributions. The test is based on the recently introduced measure of (niche) overlap and its rank-based estimator. As the estimator makes only one basic assumption on the underlying distributions, namely continuity, the test is universally applicable, in contrast to many tests that are restricted to specific scenarios. By construction, the new test is capable of detecting differences in location and scale. It thus complements the large class of rank-based tests that are constructed from the non-parametric relative effect. In simulations, the new test procedure obtained higher power and lower type I error than two common tests in several settings, and it shows overall good performance. Together with its simplicity, this makes the test broadly applicable.
Copula-based measures of asymmetry between the lower and upper tail probabilities
We propose a copula-based measure of asymmetry between the lower and upper tail probabilities of bivariate distributions. The proposed measure has a simple form and possesses desirable properties as a measure of asymmetry. The limit of the proposed measure as the index goes to the boundary of its domain can be expressed in a simple form under certain conditions on the copula. A sample analogue of the proposed measure for a sample from a copula is presented, and its weak convergence to a Gaussian process is shown. Another sample analogue of the presented measure, based on a sample from a general bivariate distribution, is given. Simple methods for interval and region estimation are presented. A simulation study is carried out to investigate the performance of the proposed sample analogues and the interval estimation methods. As an example, the presented measure is applied to daily returns of the S&P500 and the Nikkei225. A trivariate extension of the proposed measure and its sample analogue are briefly discussed.
Epidemic changepoint detection in the presence of nuisance changes
Many time series problems feature epidemic changes: segments where a parameter deviates from a background baseline. Detection of such changepoints can be improved by accounting for the epidemic structure, but this is currently difficult if the background level is unknown. Furthermore, in practical data the background often undergoes nuisance changes, which interfere with standard estimation techniques and appear as false alarms. To solve these issues, we develop a new, efficient approach, based on a penalised cost, to simultaneously detect epidemic changes and estimate the unknown, but fixed, background level. Using it, we build a two-level detector that models and separates nuisance and signal changes. The analytic and computational properties of the proposed methods are established, including consistency and convergence. We demonstrate via simulations that our two-level detector provides accurate estimation of changepoints under a nuisance process, while other state-of-the-art detectors fail. In real-world genomic and demographic datasets, the proposed method identified and localised target events while separating out seasonal variations and experimental artefacts.
Compositional cubes: a new concept for multi-factorial compositions
Compositional data are commonly known as multivariate observations carrying relative information. Even though the case of vector or even two-factorial compositional data (compositional tables) is already well described in the literature, there is still a need for a comprehensive approach to the analysis of multi-factorial relative-valued data. Therefore, this contribution builds on the current knowledge about compositional data to develop a general theoretical framework for multi-factorial compositional data. As a main finding, it turns out that, as in the case of compositional tables, multi-factorial structures can be orthogonally decomposed into an independent part and several interactive parts; moreover, a coordinate representation allowing for their separate analysis by standard analytical methods can be constructed. For the sake of simplicity, these features are explained in detail for the case of three-factorial compositions (compositional cubes), followed by an outline covering the general case. The three-dimensional structure is analyzed in depth in two practical examples, dealing with systems of spatial and time-dependent compositional cubes. The methodology is implemented in the R package robCompositions.
Maximum likelihood estimation under the Emax model: existence, geometry and efficiency
This study focuses on the estimation of the Emax dose-response model, a widely utilized framework in clinical trials and in experiments in pharmacology, agriculture, environmental science, and more. Existing challenges in obtaining maximum likelihood estimates (MLEs) for the model parameters are often ascribed to computational issues but, in reality, stem from the absence of an MLE. Our contribution provides new understanding and control of all the experimental situations that practitioners might face, guiding them in the estimation process. We derive the exact MLE for a three-point experimental design and identify the two scenarios where the MLE fails to exist. To address these challenges, we propose utilizing Firth's modified score, which we express analytically as a function of the experimental design. Through a simulation study, we demonstrate that the Firth modification yields a finite estimate in one of the problematic scenarios. For the remaining case, we introduce a design-augmentation strategy akin to a hypothesis test.
Local linear smoothing for regression surfaces on the simplex using Dirichlet kernels
This paper introduces a local linear smoother for regression surfaces on the simplex. The estimator solves a least-squares regression problem weighted by a locally adaptive Dirichlet kernel, ensuring good boundary properties. Asymptotic results for the bias, variance, mean squared error, and mean integrated squared error are derived, generalizing the univariate results of Chen (Ann Inst Stat Math, 54(2):312-323, 2002). A simulation study shows that the proposed local linear estimator with Dirichlet kernel outperforms its only direct competitor in the literature, the Nadaraya-Watson estimator with Dirichlet kernel due to Bouzebda et al. (AIMS Math 9(9):26195-26282, 2024).
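A hypothetical sketch of the estimator's basic recipe (our notation, not the authors' code): at each evaluation point on the simplex, weight the observations by a Dirichlet kernel centred near that point and solve a weighted least-squares local linear fit; the intercept is the fitted surface value.

```python
import numpy as np
from scipy.stats import dirichlet

def local_linear_dirichlet(X, y, s, b):
    """Local linear estimate of the regression surface at s on the simplex.
    X : (n, d) compositions with positive parts summing to 1, y : (n,) responses,
    s : (d,) evaluation point in the open simplex, b : bandwidth > 0."""
    alpha = s / b + 1.0                                    # Dirichlet kernel parameters
    w = np.array([dirichlet.pdf(xi, alpha) for xi in X])   # kernel weights at the data points
    Z = np.column_stack([np.ones(len(y)), X[:, :-1] - s[:-1]])  # drop one part (sum-to-one)
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
    return beta[0]                                         # intercept = estimate of m(s)
```

Replacing Z by a column of ones only would give the Nadaraya-Watson estimator with the same Dirichlet weights, which is the competitor referenced in the abstract.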
On some problems of Bayesian region construction with guaranteed coverages
The general problem of constructing regions that have a guaranteed coverage probability for an arbitrary parameter of interest is considered. The regions developed are Bayesian in nature, and the coverage probabilities can be considered as Bayesian confidences with respect to the model obtained by integrating out the nuisance parameters using the conditional prior given the parameter of interest. Both the prior coverage probability and the prior probability of covering a false value (the accuracy) can be controlled by setting the sample size. These coverage probabilities are considered as a priori figures of merit concerning the reliability of a study, while the inferences quoted are Bayesian. Several problems are considered where confidence regions with desirable properties have proven difficult to obtain. For example, it is shown that the approach discussed never leads to improper regions, which has proven to be an issue for some confidence regions.
Discrimination between Gaussian process models: active learning and static constructions
The paper covers the design and analysis of experiments to discriminate between two Gaussian process models with different covariance kernels, such as those widely used in computer experiments, kriging, sensor location and machine learning. Two frameworks are considered. First, we study sequential constructions, where successive design (observation) points are selected, either as additions to an existing design or from the start of observation. The selection relies on the maximisation of the difference between the symmetric Kullback-Leibler divergences for the two models, which depends on the observations, or on the mean squared error of both models, which does not. Then, we consider static criteria, such as the familiar log-likelihood ratios and the Fréchet distance between the covariance functions of the two models. Other distance-based criteria, simpler to compute than the previous ones, are also introduced, for which, in the framework of approximate design, a necessary condition for the optimality of a design measure is provided. The paper includes a study of the mathematical links between the different criteria, and numerical illustrations are provided.
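As a small, hypothetical illustration of one ingredient of the sequential framework, the sketch below computes the symmetric Kullback-Leibler divergence between two zero-mean Gaussian models on a finite design and greedily picks the candidate point that maximises it; the paper's data-dependent criterion and the static criteria are richer than this prior-only version.

```python
import numpy as np

def sym_kl(K0, K1, jitter=1e-9):
    """Symmetric KL divergence between N(0, K0) and N(0, K1) on a finite design."""
    n = K0.shape[0]
    I = jitter * np.eye(n)
    a = np.trace(np.linalg.solve(K1 + I, K0 + I))
    b = np.trace(np.linalg.solve(K0 + I, K1 + I))
    return 0.5 * (a + b) - n                   # the log-determinant terms cancel

def next_design_point(design, candidates, kern0, kern1):
    """Greedy choice of the candidate maximising the symmetric KL on the augmented design."""
    def gram(kern, pts):
        return np.array([[kern(u, v) for v in pts] for u in pts])
    scores = [sym_kl(gram(kern0, design + [x]), gram(kern1, design + [x]))
              for x in candidates]
    return candidates[int(np.argmax(scores))]
```

Here design and candidates are lists of input points and kern0, kern1 are the two candidate covariance kernels; repeating the greedy step yields a sequential design.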
Osband's principle for identification functions
Given a statistical functional of interest such as the mean or median, a (strict) identification function is zero in expectation at (and only at) the true functional value. Identification functions are key objects in forecast validation, statistical estimation and dynamic modelling. For a possibly vector-valued functional of interest, we fully characterise the class of (strict) identification functions subject to mild regularity conditions.
A new approach for estimating VAR systems in the mixed-frequency case
In this paper we present a new estimation procedure, named MF-IVL, for VAR systems in the case of mixed-frequency data, where the data may be, e.g., stock or flow data. The main idea of this new procedure is to project the slow components on the present and past fast ones in order to create instrumental variables. This procedure is shown to be generically consistent. Our claim is that the procedure is fast and more accurate than the extended Yule-Walker procedure. A comparison of these two procedures is given by simulation.
Handling skewness and directional tails in model-based clustering
Model-based clustering is a powerful approach used in data analysis to unveil underlying patterns or groups within a data set. However, when applied to clusters that exhibit skewness, heavy tails, or both, the classification of data points becomes more challenging. In this study, we introduce two models, based on two component-wise transformations of the observed data, within a mixture of multiple scaled contaminated normal (MSCN) distributions. MSCN distributions are designed to enable a different tail behavior in each dimension and directional outlier detection in the direction of the principal components. Using the transformed MSCN distributions as components of a mixture, we obtain model-based clustering techniques that allow for 1) flexible cluster shapes in terms of skewness and kurtosis and 2) component-wise and directional outlier detection. We assess the efficacy of the proposed techniques by comparing them with model-based clustering methods that perform global or component-wise outlier detection, using simulated and real data sets. This comparative analysis aims to demonstrate in which practical clustering scenarios the proposed MSCN-based approaches are advantageous.
