EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT

Coefficient Lambda for Interrater Agreement Among Multiple Raters: Correction for Category Prevalence
Almehrizi RS
Fleiss's Kappa is an extension of Cohen's Kappa, developed to assess the degree of interrater agreement among multiple raters or methods classifying subjects using categorical scales. Like Cohen's Kappa, it adjusts the observed proportion of agreement to account for agreement expected by chance. However, over time, several paradoxes and interpretative challenges have been identified, largely stemming from the assumption of random chance agreement and the sensitivity of the coefficient to the number of raters. Interpreting Fleiss's Kappa can be particularly difficult due to its dependence on the distribution of categories and prevalence patterns. This paper argues that a portion of the observed agreement may be better explained by the interaction between category prevalence and inherent category characteristics, such as ambiguity, appeal, or social desirability, rather than by chance alone. By shifting away from the assumption of random rater assignment, the paper introduces a novel agreement coefficient that adjusts for the expected agreement by accounting for category prevalence, providing a more accurate measure of interrater reliability in the presence of imbalanced category distributions. It also examines the theoretical justification for this new measure, its interpretability, its standard error, and the robustness of its estimates in simulation and practical applications.
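The abstract does not reproduce the new Lambda coefficient's formula, but the Fleiss's Kappa baseline it modifies is standard. A minimal sketch, assuming an n-subjects by k-categories matrix of rating counts with a constant number of raters m per subject (all names here are illustrative, not the article's code):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss's kappa for an (n_subjects x k_categories) matrix of rating counts.

    Each row must sum to m, the number of raters per subject.
    """
    counts = np.asarray(counts, dtype=float)
    n, k = counts.shape
    m = counts.sum(axis=1)[0]            # raters per subject (assumed constant)

    p_j = counts.sum(axis=0) / (n * m)   # category prevalences
    P_i = (np.square(counts).sum(axis=1) - m) / (m * (m - 1))  # per-subject agreement
    P_bar = P_i.mean()                   # observed agreement
    P_e = np.square(p_j).sum()           # chance-expected agreement

    return (P_bar - P_e) / (1 - P_e)

# Example: 4 subjects, 3 categories, 5 raters each
ratings = [[5, 0, 0],
           [3, 2, 0],
           [1, 1, 3],
           [0, 5, 0]]
print(round(fleiss_kappa(ratings), 3))
```

As described in the abstract, the proposed Lambda replaces the chance-expected term with an expectation driven by category prevalence rather than random rater assignment; its exact form is given in the article itself.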
An Evaluation of the Replicable Factor Analytic Solutions Algorithm for Variable Selection: A Simulation Study
Sass DA and Sanchez MA
Observed variable and factor selection are critical components of factor analysis, particularly when the optimal subset of observed variables and the number of factors are unknown and results cannot be replicated across studies. The Replicable Factor Analytic Solutions (RFAS) algorithm was developed to assess the replicability of factor structures-both in terms of the number of factors and the variables retained-while identifying the "best" or most replicable solutions according to predefined criteria. This study evaluated RFAS performance across 54 experimental conditions that varied in model complexity (six-factor models), interfactor correlations (ρ = 0, .30, and .60), and sample sizes (N = 300, 500, and 1000). Under default settings, RFAS generally performed well and demonstrated its utility in producing replicable factor structures. However, performance declined with highly correlated factors, smaller sample sizes, and more complex models. RFAS was also compared to four alternative variable selection methods: Ant Colony Optimization (ACO), Weighted Group Least Absolute Shrinkage and Selection Operator (LASSO), and stepwise procedures based on target Tucker-Lewis Index (TLI) and ΔTLI criteria. Stepwise and LASSO methods were largely ineffective at eliminating problematic variables under the studied conditions. In contrast, both RFAS and ACO successfully removed variables as intended, although the resulting factor structures often differed substantially between the two approaches. As with other variable selection methods, refining algorithmic criteria may be necessary to further enhance model performance.
On the Complex Sources of Differential Item Functioning: A Comparison of Three Methods
Lee H, Huang S, Svetina Valdivia D and Schwartzman B
Differential item functioning (DIF) has been a long-standing problem in educational and psychological measurement. In practice, the source from which DIF originates can be complex in the sense that an item can show DIF on multiple background variables of different types simultaneously. Although a variety of non-item response theory (IRT)-based and IRT-based DIF detection methods have been introduced, they do not sufficiently address the issue of DIF evaluation when its source is complex. The recently proposed Least Absolute Shrinkage and Selection Operator (LASSO) regularization method has shown promising results in detecting DIF on multiple background variables. To provide more insight, in this study, we compared three DIF detection methods, including the non-IRT-based logistic regression (LR), the IRT-based likelihood ratio test (LRT), and LASSO regularization, through a comprehensive simulation and an empirical data analysis. We found that when multiple background variables were considered, the Type I error and power rates of the three methods for identifying DIF items on one of the variables depended not only on the sample size and its DIF magnitude but also on the DIF magnitude of the other background variable and the correlation between them. We presented other findings and discussed the limitations and future research directions in this paper.
Agreement Lambda for Weighted Disagreement With Ordinal Scales: Correction for Category Prevalence
Almehrizi RS
Weighted inter-rater agreement allows for differentiation between levels of disagreement among rating categories and is especially useful when there is an ordinal relationship between categories. Many existing weighted inter-rater agreement coefficients are either extensions of weighted Kappa or are formulated as Cohen's Kappa-like coefficients. These measures suffer from the same issues as Cohen's Kappa, including sensitivity to the marginal distributions of raters and the effects of category prevalence. They primarily account for the possibility of chance agreement or disagreement. This article introduces a new coefficient, weighted Lambda, which allows for the inclusion of varying weights assigned to disagreements. Unlike traditional methods, this coefficient does not assume random assignment and does not adjust for chance agreement or disagreement. Instead, it modifies the observed percentage of agreement while taking into account the anticipated impact of prevalence-agreement effects. The study also outlines techniques for estimating sampling standard errors, conducting hypothesis tests, and constructing confidence intervals for weighted Lambda. Illustrative numerical examples and Monte Carlo simulations are presented to investigate and compare the performance of the new weighted Lambda with commonly used weighted inter-rater agreement coefficients across various true agreement levels and agreement matrices. Results demonstrate several advantages of the new coefficient in measuring weighted inter-rater agreement.
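For context, the conventional weighted Kappa that the abstract contrasts against can be written as follows; this is the standard formulation, not the article's weighted Lambda, whose definition appears in the paper:

```latex
\kappa_w \;=\; 1 \;-\;
\frac{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, p_{ij}}
     {\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, p_{i\cdot}\, p_{\cdot j}},
\qquad
w_{ij} = \frac{|i-j|}{k-1} \ \text{(linear)}
\quad\text{or}\quad
w_{ij} = \frac{(i-j)^2}{(k-1)^2} \ \text{(quadratic)},
```

where the p_ij are observed joint proportions, p_i. and p.j are the raters' marginal proportions, and the disagreement weights w_ij grow with the ordinal distance between categories. The weighted Lambda keeps the weighting of disagreements but, per the abstract, adjusts the observed agreement for prevalence effects rather than for chance.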
Reliability as Projection in Operator-Theoretic Test Theory: Conditional Expectation, Hilbert Space Geometry, and Implications for Psychometric Practice
Zumbo BD
This article reconceptualizes reliability as a theorem derived from the projection geometry of Hilbert space rather than an assumption of classical test theory. Within this framework, the true score is defined as the conditional expectation of the observed score given the latent variable, representing the orthogonal projection of the observed score onto the σ-algebra of the latent variable. Reliability, expressed as the ratio of true-score variance to observed-score variance, quantifies the efficiency of this projection: the squared cosine between the observed score and its true-score projection. This formulation unifies reliability with regression R², factor-analytic communality, and predictive accuracy in stochastic models. The operator-theoretic perspective clarifies that measurement error corresponds to the orthogonal complement of the projection, and reliability reflects the alignment between observed and latent scores. Numerical examples and measure-theoretic proofs illustrate the framework's generality. The approach provides a rigorous mathematical foundation for reliability, connecting psychometric theory with modern statistical and geometric analysis.
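A compact statement of the geometry described above, using standard classical-test-theory notation (the symbols below are chosen for illustration, since the abstract's inline formulas did not survive extraction):

```latex
T = \mathbb{E}\!\left[X \mid \mathcal{G}\right], \qquad \mathcal{G} = \sigma(\eta),
\qquad X = T + E, \quad \mathbb{E}\!\left[E \mid \mathcal{G}\right] = 0,
\qquad
\rho_{XX'} = \frac{\operatorname{Var}(T)}{\operatorname{Var}(X)}
           = \cos^{2}\theta\bigl(X - \mathbb{E}[X],\; T - \mathbb{E}[T]\bigr),
```

so the error E lies in the orthogonal complement of the projection, and reliability measures how closely the centered observed score aligns with its projection onto the σ-algebra generated by the latent variable η.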
Guessing During Testing is a Person Attribute Not an Instrument Parameter
Sideridis GD and Alghamdi M
The three-parameter logistic (3PL) model in item-response theory (IRT) has long been used to account for guessing in multiple-choice assessments through a fixed item-level parameter. However, this approach treats guessing as a property of the test item rather than the individual, potentially misrepresenting the cognitive processes underlying the examinee's behavior. This study evaluates a novel alternative, the Two-Parameter Logistic Extension (2PLE) model, which re-conceptualizes guessing as a function of a person's ability rather than as an item-specific constant. Using Monte Carlo simulation and empirical data from the PIRLS 2021 reading comprehension assessment, we compared the 3PL and 2PLE models on the recovery of latent ability, predictive fit (Leave-One-Out Information Criterion [LOOIC]), and theoretical alignment with test-taking behavior. The simulation results demonstrated that although both models performed similarly in terms of root-mean-squared error (RMSE) for ability estimates, the 2PLE model consistently achieved superior LOOIC values across conditions, particularly with longer tests and larger sample sizes. In an empirical analysis involving the reading achievement of 131 fourth-grade students from Saudi Arabia, model comparison again favored 2PLE, with a statistically significant LOOIC difference (ΔLOOIC = 0.482, = 2.54). Importantly, person-level guessing estimates derived from the 2PLE model were significantly associated with established person-fit statistics (C*, U3), supporting their criterion validity. These findings suggest that the 2PLE model provides a more cognitively plausible and statistically robust representation of examinee behavior by embedding an ability-dependent guessing function.
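For reference, the standard 3PL item response function is shown below together with one way an ability-dependent guessing term could enter a two-parameter extension; the abstract does not give the 2PLE's functional form, so the second line is only an illustrative assumption:

```latex
\text{3PL:}\quad
P(X_{pi}=1 \mid \theta_p) = c_i + (1-c_i)\,
\frac{\exp\!\bigl[a_i(\theta_p - b_i)\bigr]}{1+\exp\!\bigl[a_i(\theta_p - b_i)\bigr]},
\qquad
\text{illustrative 2PLE-style form:}\quad
P(X_{pi}=1 \mid \theta_p) = g(\theta_p) + \bigl[1-g(\theta_p)\bigr]\,
\frac{\exp\!\bigl[a_i(\theta_p - b_i)\bigr]}{1+\exp\!\bigl[a_i(\theta_p - b_i)\bigr]},
```

where the fixed item-level lower asymptote c_i is replaced by a person-level guessing function g(θ_p), consistent with the abstract's claim that guessing is a person attribute rather than an instrument parameter.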
Network Approaches to Binary Assessment Data: Network Psychometrics Versus Latent Space Item Response Models
De Carolis L and Jeon M
This study compares two network-based approaches for analyzing binary psychological assessment data: network psychometrics and latent space item response modeling (LSIRM). Network psychometrics, a well-established method, infers relationships among items or symptoms based on pairwise conditional dependencies. In contrast, LSIRM is a more recent framework that represents item responses as a bipartite network of respondents and items embedded in a latent metric space, where the likelihood of a response decreases with increasing distance between the respondent and item. We evaluate the performance of both methods through simulation studies under varying data-generating conditions. In addition, we demonstrate their applications to real assessment data, showcasing the distinct insights each method offers to researchers and practitioners.
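The latent space item response model referenced above is commonly written with a distance term in the linear predictor; a sketch of that form (notation assumed here, not taken from the article):

```latex
\operatorname{logit}\, P(Y_{pi}=1) \;=\; \theta_p + \beta_i \;-\; \gamma\,\lVert z_p - w_i \rVert,
```

where θ_p and β_i are respondent and item main effects, z_p and w_i are the latent positions of respondent p and item i in a metric space, and γ ≥ 0 governs how quickly the response probability decays as the respondent-item distance grows, matching the description in the abstract.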
Correcting the Variance of Effect Sizes Based on Binary Outcomes for Clustering
Hedges LV
Researchers conducting systematic reviews and meta-analyses often encounter studies in which the research design is a well-conducted cluster-randomized trial, but the statistical analysis does not take clustering into account. For example, the study might assign treatments by clusters, but the analysis may not take into account the clustered treatment assignment. Alternatively, the analysis of the primary outcome of the study might take clustering into account, but the reviewer might be interested in another outcome for which only summary data are available in a form that does not take clustering into account. This article provides expressions for the approximate variance of risk differences, log risk ratios, and log odds ratios computed from clustered binary data, using intraclass correlations. An example illustrates the calculations. References to empirical estimates of intraclass correlations are provided.
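The article's exact variance expressions are given in the paper itself, but the familiar design-effect logic behind such corrections can be sketched: a naive variance computed while ignoring clustering is inflated by roughly 1 + (m − 1)ρ for average cluster size m and intraclass correlation ρ. A minimal illustration with hypothetical numbers:

```python
def cluster_adjusted_variance(naive_variance, avg_cluster_size, icc):
    """Approximate variance correction for clustering via the design effect.

    Generic design-effect adjustment, not the article's exact expressions for
    risk differences, log risk ratios, and log odds ratios.
    """
    design_effect = 1 + (avg_cluster_size - 1) * icc
    return naive_variance * design_effect

# Hypothetical log odds ratio whose variance was computed ignoring clustering
v_naive = 0.040
v_adjusted = cluster_adjusted_variance(v_naive, avg_cluster_size=25, icc=0.02)
print(v_adjusted)   # 0.040 * (1 + 24 * 0.02) = 0.0592
```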
Path Analysis With Mixed-Scale Variables: Categorical ML, Least Squares, and Bayesian Estimations
Liang X, Castro P, Cao C and Lo WJ
In applied research across education, the social and behavioral sciences, and medicine, path models frequently incorporate both continuous and ordinal manifest variables to predict binary outcomes. This study employs Monte Carlo simulations to evaluate six estimators: robust maximum likelihood with probit and logit links (MLR-probit, MLR-logit), mean- and variance-adjusted weighted and unweighted least squares (WLSMV, ULSMV), and Bayesian methods with noninformative and weakly informative priors (Bayes-NI, Bayes-WI). Across various sample sizes, variable scales, and effect sizes, results show that WLSMV and Bayes-WI consistently achieve low bias and RMSE, particularly in small samples or when mediators have few categories. By contrast, categorical MLR approaches tended to yield unstable estimates for modest effects. These findings offer practical guidance for selecting estimators in mixed-scale path analyses and underscore their implications for robust inference.
Common Persons Design in Score Equating: A Monte Carlo Investigation
Liu J, Jiang Z, Zheng T, Han Y and Feng S
The Common Persons (CP) equating design offers critical advantages for high-security testing contexts-eliminating anchor item exposure risks while accommodating non-equivalent groups-yet few studies have systematically examined how CP characteristics influence equating accuracy, and the field still lacks clear implementation guidelines. Addressing this gap, this comprehensive Monte Carlo simulation (N = 5,000 examinees per form; 500 replications) evaluates CP equating by manipulating eight factors: test length, difficulty shift, ability dispersion, correlation between test forms, and CP characteristics. Four equating methods (identity, IRT true-score, linear, equipercentile) were compared using normalized RMSE and %Bias. Key findings reveal that (a) when the CP sample size reaches at least 30, CP sample properties exert negligible influence on accuracy, challenging assumptions about distributional representativeness; (b) test factors dominate outcomes-difficulty shifts ( = 1) degrade IRT precision severely (|%Bias| >22% vs. linear/equipercentile's |%Bias| <1.5%), while longer tests reduce NRMSE and wider ability dispersion ( = 1) enhances precision through improved person-item targeting; and (c) equipercentile and linear methods demonstrate superior robustness under form differences. We establish minimum operational thresholds: ≥30 CPs covering the score range suffice for precise equating. These results provide an evidence-based framework for CP implementation by systematically examining multiple manipulated factors, resolving security-versus-accuracy tradeoffs in high-stakes equating (e.g., credentialing exams) and enabling novel solutions such as synthetic respondents.
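Of the four equating methods compared, the linear method has the simplest closed form; a minimal sketch of mean-sigma linear equating of Form X scores onto the Form Y scale (illustrative only, with hypothetical data; the study's IRT true-score and equipercentile procedures are more involved):

```python
import numpy as np

def linear_equate(x_scores, y_scores):
    """Return a function mapping Form X raw scores onto the Form Y scale.

    Mean-sigma linear equating: match the first two moments of the two forms.
    """
    mu_x, sd_x = np.mean(x_scores), np.std(x_scores, ddof=1)
    mu_y, sd_y = np.mean(y_scores), np.std(y_scores, ddof=1)
    return lambda x: sd_y / sd_x * (np.asarray(x) - mu_x) + mu_y

# Hypothetical score vectors from the common-persons sample on both forms
rng = np.random.default_rng(0)
form_x = rng.normal(30, 6, size=200)
form_y = rng.normal(33, 7, size=200)

to_y_scale = linear_equate(form_x, form_y)
print(np.round(to_y_scale([20, 30, 40]), 2))
```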
Evaluation of Item Fit With Output From the EM Algorithm: RMSD Index Based on Posterior Expectations
Kim YK, Cai L and Kim Y
In item response theory modeling, item fit analysis using posterior expectations, otherwise known as pseudocounts, has many advantages. They are readily obtained from the E-step output of the Bock-Aitkin Expectation-Maximization (EM) algorithm and continue to function as a basis of evaluating model fit, even when missing data are present. This paper aimed to improve the interpretability of the root mean squared deviation (RMSD) index based on posterior expectations. In Study 1, we assessed its performance using two approaches. First, we employed the poor person's posterior predictive model checking (PP-PPMC) to compute significance levels for the RMSD values. The resulting Type I error was generally controlled below the nominal level, but power noticeably declined with smaller sample sizes and shorter test lengths. Second, we used receiver operating characteristic (ROC) curve analysis to empirically determine the reference values (cutoff thresholds) that achieve an optimal balance between false-positive and true-positive rates. Importantly, we identified optimal reference values for each combination of sample size and test length in the simulation conditions. The cutoff threshold approach outperformed the PP-PPMC approach, with greater gains in true-positive rates than losses from the inflated false-positive rates. In Study 2, we extended the cutoff threshold approach to conditions with larger sample sizes and longer test lengths. Moreover, we evaluated the performance of the optimized cutoff thresholds under varying levels of data missingness. Finally, we employed response surface analysis to develop a prediction model that generalizes the way the reference values vary with sample size and test length. Overall, this study demonstrates the application of the PP-PPMC for item fit diagnostics and implements a practical frequentist approach to empirically derive reference values. Using our prediction model, practitioners can compute the reference values of RMSD that are tailored to their dataset's sample size and test length.
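The RMSD item fit index built from posterior expectations (pseudocounts) compares the model-implied item response function with an "observed" function reconstructed at the EM quadrature points; a standard way to write it (notation assumed here):

```latex
\mathrm{RMSD}_i \;=\;
\sqrt{\;\sum_{q=1}^{Q} \tilde{f}(\theta_q)\,
\bigl[\tilde{P}_i(\theta_q) - P_i(\theta_q)\bigr]^{2}\;},
```

where the θ_q are the quadrature nodes, \tilde{P}_i(θ_q) is the pseudocount-based observed proportion correct at node q, P_i(θ_q) is the model-implied probability, and \tilde{f}(θ_q) is the normalized posterior weight at that node.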
Reducing Calibration Bias for Person Fit Assessment by Mixture Model Expansion
Braeken J and van Laar S
Measurement appropriateness concerns the question of whether the test or survey scale under consideration can provide a valid measure for a specific individual. An aberrant item response pattern would provide internal counterevidence against using the test/scale for this person, whereas a more typical item response pattern would imply a fit of the measure to the person. Traditional approaches, including the popular Lz person fit statistic, are hampered by their two-stage estimation procedure and the fact that the fit for the person is determined based on the model calibrated on data that include the misfitting persons. This calibration bias creates suboptimal conditions for person fit assessment. Solutions have been sought through the derivation of approximating bias-correction formulas and/or iterative purification procedures. Yet, here we discuss an alternative one-stage solution that involves calibrating a model expansion of the measurement model that includes a mixture component for target aberrant response patterns. A simulation study evaluates the approach under the most unfavorable and least-studied conditions for person fit indices: short polytomous survey scales, similar to those found in large-scale educational assessments such as the Programme for International Student Assessment or the Trends in International Mathematics and Science Study.
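The Lz statistic mentioned above standardizes a respondent's response-pattern log-likelihood against its model-implied mean and variance, evaluated at the estimated ability, which is exactly where the two-stage calibration bias enters. Written here for dichotomous items (the study's polytomous case uses the category response functions instead):

```latex
l_z \;=\; \frac{l_0(\hat{\theta}) - \mathbb{E}\bigl[l_0(\hat{\theta})\bigr]}
               {\sqrt{\operatorname{Var}\bigl[l_0(\hat{\theta})\bigr]}},
\qquad
l_0(\theta) \;=\; \sum_{i=1}^{I}
\Bigl[x_i \log P_i(\theta) + (1-x_i)\log\bigl\{1-P_i(\theta)\bigr\}\Bigr],
```

with large negative values of l_z taken as evidence of an aberrant response pattern.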
Dimensionality Assessment in Forced-Choice Questionnaires: First Steps Toward an Exploratory Framework
Graña DF, Kreitchmann RS, Sorrel MA, Garrido LE and Abad FJ
Forced-choice (FC) questionnaires have gained increasing attention as a strategy to reduce social desirability in self-reports, supported by advancements in confirmatory models that address the ipsativity of FC test scores. However, these models assume a known dimensionality and structure, which can be overly restrictive or fail to fit the data adequately. Consequently, exploratory models can be required, with accurate dimensionality assessment as a critical first step. FC questionnaires also pose unique challenges for dimensionality assessment, due to their inherently complex multidimensional structures. Despite this, no prior studies have systematically evaluated dimensionality assessment methods for FC data. To fill this gap, the present study examines five commonly used methods: the Kaiser Criterion, Empirical Kaiser Criterion, Parallel Analysis (PA), Hull Method, and Exploratory Graph Analysis. A Monte Carlo simulation study was conducted, manipulating key design features of FC questionnaires, such as the number of dimensions, items per dimension, response formats (e.g., binary vs. graded), and block composition (e.g., inclusion of heteropolar and unidimensional blocks), as well as factor loadings, inter-factor correlations, and sample size. Results showed that the Empirical Kaiser Criterion and PA methods outperformed the others, achieving higher accuracy and lower bias. Performance improved particularly when heteropolar or unidimensional blocks were included or when the questionnaire length increased. These findings emphasize the importance of thoughtful FC test design and provide practical recommendations for improving dimensionality assessment in this format.
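Parallel analysis, one of the better-performing methods in the study, retains factors whose observed eigenvalues exceed those obtained from random data of the same size. A minimal sketch of the generic algorithm on Pearson correlations (the FC setting in the article works from modeled item/block responses, so this is only the basic idea, with illustrative data):

```python
import numpy as np

def parallel_analysis(data, n_sims=200, percentile=95, seed=0):
    """Suggest the number of factors via Horn's parallel analysis.

    Compares eigenvalues of the observed correlation matrix with the given
    percentile of eigenvalues from random normal data of the same shape.
    """
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs_eig = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]

    sim_eig = np.empty((n_sims, p))
    for s in range(n_sims):
        random_data = rng.standard_normal((n, p))
        sim_eig[s] = np.linalg.eigvalsh(np.corrcoef(random_data, rowvar=False))[::-1]
    threshold = np.percentile(sim_eig, percentile, axis=0)

    keep = 0
    for lam, thr in zip(obs_eig, threshold):   # count leading eigenvalues above threshold
        if lam > thr:
            keep += 1
        else:
            break
    return keep

# Hypothetical data: 500 respondents, 12 variables of pure noise
rng = np.random.default_rng(1)
data = rng.standard_normal((500, 12))
print(parallel_analysis(data))   # expected to suggest ~0 factors for noise
```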
Using Item Scores and Response Times to Detect Item Compromise in Computerized Adaptive Testing
Lee C, Gorney K and Chen J
Sequential procedures have been shown to be effective methods for real-time detection of compromised items in computerized adaptive testing. In this study, we propose three item response theory-based sequential procedures that involve the use of item scores and response times (RTs). The first procedure requires that either the score-based statistic or the RT-based statistic be extreme, the second procedure requires that both the score-based statistic and the RT-based statistic be extreme, and the third procedure requires that a combined score and RT-based statistic be extreme. Results suggest that the third procedure is the most promising, providing a reasonable balance between the false-positive rate and the true-positive rate while also producing relatively short lag times across a wide range of simulation conditions.
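The three decision rules described above differ only in how the score-based and RT-based evidence is combined; a schematic sketch of the flagging logic, with hypothetical standardized statistics and critical values (the study's actual statistics are IRT-based and accumulated sequentially across examinees):

```python
def flag_item(score_stat, rt_stat, combined_stat,
              crit_score, crit_rt, crit_combined):
    """Illustrative flagging rules for a potentially compromised item.

    score_stat, rt_stat, combined_stat: detection statistics accumulated for
    the item so far; crit_*: their critical values.
    """
    rule_either = (score_stat > crit_score) or (rt_stat > crit_rt)    # procedure 1
    rule_both = (score_stat > crit_score) and (rt_stat > crit_rt)     # procedure 2
    rule_combined = combined_stat > crit_combined                     # procedure 3
    return rule_either, rule_both, rule_combined

print(flag_item(2.8, 1.2, 2.4, crit_score=2.5, crit_rt=2.5, crit_combined=2.0))
# (True, False, True)
```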
Impacts of DIF Item Balance and Effect Size Incorporation With the Rasch Tree
Asamoah NAB, Turner RC, Lo WJ, Crawford BL and Jozkowski KN
Ensuring fairness in educational and psychological assessments is critical, particularly in detecting differential item functioning (DIF), where items perform differently across subgroups. The Rasch tree method, a model-based recursive partitioning approach, is an innovative and flexible DIF detection tool that does not require the pre-specification of focal and reference groups. However, research systematically examining its performance under realistic measurement conditions, such as when multiple DIF items do not consistently favor one subgroup, is limited. This study builds on prior research by evaluating the Rasch tree method's ability to detect DIF, investigating the impact of DIF balance along with other key factors such as DIF magnitude, sample size, test length, and contamination level. Additionally, we incorporate the Educational Testing Service effect size heuristic as a criterion and compare the resulting DIF detection rates with those based on statistical significance alone. Results indicate that the Rasch tree has better true DIF detection rates under balanced DIF conditions and large DIF magnitudes. However, its accuracy declines when DIF is unbalanced and the percentage of DIF contamination increases. The use of an effect size criterion reduces the detection of negligible DIF. Caution is recommended with smaller samples, where detection rates are lowest, especially for larger DIF magnitudes and increased DIF contamination percentages in unbalanced conditions. The study highlights the strengths and limitations of the Rasch tree method under a variety of conditions, underscores the importance of DIF group imbalance, and provides recommendations for optimizing DIF detection in practical assessment scenarios.
The One-Parameter Logistic Model Can Be True With Zero Probability for a Unidimensional Measuring Instrument: How One Could Go Wrong Removing Items Not Satisfying the Model
Raykov T and Zhang B
This note is concerned with the chance of the one-parameter logistic (1PL) model or the Rasch model being true for a unidimensional multi-item measuring instrument. It is pointed out that if a single dimension underlies a scale consisting of dichotomous items, then the probability of either model being correct for that scale can be zero. The question is then addressed of what the consequences could be of removing items that do not follow these models. Using a large number of simulated data sets, a pair of empirically relevant settings is presented where such item elimination can be problematic. Specifically, dropping items from a unidimensional instrument because they do not satisfy the 1PL model, or the Rasch model, can yield potentially seriously misleading ability estimates with increased standard errors and prediction error with respect to the latent trait. Implications for educational and behavioral research are discussed.
Human Expertise and Large Language Model Embeddings in the Content Validity Assessment of Personality Tests
Milano N, Ponticorvo M and Marocco D
In this article, we explore the application of Large Language Models (LLMs) in assessing the content validity of psychometric instruments, focusing on the Big Five Questionnaire (BFQ) and Big Five Inventory (BFI). Content validity, a cornerstone of test construction, ensures that psychological measures adequately cover their intended constructs. Using both human expert evaluations and advanced LLMs, we compared the accuracy of semantic item-construct alignment. Graduate psychology students employed the Content Validity Ratio to rate test items, forming the human baseline. In parallel, state-of-the-art LLMs, including multilingual and fine-tuned models, analyzed item embeddings to predict construct mappings. The results reveal distinct strengths and limitations of human and AI approaches. Human validators excelled in aligning the behaviorally rich BFQ items, while LLMs performed better with the linguistically concise BFI items. Training strategies significantly influenced LLM performance, with models tailored for lexical relationships outperforming general-purpose LLMs. Here we highlight the complementary potential of hybrid validation systems that integrate human expertise and AI precision. The findings underscore the transformative role of LLMs in psychological assessment, paving the way for scalable, objective, and robust test development methodologies.
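The embedding-based mapping described above amounts to assigning each item to the construct whose description is nearest in embedding space. A minimal sketch with a placeholder embed() function standing in for whichever LLM embedding model is used; all names and the example strings here are hypothetical, not the article's materials:

```python
import numpy as np

def embed(text):
    """Placeholder: return an embedding vector for `text` from an LLM of choice."""
    raise NotImplementedError("plug in an embedding model here")

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict_construct(item_text, construct_descriptions):
    """Map an item to the construct whose description has the most similar embedding."""
    item_vec = embed(item_text)
    scores = {name: cosine(item_vec, embed(desc))
              for name, desc in construct_descriptions.items()}
    return max(scores, key=scores.get), scores

# Usage sketch (hypothetical construct descriptions):
# constructs = {"Extraversion": "sociable, assertive, energetic ...",
#               "Neuroticism": "anxious, moody, easily upset ...", ...}
# predicted, sims = predict_construct("I am the life of the party.", constructs)
```

The predicted construct can then be compared against the expert Content Validity Ratio ratings to gauge item-construct alignment, in the spirit of the hybrid validation the article proposes.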
How to Improve the Regression Factor Score Predictor When Individuals Have Different Factor Loadings
Beauducel A, Hilger N and Weide AC
Previous research has shown that ignoring individual differences of factor loadings in conventional factor models may reduce the determinacy of factor score predictors. Therefore, the aim of the present study is to propose a heterogeneous regression factor score predictor (HRFS) with larger determinacy than the conventional regression factor score predictor (RFS) when individuals have different factor loadings. First, a method for the estimation of individual loadings is proposed. The individual loading estimates are used to compute the HRFS. Then, a binomial test for loading heterogeneity of a factor is proposed to compute the HRFS only when the test is significant. Otherwise, the conventional RFS should be used. A simulation study reveals that the HRFS has larger determinacy than the conventional RFS in populations with substantial loading heterogeneity. An empirical example based on subsamples drawn randomly from a large sample of Big Five Markers indicates that the determinacy can be improved for the factor emotional stability when the HRFS is computed.
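For orientation, the conventional regression factor score predictor that the HRFS generalizes, and its determinacy, can be written as follows; this is the standard formulation with standardized factors, whereas the HRFS substitutes person-specific loading estimates as described in the article:

```latex
\hat{F}_{\mathrm{RFS}} \;=\; \Phi\,\Lambda^{\top}\Sigma^{-1}x,
\qquad
\rho_{j} \;=\; \sqrt{\bigl(\Phi\,\Lambda^{\top}\Sigma^{-1}\Lambda\,\Phi\bigr)_{jj}},
```

where Λ contains the common factor loadings, Φ the factor correlations, Σ = ΛΦΛ⊤ + Ψ the model-implied covariance matrix of the observed variables x, and ρ_j the determinacy, that is, the correlation between factor j and its regression predictor.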
The Dominant Trait Profile Method of Scoring Multidimensional Forced-Choice Questionnaires
Dimitrov DM
Proposed is a new method of scoring multidimensional forced-choice (MFC) questionnaires referred to as the dominant trait profile (DTP) method. The DTP method identifies a dominant response vector (DRV) for each trait-a vector of binary scores for preferences in item pairs within MFC blocks from the perspective of a respondent for whom the trait under consideration dominates over the other traits being measured. The respondents' observed response vectors are matched to the DRV for each trait to produce (1/0) matching scores that are then analyzed via latent trait modeling, with scaling options of (a) a bounded D-scale (from 0 to 1) or (b) an item response theory logit scale. The DTP method allows for the comparison of individuals on a trait of interest, as well as their standing in relation to a dominant trait "standard" (criterion). The study results indicate that DTP-based trait estimates are highly correlated with those produced by the popular Thurstonian item response theory model and the Zinnes and Griggs pairwise preference item response theory model, while avoiding the complexity of their designs and some computational issues.
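The matching-score step of the DTP method, as described, reduces to comparing each respondent's binary preference vector with the trait's dominant response vector. A schematic sketch with hypothetical data (the DRV construction and the subsequent latent trait modeling follow the article):

```python
import numpy as np

def dtp_matching_scores(responses, drv):
    """Binary matching scores of observed response vectors against a trait's DRV.

    responses: (n_respondents x n_pairs) 0/1 preferences in MFC item pairs.
    drv: length n_pairs 0/1 dominant response vector for the target trait.
    Returns an (n_respondents x n_pairs) 0/1 matrix of matches, which would then
    be analyzed with a latent trait model (bounded D-scale or IRT logit scaling).
    """
    responses = np.asarray(responses)
    drv = np.asarray(drv)
    return (responses == drv).astype(int)

# Hypothetical: 3 respondents, 4 item pairs, DRV for the target trait
obs = [[1, 0, 1, 1],
       [0, 0, 1, 0],
       [1, 1, 1, 1]]
drv = [1, 0, 1, 0]
print(dtp_matching_scores(obs, drv))
```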
A Comparison of LTA Models with and Without Residual Correlation in Estimating Transition Probabilities
Lee NY, Yoon S and Hong S
In longitudinal mixture models such as latent transition analysis (LTA), identical items are often measured repeatedly across multiple time points to define latent classes, and individuals tend to show similar response patterns on these items over time, which gives rise to residual correlations. Therefore, this study hypothesized that an LTA model assuming residual correlations among indicator variables measured repeatedly across multiple time points would provide more accurate estimates of transition probabilities than a traditional LTA model. To test this hypothesis, a Monte Carlo simulation was conducted to generate data both with and without specified residual correlations among the repeatedly measured indicator variables, and the two LTA models-one that accounted for residual correlations and one that did not-were compared. The simulation conditions included transition probabilities, numbers of indicator variables, sample sizes, and levels of residual correlations. Estimation performance was compared on the basis of parameter estimate bias, mean squared error, and coverage. The results demonstrate that LTA with residual correlations outperforms traditional LTA in estimating transition probabilities, and the differences between the two models become prominent when the residual correlation is .3 or higher. This research integrates the characteristics of longitudinal data into an LTA simulation study and suggests an improved version of LTA estimation.
Proportion Explained Component Variance in Second-Order Scales: A Note on a Latent Variable Modeling Approach
Raykov T, DiStefano C and Ransome Y
A procedure for evaluation of the proportion explained component variance by the underlying trait in behavioral scales with second-order structure is outlined. The resulting index of accounted for variance over all scale components is a useful and informative complement to the conventional omega-hierarchical coefficient as well as the proportion of explained component correlation. A point and interval estimation method is described for the discussed index, which utilizes a confirmatory factor analysis approach within the latent variable modeling methodology. The procedure can be used with widely available software and is illustrated on data.