Scott's pi coefficient
The pi coefficient is a chance-adjusted index of the reliability of categorical measurements. It estimates chance agreement using a distribution-based approach, which assumes that the raters share a single distribution of category usage, as if they had conspired to meet a common "quota" for each category.
Scott (1955) proposed the pi coefficient to estimate the reliability of two raters assigning items to nominal categories. Fleiss (1971) extended the pi coefficient to accommodate multiple raters, and Gwet (2014) generalized it further to accommodate multiple raters, any weighting scheme, and missing data. The generalized formulas provided here, and instantiated in the FULL_PI function, correspond to Gwet's formulation (which he refers to as the generalized Fleiss' kappa coefficient); the simplified formulas correspond to Scott's original formulation. It is also worth noting that several other reliability indices are equivalent to Scott's pi coefficient, including Siegel and Castellan's (1988) revised kappa coefficient and Byrt, Bishop, and Carlin's (1993) bias-adjusted kappa coefficient.
Use these formulas with two raters and two (dichotomous) categories:
$$\pi = \frac{p_o - p_c}{1 - p_c}$$

$$p_o = \frac{a + d}{n}$$

$$p_c = \left(\frac{f_1 + g_1}{2n}\right)^2 + \left(\frac{f_2 + g_2}{2n}\right)^2$$

where:

- $a$ is the number of items both raters assigned to category 1
- $d$ is the number of items both raters assigned to category 2
- $n$ is the total number of items
- $f_1$ is the number of items rater 1 assigned to category 1
- $f_2$ is the number of items rater 1 assigned to category 2
- $g_1$ is the number of items rater 2 assigned to category 1
- $g_2$ is the number of items rater 2 assigned to category 2
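The two-rater, two-category case can be sketched in a few lines. This is an illustrative Python translation of the simplified formulas above, not the repository's MATLAB implementation; the function name and argument names are my own. Chance agreement squares the *pooled* marginal proportions (averaged across the two raters), which is what distinguishes pi from Cohen's kappa.

```python
def scotts_pi(a, d, f1, f2, g1, g2):
    """Scott's pi for two raters and two categories.

    a      : items both raters assigned to category 1
    d      : items both raters assigned to category 2
    f1, f2 : items rater 1 assigned to categories 1 and 2
    g1, g2 : items rater 2 assigned to categories 1 and 2
    """
    n = f1 + f2  # total items (each rater codes every item once)
    p_o = (a + d) / n  # percent observed agreement
    # Chance agreement from the pooled (averaged) marginal proportions
    p_c = ((f1 + g1) / (2 * n)) ** 2 + ((f2 + g2) / (2 * n)) ** 2
    return (p_o - p_c) / (1 - p_c)
```

For example, with 100 items where the raters agree on 40 items in category 1 and 30 items in category 2, and rater 1's marginals are 50/50 while rater 2's are 60/40, the observed agreement is 0.70, chance agreement is 0.505, and pi is roughly 0.394.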
Use these formulas with multiple raters, multiple categories, and any weighting scheme:
$$\pi = \frac{p_o - p_c}{1 - p_c}$$

$$p_o = \frac{1}{n'} \sum_{i=1}^{n'} \frac{\sum_{k=1}^{q} r_{ik} \left( r^*_{ik} - 1 \right)}{r_i \left( r_i - 1 \right)}, \qquad r^*_{ik} = \sum_{l=1}^{q} w_{kl} \, r_{il}$$

$$p_c = \sum_{k=1}^{q} \sum_{l=1}^{q} w_{kl} \, \pi_k \pi_l, \qquad \pi_k = \frac{1}{n} \sum_{i=1}^{n} \frac{r_{ik}}{r_i}$$

where:

- $q$ is the total number of categories
- $w_{kl}$ is the weight associated with two raters assigning an item to categories $k$ and $l$
- $r_{ik}$ is the number of raters that assigned item $i$ to category $k$
- $n'$ is the number of items that were coded by two or more raters
- $r_{il}$ is the number of raters that assigned item $i$ to category $l$
- $r_i$ is the number of raters that assigned item $i$ to any category
- $n$ is the total number of items
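The generalized formulas can be sketched as follows. This is an illustrative Python version (again, not the repository's MATLAB FULL_PI function; the name and input format are assumptions): each item is represented by a vector of per-category rater counts, so missing data is handled naturally because items coded by fewer than two raters contribute to the category proportions but not to observed agreement.

```python
def generalized_pi(ratings, weights=None):
    """Generalized Scott's pi (Gwet's generalized Fleiss' kappa).

    ratings : list of per-item category counts; ratings[i][k] is the
              number of raters that assigned item i to category k.
              Assumes every item was rated by at least one rater.
    weights : q-by-q agreement weight matrix (identity if None).
    """
    q = len(ratings[0])
    if weights is None:  # identity weights = unweighted (nominal) agreement
        weights = [[1.0 if k == l else 0.0 for l in range(q)]
                   for k in range(q)]

    n = len(ratings)                      # total items
    r_i = [sum(row) for row in ratings]   # raters per item
    multi = [i for i in range(n) if r_i[i] >= 2]  # items with 2+ raters
    n_prime = len(multi)

    # Percent observed agreement, averaged over multiply coded items
    p_o = 0.0
    for i in multi:
        # r_star[k] = sum_l w[k][l] * r[i][l] (weighted rater count)
        r_star = [sum(weights[k][l] * ratings[i][l] for l in range(q))
                  for k in range(q)]
        p_o += (sum(ratings[i][k] * (r_star[k] - 1) for k in range(q))
                / (r_i[i] * (r_i[i] - 1)))
    p_o /= n_prime

    # Percent chance agreement from the average category proportions
    pi_k = [sum(ratings[i][k] / r_i[i] for i in range(n)) / n
            for k in range(q)]
    p_c = sum(weights[k][l] * pi_k[k] * pi_k[l]
              for k in range(q) for l in range(q))

    return (p_o - p_c) / (1 - p_c)
```

With two raters, two categories, and identity weights, this reduces to the simplified formulas: the worked example above (40 items agreed in category 1, 30 in category 2, 30 disagreements) again yields approximately 0.394.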
- Scott, W. A. (1955). Reliability of content analysis: The case of nominal scaling. Public Opinion Quarterly, 19(3), 321–325.
- Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
- Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioral sciences (2nd ed.). New York, NY: McGraw-Hill.
- Byrt, T., Bishop, J., & Carlin, J. B. (1993). Bias, prevalence and kappa. Journal of Clinical Epidemiology, 46, 423–429.
- Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (4th ed.). Gaithersburg, MD: Advanced Analytics.