Scott's pi coefficient
The pi coefficient is a chance-adjusted index of the reliability of categorical measurements. It estimates chance agreement using a distribution-based approach, which assumes that the raters share a single distribution of category usage, as if they had conspired to meet a common "quota" for each category.
Scott (1955) proposed the pi coefficient to estimate the reliability of two raters assigning items to nominal categories. Fleiss (1971) extended the pi coefficient to accommodate multiple raters, and Gwet (2014) generalized it further to accommodate multiple raters, any weighting scheme, and missing data. The generalized formulas provided here, and instantiated in the FULL_PI function, correspond to Gwet's formulation (which he refers to as the generalized Fleiss' kappa coefficient); the simplified formulas correspond to Scott's original formulation. It is also worth noting that several other reliability indices are equivalent to Scott's pi coefficient, including Siegel and Castellan's (1988) revised kappa coefficient and Byrt, Bishop, and Carlin's (1993) bias-adjusted kappa coefficient.
Use these formulas with two raters and two (dichotomous) categories:
$$\pi = \frac{p_o - p_c}{1 - p_c}$$

$$p_o = \frac{a + d}{n}$$

$$p_c = \left(\frac{f_1 + g_1}{2n}\right)^2 + \left(\frac{f_2 + g_2}{2n}\right)^2$$

where:

- $a$ is the number of items both raters assigned to category 1
- $d$ is the number of items both raters assigned to category 2
- $n$ is the total number of items
- $f_1$ is the number of items rater 1 assigned to category 1
- $f_2$ is the number of items rater 1 assigned to category 2
- $g_1$ is the number of items rater 2 assigned to category 1
- $g_2$ is the number of items rater 2 assigned to category 2
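The two-rater, two-category case can be sketched in a few lines. This is an illustrative Python translation of the simplified formulas above, not the repository's MATLAB implementation; the function name and argument names are my own. Chance agreement squares the *pooled* marginal proportions (averaged across the two raters), which is what distinguishes pi from Cohen's kappa.

```python
def scotts_pi(a, d, f1, f2, g1, g2):
    """Scott's pi for two raters and two categories.

    a      : items both raters assigned to category 1
    d      : items both raters assigned to category 2
    f1, f2 : items rater 1 assigned to categories 1 and 2
    g1, g2 : items rater 2 assigned to categories 1 and 2
    """
    n = f1 + f2  # total items (each rater codes every item once)
    p_o = (a + d) / n  # percent observed agreement
    # Chance agreement from the pooled (averaged) marginal proportions
    p_c = ((f1 + g1) / (2 * n)) ** 2 + ((f2 + g2) / (2 * n)) ** 2
    return (p_o - p_c) / (1 - p_c)
```

For example, with 100 items where the raters agree on 40 items in category 1 and 30 items in category 2, and rater 1's marginals are 50/50 while rater 2's are 60/40, the observed agreement is 0.70, chance agreement is 0.505, and pi is roughly 0.394.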
Use these formulas with multiple raters, multiple categories, and any weighting scheme:
$$\pi = \frac{p_o - p_c}{1 - p_c}$$

$$p_o = \frac{1}{n'} \sum_{i=1}^{n'} \frac{\sum_{k=1}^{q} r_{ik} \left( r^*_{ik} - 1 \right)}{r_i \left( r_i - 1 \right)}, \qquad r^*_{ik} = \sum_{l=1}^{q} w_{kl} \, r_{il}$$

$$p_c = \sum_{k=1}^{q} \sum_{l=1}^{q} w_{kl} \, \pi_k \pi_l, \qquad \pi_k = \frac{1}{n} \sum_{i=1}^{n} \frac{r_{ik}}{r_i}$$

where:

- $q$ is the total number of categories
- $w_{kl}$ is the weight associated with two raters assigning an item to categories $k$ and $l$
- $r_{ik}$ is the number of raters that assigned item $i$ to category $k$
- $n'$ is the number of items that were coded by two or more raters
- $r_{il}$ is the number of raters that assigned item $i$ to category $l$
- $r_i$ is the number of raters that assigned item $i$ to any category
- $n$ is the total number of items
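The generalized formulas can be sketched as follows. This is an illustrative Python version (again, not the repository's MATLAB FULL_PI function; the name and input format are assumptions): each item is represented by a vector of per-category rater counts, so missing data is handled naturally because items coded by fewer than two raters contribute to the category proportions but not to observed agreement.

```python
def generalized_pi(ratings, weights=None):
    """Generalized Scott's pi (Gwet's generalized Fleiss' kappa).

    ratings : list of per-item category counts; ratings[i][k] is the
              number of raters that assigned item i to category k.
              Assumes every item was rated by at least one rater.
    weights : q-by-q agreement weight matrix (identity if None).
    """
    q = len(ratings[0])
    if weights is None:  # identity weights = unweighted (nominal) agreement
        weights = [[1.0 if k == l else 0.0 for l in range(q)]
                   for k in range(q)]

    n = len(ratings)                      # total items
    r_i = [sum(row) for row in ratings]   # raters per item
    multi = [i for i in range(n) if r_i[i] >= 2]  # items with 2+ raters
    n_prime = len(multi)

    # Percent observed agreement, averaged over multiply coded items
    p_o = 0.0
    for i in multi:
        # r_star[k] = sum_l w[k][l] * r[i][l] (weighted rater count)
        r_star = [sum(weights[k][l] * ratings[i][l] for l in range(q))
                  for k in range(q)]
        p_o += (sum(ratings[i][k] * (r_star[k] - 1) for k in range(q))
                / (r_i[i] * (r_i[i] - 1)))
    p_o /= n_prime

    # Percent chance agreement from the average category proportions
    pi_k = [sum(ratings[i][k] / r_i[i] for i in range(n)) / n
            for k in range(q)]
    p_c = sum(weights[k][l] * pi_k[k] * pi_k[l]
              for k in range(q) for l in range(q))

    return (p_o - p_c) / (1 - p_c)
```

With two raters, two categories, and identity weights, this reduces to the simplified formulas: the worked example above (40 items agreed in category 1, 30 in category 2, 30 disagreements) again yields approximately 0.394.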
- Scott, W. A. (1955). Reliability of content analysis: The case of nominal scaling. Public Opinion Quarterly, 19(3), 321–325.
- Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
- Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioral sciences (2nd ed.). New York, NY: McGraw-Hill.
- Byrt, T., Bishop, J., & Carlin, J. B. (1993). Bias, prevalence and kappa. Journal of Clinical Epidemiology, 46, 423–429.
- Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (4th ed.). Gaithersburg, MD: Advanced Analytics.