README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

library("shapley")
```
# shapley

[![R-CMD-check](https://github.com/elbersb/shapley/workflows/R-CMD-check/badge.svg)](https://github.com/elbersb/shapley/actions)
[![Coverage status](https://codecov.io/gh/elbersb/shapley/branch/master/graph/badge.svg)](https://codecov.io/github/elbersb/shapley?branch=master)

The Shapley value is a concept from game theory that quantifies how much each player contributes
to the game outcome (Shapley 1953). The concept, however, has many more use cases: it provides
a method to quantify the importance of predictors in regression analysis or machine learning models,
and can be used in a wide variety of decomposition problems (Shorrocks 2013).
Most implementations focus on one narrow use case, although the algorithm for the Shapley
value decomposition is always the same -- it is just the concrete value function that varies.
This package provides a simple algorithm for the Shapley value decomposition, and also
supports hierarchical decomposition using the Owen value.

The key advantage of the Shapley decomposition framework is the connection with counterfactuals:
Once appropriate counterfactuals for each combination of factors have been identified, the
method will produce an appropriate decomposition.

## Installation

``` r
devtools::install_github("elbersb/shapley")
```

## Usage

The package provides a `shapley` function that takes two main arguments: the value function and a
vector of factor names. The value function needs to be an R function that takes one or more
arguments, where the first argument defines the factors that are included in the calculation of the
outcome value. The `shapley` function will call the value function repeatedly,
each time with a different set of factors.

For a very simple example, consider that an outcome is determined by two factors "A" and "B", which
contribute 1 and 2, respectively. (The factors are linearly additive, which makes the use of Shapley
value decomposition unnecessary, but it works as an illustration.)
The value function is thus defined as:

``` {r}
simple <- function(factors = c()) {
    value <- 0
    if ("A" %in% factors) value <- value + 1
    if ("B" %in% factors) value <- value + 2
    return(value)
}
```

We now supply the value function to `shapley`, along with the factor names:

``` {r}
shapley(simple, c("A", "B"), silent = TRUE)
```

As expected, the marginal contributions of the two factors are 1 and 2, respectively.
For the two factors, we can manually compute the contribution as follows:

``` {r}
# A:
1/2 * (simple("A") - simple()) + 1/2 * (simple(c("A", "B")) - simple("B"))
# B:
1/2 * (simple("B") - simple()) + 1/2 * (simple(c("B", "A")) - simple("A"))
```

Across the two computations, most terms occur twice. Also note that
`simple(c("A", "B")) == simple(c("B", "A"))`. The `shapley` function only calculates each term
once, and then caches the result. This leads to great speed improvements once we consider
a greater number of factors.

## Example 1: Game theory

For this example (taken from [Wikipedia](https://en.wikipedia.org/wiki/Shapley_value#Glove_game)),
consider three players. Players 1 and 2 supply right-hand gloves,
while Player 3 supplies a left-hand glove. The game is only successful if players with both types
of gloves enter into a coalition.
We thus define the value function as 1 if pairs `{1,3}`, `{2,3}` or `{1,2,3}`
are formed, and 0 otherwise. In R code:

``` {r}
glove <- function(factors) {
    if (length(factors) > 1 & 3 %in% factors) return(1)
    return(0)
}
```

To compute the marginal contributions of each player, use:

``` {r}
shapley(glove, c(1, 2, 3), silent = TRUE)
```

## Example 2: Relative importance of predictors

Consider this simple regression model and its R<sup>2</sup>:

```{r}
model <- lm(mpg ~ wt + qsec + am, data = mtcars)
summary(model)$r.squared
```

The Shapley value decomposition allows us to determine how much each predictor
contributes to the R<sup>2</sup>. To do this, we need to define the value function in a way
that it runs the regression with the appropriate subset of predictors. It should return 0 when
there are no predictors:

``` {r}
reg_mtcars <- function(factors) {
    if (length(factors) == 0) return(0)
    formula <- paste("mpg ~", paste(factors, collapse = "+"))
    m <- lm(formula, data = mtcars)
    summary(m)$r.squared
}

# test - should be the same as above:
reg_mtcars(c("wt", "qsec", "am"))
```

```{r}
shapley(reg_mtcars, c("wt", "qsec", "am"), silent = TRUE)
```

We can also generalize the value function to apply to any dataset and dependent variable:

```{r}
reg <- function(factors, dv, data) {
    if (length(factors) == 0) return(0)
    formula <- paste(dv, "~", paste(factors, collapse = "+"))
    m <- lm(formula, data = data)
    summary(m)$r.squared
}

shapley(reg, c("cyl", "hp", "am"), silent = TRUE, dv = "wt", data = mtcars)
```

Note that there are many packages (e.g., [relaimpo](https://cran.r-project.org/package=relaimpo))
that provide this functionality specifically for regression analysis.

## Example 3: Effects of taxes and transfers on inequality

Another classic use case for the Shapley value is the decomposition of inequality indices
(see Shorrocks 2013 among others). Enami et al. (2018) provide a simple example to show such
a decomposition in the context of measuring the impact of taxes and transfers on income inequality.

Consider the following dataset `income`, showing the market incomes, taxes paid, transfers
received, and the resulting final incomes for five individuals:

```{r echo = FALSE}
income <- data.frame(
    MarketIncome = c(1, 20, 30, 40, 50),
    Tax = -5,
    Transfer = c(9, 7, 5, 3, 1),
    FinalIncome = c(5, 22, 30, 38, 46))
knitr::kable(income)
```

The Gini indices of the market and final incomes are:

```{r}
gini_market <- ineq::Gini(income[["MarketIncome"]])
gini_final <- ineq::Gini(income[["FinalIncome"]])
round(c(gini_market, gini_final, gini_final - gini_market), 3)
```

Taxes and transfers combined thus reduced inequality by about 0.057. There are now two different
approaches to dividing this difference among the two factors (i.e., taxes and transfers).
In what Sastre and Trannoy (2002) call the "zero income decomposition" (ZID), sources not under
consideration are set to zero. In the alternative scenario, "equalized income decomposition" (EID),
those sources are distributed evenly among the population. Both scenarios are easily implemented
using different value functions:

```{r}
zid <- function(factors, data) {
    cntf <- data[["MarketIncome"]]  # baseline for counterfactual income
    for (f in factors)
        cntf <- cntf + data[[f]]
    ineq::Gini(cntf)
}

eid <- function(factors, data) {
    cntf <- data[["MarketIncome"]]
    if ("Tax" %in% factors)
        cntf <- cntf + data[["Tax"]]
    else
        cntf <- cntf + mean(data[["Tax"]])
    if ("Transfer" %in% factors)
        cntf <- cntf + data[["Transfer"]]
    else
        cntf <- cntf + mean(data[["Transfer"]])
    ineq::Gini(cntf)
}
```

These equalities hold in both scenarios:

```{r}
zid(c(), income) == gini_market
zid(c("Tax", "Transfer"), income) == gini_final
eid(c(), income) == gini_market
eid(c("Tax", "Transfer"), income) == gini_final
```

Note that for EID, the first equality only holds because the sum of taxes and transfers is zero,
i.e., those two sources cancel each other out. Once this is no longer the case, the EID method
runs into problems (see Enami et al. for a detailed discussion).
In any case, ZID and EID give different answers when only one factor is included:

```{r}
zid("Tax", income)
eid("Tax", income)
```

This is because in the zero income scenario, transfers are set to zero when only taxes are
considered, while in the equalized income scenario, transfers are distributed equally among the
individuals. The Shapley values of the two scenarios are the following:

```{r}
shapley(zid, c("Tax", "Transfer"), silent = TRUE, data = income)
shapley(eid, c("Tax", "Transfer"), silent = TRUE, data = income)
```

Whether ZID or EID is appropriate depends on the context. Sastre and Trannoy (2002) and
Enami et al. (2018) address this question in further detail.

## Example 4: Hierarchical Shapley decomposition (Owen values)

Continuing from the previous example (and again borrowing from Enami et al.), consider the case
that the tax shown above is actually composed of two different taxes, `Tax1` and `Tax2` (these
two columns sum to the column `Tax` in the previous example):

```{r echo = FALSE}
income2 <- data.frame(
    MarketIncome = c(1, 20, 30, 40, 50),
    Tax1 = c(0, -1, -2, -3, -4),
    Tax2 = c(-5, -4, -3, -2, -1),
    Transfer = c(9, 7, 5, 3, 1),
    FinalIncome = c(5, 22, 30, 38, 46))
knitr::kable(income2)
```

When we now decompose this dataset (`income2`) by three factors, we get the following results:

```{r}
# we can reuse the `zid` function from above,
# while the `eid` function would need to be adapted
owen(zid, c("Tax1", "Tax2", "Transfer"), silent = TRUE, data = income2)
```

Note that the sum of the contributions of the two taxes does not equal the contribution for
the tax above, although this is just the sum of the two separate taxes. Furthermore, the size of the
transfer component is affected. As Enami et al. (2018, p. 108) write:

> Given that no new tax has been added and that the only change is
> that some additional information about the sources of taxes has been
> included in the analysis, it is inconvenient that the Shapley value for
> transfers has also changed.

This is a unfortunate property of the Shapley decomposition, but it can be partially remedied
by using a hierarchical procedure, the Owen value decomposition (Owen 1977). (An alternative
is the Nested Shapley decomposition recommended by Sastre and Trannoy (2002), which introduces
a new set of problems, though.) The `shapley` package allows the computation of Owen values by
specifying the group structure using a list of vectors:

```{r}
owen(zid, list(c("Tax1", "Tax2"), c("Transfer")), silent = TRUE, data = income2)
```

Using this notation, we have grouped `Tax1` and `Tax2` together in one group, while `Transfer` is a
group in itself. The results now line up with the results of the Shapley decomposition above,
where the taxes were jointly entered as a single factor.

Note that the hierarchical procedure can also be used
as an effective tool to increase the speed of computation when a large number of factors
is included. For instance, when 8 factors are considered, 8! = 40320 permutations need to be
calculated for each factor. Once the 8 factors are grouped into two groups with 4 factors each,
the number of permutations that need to be calculated for each factor is only 2! * 4! * 4! = 1152.


## References

Enami, A., N. Lustig, and R. Aranda. 2018.
    Analytic Foundations: Measuring the Redistributive Impact of Taxes and Transfers.
    In: N. Lustig (Ed.), _Commitment to Equity Handbook.
    Estimating the Impact of Fiscal Policy on Inequality and Poverty_,
    Washington, D.C.: Brookings, 56-115.

Owen, G. 1977. Values of Games with a Priori Unions.
    In: R. Henn and O. Moeschlin (Eds.),
    _Mathematical Economics and Game Theory_, Berlin and Heidelberg: Springer, 76-88.

Sastre, M. and A. Trannoy. 2002. Shapley Inequality Decomposition by Factor Components:
    Some Methodological Issues.
    _Journal of Economics_ 77(1): 51-89. <https://doi.org/10.1007/BF03052500>

Shapley, L. S. 1953. A value for n-person games.
    In: A. W. Tucker and H. W. Kuhn (Eds.),
    _Contributions to the theory of games (Vol. II)_,
    Princeton: Princeton University Press, 307–317.

Shorrocks, A. F. 2013.
    Decomposition procedures for distributional analysis:
    a unified framework based on the Shapley value. _Journal of Economic Inequality_ 11: 1-28.
    <https://doi.org/10.1007/s10888-011-9214-z>