Calibration
Let's perform a simple calibration on margins using icarus and the example dataset data_employees. First, load icarus:
library(icarus)
The example dataset is loaded along with icarus:
> data_employees
    id department category sex salary movies weight
1  a01          1        1   1   1000      1     10
2  a02          1        2   2   1100      2     10
3  a03          2        2   2   1500      4     10
4  a04          2        3   1   2300     15     10
5  a05          2        1   1   1000      2     10
6  a06          1        1   2    500      3     10
7  a07          2        2   2   1000      1     10
8  b01          1        3   2   2000      0     20
9  b02          1        1   1   2100      0     20
10 b03          2        2   1   2000      3     20
11 b04          2        1   2   3200      6     20
12 b05          1        1   2   1800      0     20
13 b06          1        2   1   2800      0     20
14 b07          1        3   1   1100      1     20
15 b08          2        1   2   2500      1     20
Our example dataset is the result of a survey conducted among the 300 employees of a firm, in order to measure how many times per month employees go to the movies (quantitative variable, column movies).
Sampling weights are given in the column weight. Auxiliary variables here are:
- department, the department in which the employee works (categorical, 2 modalities)
- category, the hierarchical level of the employee in the company (categorical, 3 modalities)
- sex, sex of the employee (categorical, 2 modalities)
- salary, salary of the employee (quantitative)
The total number of employees in the firm is 300:
N <- 300
The mean number of times employees go to the movies each month can be estimated using the Horvitz-Thompson estimator:
> 1/N * HTtotal(data_employees$movies, data_employees$weight)
1.666667
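For readers who want to see what this estimate amounts to, the following base-R sketch recomputes it by hand. It assumes that HTtotal returns the weighted sum of the variable (each value times its weight), which is the usual definition of the Horvitz-Thompson total. The vectors are copied from the table above, and the names ht_total and ht_mean are illustrative.

```r
# Values copied from the movies and weight columns of data_employees
movies <- c(1, 2, 4, 15, 2, 3, 1, 0, 0, 3, 6, 0, 0, 1, 1)
weight <- c(rep(10, 7), rep(20, 8))
N <- 300

# Assumed equivalent of HTtotal(): weighted sum of the variable
ht_total <- sum(movies * weight)

# Estimated mean number of movie outings per month
ht_mean <- ht_total / N
print(ht_mean)  # 1.666667
```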
The goal of using calibration on auxiliary variables (margins) is to improve the Horvitz-Thompson estimate by using known totals of these auxiliary variables. In this case, we know the number of employees in each category for the categorical auxiliary variables:
- category: 80 (modality 1) ; 90 (modality 2) ; 60 (modality 3)
- sex: 140 (modality 1) ; 90 (modality 2)
- department: 100 (modality 1) ; 130 (modality 2)
We also know the total salary paid by the company: 470000.
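To see why calibration is useful here, one can compare these known totals with the totals estimated from the initial sampling weights. The base-R sketch below does this with columns copied from the table above; the est_* names are illustrative and not part of icarus.

```r
# Auxiliary columns copied from data_employees
department <- c(1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2)
category   <- c(1, 2, 2, 3, 1, 1, 2, 3, 1, 2, 1, 1, 2, 3, 1)
sex        <- c(1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 2)
salary     <- c(1000, 1100, 1500, 2300, 1000, 500, 1000,
                2000, 2100, 2000, 3200, 1800, 2800, 1100, 2500)
weight     <- c(rep(10, 7), rep(20, 8))

# Totals estimated with the initial sampling weights
est_category   <- tapply(weight, category, sum)    # 110  70  50 (known: 80, 90, 60)
est_sex        <- tapply(weight, sex, sum)         # 110 120     (known: 140, 90)
est_department <- tapply(weight, department, sum)  # 130 100     (known: 100, 130)
est_salary     <- sum(weight * salary)             # 434000      (known: 470000)
```

The estimated totals differ from the known margins; calibration adjusts the weights so that they match.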
To compute the calibration estimator, we create the margin matrix, which contains the totals of the auxiliary variables. The format of the margin matrix is very similar to the margin table in the SAS macro "Calmar":
## Calibration margins
mar1 <- c("category",3,80,90,60)
mar2 <- c("sex",2,140,90,0)
mar3 <- c("department",2,100,130,0)
mar4 <- c("salary", 0, 470000,0,0)
margins <- rbind(mar1, mar2, mar3, mar4)
wCalesRaking <- calibration(data=data_employees, marginMatrix=margins, colWeights="weight"
, method="raking", description=FALSE)
Using the parameter description=TRUE would output statistics on the calibration method (distribution of initial vs. calibrated weights).
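As an aside on the margin matrix format, the sketch below inspects the object built above using base R only. The reading of the columns (variable name, number of modalities with 0 for a quantitative variable, then the totals padded with zeros) is inferred from the example rather than taken from the package documentation, and the names n_modalities and category_totals are illustrative.

```r
mar1 <- c("category", 3, 80, 90, 60)
mar2 <- c("sex", 2, 140, 90, 0)
mar3 <- c("department", 2, 100, 130, 0)
mar4 <- c("salary", 0, 470000, 0, 0)
margins <- rbind(mar1, mar2, mar3, mar4)

# Mixing strings and numbers in c() coerces everything to character,
# so the result is a 4 x 5 character matrix
dim(margins)            # 4 5
is.character(margins)   # TRUE

# Recover the numeric totals for one variable
n_modalities <- as.integer(margins["mar1", 2])
category_totals <- as.numeric(margins["mar1", 2 + seq_len(n_modalities)])
category_totals         # 80 90 60
```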
In our example, the calibrated weights are stored in vector wCalesRaking. We can now compute the calibration estimator:
> 1/N * HTtotal(data_employees$movies, wCalesRaking)
2.471917
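To give an intuition for what the raking method does, here is a minimal base-R sketch of classical raking (iterative proportional fitting) restricted to the three categorical margins. It is not icarus's implementation: the quantitative salary margin is left out, so the resulting weights will not reproduce the estimate above. The names aux, targets and max_gap are illustrative.

```r
# Categorical auxiliary columns and initial weights copied from data_employees
aux <- data.frame(
  category   = c(1, 2, 2, 3, 1, 1, 2, 3, 1, 2, 1, 1, 2, 3, 1),
  sex        = c(1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 2),
  department = c(1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2)
)
w <- c(rep(10, 7), rep(20, 8))

# Known totals, indexed by modality
targets <- list(category = c(80, 90, 60), sex = c(140, 90), department = c(100, 130))

# Hypothetical helper: largest absolute gap between weighted and known totals
max_gap <- function(w) {
  max(sapply(names(targets), function(v) {
    max(abs(as.numeric(tapply(w, aux[[v]], sum)) - targets[[v]]))
  }))
}

for (iter in 1:1000) {
  for (v in names(targets)) {
    current <- as.numeric(tapply(w, aux[[v]], sum))  # weighted total per modality
    ratio <- targets[[v]] / current                  # adjustment factor per modality
    w <- w * ratio[aux[[v]]]                         # rescale each unit's weight
  }
  if (max_gap(w) < 1e-8) break                       # stop once all margins match
}

max_gap(w)  # close to 0 once the loop has converged
```

Each pass rescales the weights so that one margin matches exactly, and cycling through the margins brings all of them into agreement.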
Just like in Calmar, it is also possible to enter margins as percentages, here written as proportions adding up to 1 (which is convenient when dealing with huge numbers, for example on large populations):
mar1_2 <- c("category",3,0.35,0.4,0.25)
mar2_2 <- c("sex",2,0.60,0.40,0)
mar3_2 <- c("department",2,0.45,0.55,0)
mar4_2 <- c("salary", 0, 470000,0,0)
margins_2 <- rbind(mar1_2, mar2_2, mar3_2, mar4_2)
In this case, set parameter pct to TRUE when performing calibration:
wCalRakingPct <- calibration(data=data_employees, marginMatrix=margins_2, colWeights="weight"
, method="logit", description=FALSE, bounds=c(0.4,2.2), pct=TRUE
, popTotal=N)
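Proportions relate to counts through the population total. The sketch below does that conversion by hand, assuming that with pct=TRUE the categorical margins are scaled by popTotal (in the example, the quantitative salary margin stays a plain total). The helper name to_counts is hypothetical and not part of icarus.

```r
N <- 300

# Hypothetical helper: turn proportions into population counts
to_counts <- function(proportions, pop_total) {
  stopifnot(isTRUE(all.equal(sum(proportions), 1)))  # proportions must add up to 1
  proportions * pop_total
}

to_counts(c(0.35, 0.40, 0.25), N)  # category:   105 120  75
to_counts(c(0.60, 0.40), N)        # sex:        180 120
to_counts(c(0.45, 0.55), N)        # department: 135 165
```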
As of version 0.3.0, icarus also supports margins written as percentages adding up to 100:
mar1_3 <- c("category",3,35,40,25)
mar2_3 <- c("sex",2,60,40,0)
mar3_3 <- c("department",2,45,55,0)
mar4_3 <- c("salary", 0, 470000,0,0)
margins_3 <- rbind(mar1_3, mar2_3, mar3_3, mar4_3)
wCalRakingPct <- calibration(data=data_employees, marginMatrix=margins_3, colWeights="weight"
, method="logit", description=FALSE, bounds=c(0.4,2.2), pct=TRUE
, popTotal=N)
Distances other than the raking ratio are implemented in icarus. For example, we can use the logit method, which allows us to set bounds on the ratio of calibrated weights to initial weights:
wCalesLogit1 <- calibration(data=data_employees, marginMatrix=margins, colWeights="weight"
, method="logit", description=FALSE, bounds=c(0.4,2.2))
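The effect of the bounds can be illustrated with the logit calibration function as it is usually written in the calibration literature (Deville and Särndal). The sketch below assumes this textbook form; it is not taken from icarus's source, and the name logit_ratio is illustrative. For bounds L < 1 < U, the function maps any value of the linear predictor u to a weight ratio strictly between L and U, and equals 1 at u = 0.

```r
# Textbook logit calibration function (assumed form, not icarus's code)
logit_ratio <- function(u, L, U) {
  A <- (U - L) / ((1 - L) * (U - 1))
  (L * (U - 1) + U * (1 - L) * exp(A * u)) / ((U - 1) + (1 - L) * exp(A * u))
}

u <- seq(-5, 5, by = 0.5)
g <- logit_ratio(u, L = 0.4, U = 2.2)

range(g)                            # stays inside (0.4, 2.2)
logit_ratio(0, L = 0.4, U = 2.2)    # 1: no adjustment when u = 0
```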