Update Introduction.Rmd

CarlosPoses · May 28, 2024 · 7fba6b6 · 7fba6b6
1 parent 265f45e
commit 7fba6b6
Showing 1 changed file with 88 additions and 37 deletions.
diff --git a/vignettes/Introduction.Rmd b/vignettes/Introduction.Rmd
@@ -31,35 +31,25 @@ Throughout the vignette, we illustrate the functionality using the penguins data
 library(palmerpenguins)
 library(dplyr)
 library(tidyr)
+
 data <- penguins |> 
   select(starts_with("bill") | starts_with("flipper") | starts_with("body_mass") | year) |> 
   na.omit()
-  
-summary(data)
 
-summary(data)
-psych::describe(data)
+set.seed(444)
+
+# Numerator and denominator have equal distributions
+index <- sample(1:nrow(data), size = nrow(data) * 0.5)
+eq_numerator <- data[index,] 
+eq_denominator <- data[-index,]
 
+# Numerator and denominator have different distributions
 prob <- data |> 
   mutate(z = -9 + 0.0025*body_mass_g,
          prob = 1/(1 + exp(-z))) 
-cor(prob)
-set.seed(431)
 index <- sample(1:nrow(data), size = nrow(data) * 0.5, prob = prob$prob)
-index
-numerator <- data[index,]
-denominator <- data[-index, ]
-numerator
-denominator
-summary(numerator)
-summary(denominator)
-
-numerator <- data |> 
-  filter(body_mass_g > 4050) 
-denominator <- data |> 
-  filter(body_mass_g <= 4050)
-
-predict(densratio)
+dif_numerator <- data[index,]
+dif_denominator <- data[-index, ]
 ```
 
 
@@ -78,18 +68,15 @@ The package includes different methods for density ratio estimation. Each of the
 For instance, we can use the function `ulsif()`, which accounts for 'unconstrained least-squares importance fitting'. The function is called as follows:
 
 ```{r}
-densratio <- ulsif(numerator, denominator)
-plot(densratio, logscale = FALSE)
-hist(predict(densratio, newdata = numerator))
-hist(predict(densratio, newdata = denominator))
-debug(densityratio:::dr.histogram)
+densratio_diff <- ulsif(dif_numerator, dif_denominator)
+densratio_eq <- ulsif(eq_numerator, eq_denominator)
 ```
 If no extra parameters are provided, the function uses default values and, when possible, determines optimal parameters through cross-validation. Different methods are available for the resulting object.
 
 ```{r}
 summary(densratio_diff, test = TRUE)
 ```
-The summary method displays information regarding estimation (e.g., optimal sigma and lambda), the Pearson divergence (a measure of discrepancy between densities), and the test results. 
+The summary method displays information regarding estimation (e.g., optimal sigma and lambda), the Pearson divergence (a measure of discrepancy between densities), and the test results. The test results are interpreted as follows ...ADD.
 
 Other density ratio estimation methods are available, concretely: Kullback-Leibler importance estimation procedure (`kliep()`), ratio of estimated densities (`naive()`), ratio of estimated densities after dimension reduction (`naivesubspace()`), least-squares heterodistributional subspace search (`lhss()`), and 'spectral estimation?` (`spectral()`).
 
@@ -98,37 +85,101 @@ There is also a `print` method, and a `predict` method. The predict method can b
 ```{r}
 predict(densratio_diff, newdata = dif_numerator)
 print(densratio)
-plot(densratio)
 ```
 
 
 # Visualization methods
 
 The package includes different visualization methods to examine the density ratio results. The plots are built with `ggplot2`, and thus return a customizable `ggplot2` object. 
 
-The default `plot` method results in a histogram of the log of the density ratio values. The `plot()` method takes different arguments, so you can plot the densityratio for only some samples, not using the log scale, etc. You can check them with `?plot.ulsif`.
+The default `plot` method results in a histogram of the log of the density ratio values. The `plot()` method takes different arguments, so you can, for example, plot the density ratio for only some samples or use the original (not log-transformed) scale. You can check them with `?plot.ulsif`.
+
+The default plot is meant to give an overview of the similarity of the samples. For equal samples, the density ratio should be 1, and thus its log should be zero. Therefore, we expect the distribution to be centered around zero, with no major differences between the distributions of the numerator and denominator samples.
+
+```{r}
+plot(densratio_eq)
+```
 
-The function is meant to given an overview of the similarity of the samples. For equal samples, the density ratio should be 1, and thus its log should be zero. Therefore, we expect the distribution to be centered around zero, and with no major differences of the distribution between numerator and denominator samples. 
 
-In cases where the numerator and denominator samples have different densities, the density ratio values will be different from 1. We no longer expect the distribution to be centered at zero, and we expect to see differences in the distributions between the numerator and denominator samples (one having larger values than the other).
+In cases where the numerator and denominator samples have different densities, the density ratio values will be different from 1. We no longer expect the distribution to be centered at zero, and we expect to see differences in the distributions between the numerator and denominator samples (one having larger values than the other). Notice the change in the scale of the x-axis between the previous plot and the next one.
 ´
 ```{r}
-plot(densratio)
+plot(densratio_diff)
 ```
-There are two other visualization functions, `plot_univariate()` and `plot_bivariate()`. The `plot_univariate()` function plots the value of the density ratio for each variable separately, againas the density ratio. The function plot_
+There are two other visualization functions, `plot_univariate()` and `plot_bivariate()`. The `plot_univariate()` function plots each variable separately against the density ratio values, so we can see whether there are any patterns. For instance, we can plot the density ratio against the values of body mass.
 
-while the `plot_bivariate()` function plots the value of the density ratio for pairs of variables. 
+There is a very clear non-linear pattern: in general, lower values of body mass have lower density ratio estimates. A pattern this clear is unlikely in real data, but it is what we would expect here: it reflects exactly how the probability of belonging to the numerator data was made to depend on body mass.
 
+```{r}
+plot_univariate(densratio_diff, vars = "body_mass_g")
+```
+We can also produce the same plot for all variables at once, which shows that some (milder) relationships are present with respect to bill depth and, especially, bill length and flipper length. This is not surprising, because these variables are correlated with body mass. Other plotting options (e.g., making separate facets for denominator and numerator samples, plotting only subsets of the data) are possible.
 
-The first one plots the density ratio values for each variable separately, while the second one plots the density ratio values for pairs of variables. 
-
 
+```{r}
+plot_univariate(densratio_diff, grid = TRUE) 
+```
 
-most important visualization method is the `plot` method, which can be used to plot the density ratio values. The plot method can be used to plot the density ratio values for the observed data, or for new data.
+
+On the other hand, the `plot_bivariate()` function plots the value of the density ratio for pairs of variables. We need to select which variables we want on the x-axis (`vars1`) and which ones we want on the y-axis (`vars2`). The function draws a scatterplot for each combination of variables and colors the points according to the density ratio values.
+
+```{r}
+plot_bivariate(densratio_diff, vars1 = c("bill_length_mm", "bill_depth_mm", "flipper_length_mm"), vars2 = c("flipper_length_mm", "body_mass_g", "bill_depth_mm"), grid = TRUE)
+```
+The bivariate plot gives a nice overview of how the density ratio varies in multivariate space. For instance, it is clear that lower values of body mass are associated with lower density ratios (i.e., these values are more likely to appear in the denominator data). There is also an interesting relationship between bill depth and the density ratio. For low values of bill depth, the log of the density ratio is 1, meaning these observations are more likely to come from the numerator data. The relationship changes depending on the values of the other variables, etc.
 
 
 # Use case: covariate shift adaptation
 
-https://archive.ics.uci.edu/dataset/1/abalone
+```{r}
+library(randomForest)
+library(gbm)
+
+data <- 
+  penguins |> 
+  select(starts_with("bill") | starts_with("flipper") | starts_with("body_mass") | year | species) |> 
+  na.omit()
+
+train <- data[index,]
+test <- data[-index,]
+
+weights <- predict(densratio_diff, newdata = dif_numerator)[,,1]
+
+
+
+# Logistic model: unweighted fit first, then reweighted with the density ratio
+glm_train <- glm(formula = species ~ . ,
+    data = train, 
+    family = binomial)
+
+table(predict(glm_train, test, type = "response") > 0.5, test$species)
+
+glm_train_reweighted <- glm(formula = species ~ . ,
+    data = train, 
+    family = binomial,
+    weights = weights)
+
+table(predict(glm_train_reweighted, test, type = "response") > 0.5, test$species)
+
+```
+
+
+```{r}
+
+weights
+
+randomforest <- randomForest(species ~ . , train)
+predict(randomforest, test)
+
+# confusion matrix
+table(predict(randomforest, test), test$species)
+
+
+randomForest(formula = species ~ . ,
+             data = train, 
+             classwt = weights)
+```
+
+
 
 lm, randomForest, gbm