---
title: "Model Building"
output:
  html_document:
    code_folding: hide
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
options(scipen = 999)
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  fig.width = 12
)
## load packages
library(tidyverse)
library(corrplot)
library(glmnet)
library(modelr)
```
<br>
# Overview
We summarize World Cup statistics from 1990 to 2018 for all the countries that made it to the World Cup, to help us make predictions about winning in the 2022 World Cup.
We tried several approaches to model selection, including stepwise selection, forward and backward selection, and LASSO, with different selection criteria such as AIC, BIC, and p-values. After comparing the RMSE, R-squared, and p-values, we decided to use the model chosen by forward selection based on p-values.
Because two variables fell on the borderline of our threshold (alpha = 0.05), we built three candidate models that include neither, one, or both of these variables. We then used cross-validation to check which model performs best. After comparing their RMSE, we arrived at the final model: `w = -1.554369 + 0.661219*pld - 0.622216*d + 0.015920*rank + 0.153794*gf - 0.225378*ga`.
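Since `glmnet` is loaded in the setup chunk but not used elsewhere in this file, the chunk below gives a minimal sketch of what the LASSO comparison mentioned above might look like. It is illustrative only: the predictor set and the object names (`wc_data`, `lasso_cv`) are our assumptions, not the exact specification used in the project's selection step.
```{r}
# Minimal LASSO sketch (illustrative; predictor set chosen for demonstration).
# wc_data is just a local copy of the same CSV that is loaded again below.
wc_data <- read.csv("./data/worldcup_final.csv") %>% na.omit()

x <- model.matrix(w ~ pld + d + rank + gf + ga + part + goals, data = wc_data)[, -1]
y <- wc_data$w

lasso_cv <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 gives the LASSO penalty
coef(lasso_cv, s = "lambda.min")         # coefficients at the CV-optimal lambda
```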
<br>
# Potential predictors
<br>
#### VARIABLES AND DEFINITIONS
`country`: Individual country (represents national soccer team)
`part`: Number of times each country has participated in the World Cup (all time)
`pld`: Number of games played in the World Cup (since 1990)
`w`: Number of soccer games a country has won in the World Cup (since 1990)
`d`: Number of soccer games a country has drawn in the World Cup (since 1990)
`l`: Number of soccer games a country has lost in the World Cup (since 1990)
`gf`: Number of goals a country has scored against opponents in the World Cup (since 1990)
`ga`: Number of goals scored against a country in the World Cup (since 1990)
`gd`: Goal difference of country (goals scored by country minus goals scored against country)
`pts`: Number of points accumulated by country in the World Cup (since 1990)
`rank`: Country's official FIFA ranking (2022 rankings)
`player`: Name of top record goal scorer
`goals`: Number of goals scored by top record goal scorer
`land_area_km`: Total area of the land-based portions of a country’s geography (measured in square kilometers, km²)
`confederation`: Country's FIFA confederation
<br>
# Correlation Plot
The dataset has 14 variables and 79 observations. To check for collinearity, we use a correlation plot to explore the relationship between each pair of numeric predictors. To include only the numeric variables, we use the `select` function to exclude `player` and `country`, as well as `gd`, which is simply `gf` minus `ga`. Deeper colors in the plot indicate stronger correlations between pairs of variables.
```{r}
# read the cleaned World Cup data, drop rows with missing values,
# and remove the country name column
data <- read.csv("./data/worldcup_final.csv") %>% 
  na.omit() %>% 
  select(-country)

# keep only the numeric columns (dropping gd and player) for the correlation matrix
num <- data[2:14] %>% 
  select(-gd, -player)

corrplot(cor(num), diag = FALSE)
```
<br>
# Cross-validation
We decided to use forward selection for our final model, where we start with only the intercept and add one predictor at a time. The criterion we chose is the p-value: at each step we add the predictor with the smallest p-value, stopping once no remaining predictor has a p-value below 0.05.
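The chunk below sketches one way to carry out this kind of p-value-based forward selection in base R with `add1()`. It is our reconstruction for illustration, not necessarily the code used to select the predictors reported here, and the candidate pool is an assumption.
```{r}
# Illustrative reconstruction of p-value-based forward selection.
# l and pts are dropped from the candidate pool here because they are
# (near-)deterministic functions of wins and draws, which would make
# p-value-based selection degenerate; this exclusion is our choice.
candidates <- setdiff(names(num), c("w", "l", "pts"))
current <- lm(w ~ 1, data = num)
while (length(candidates) > 0) {
  tab <- add1(current, scope = reformulate(candidates), test = "F")
  pvals <- setNames(tab[["Pr(>F)"]][-1], rownames(tab)[-1])  # drop the <none> row
  if (all(is.na(pvals)) || min(pvals, na.rm = TRUE) >= 0.05) break
  best <- names(which.min(pvals))                            # smallest p-value wins
  current <- update(current, as.formula(paste(". ~ . +", best)))
  candidates <- setdiff(candidates, best)
}
summary(current)
```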
Because two variables fell on the borderline of our threshold (alpha = 0.05), we created three candidate models that include neither, one, or both of these variables. We then used cross-validation to check which model has the best performance.
```{r}
cv_df <- 
  crossv_mc(data, 100) %>% 
  # convert the resample objects to tibbles so lm() and rmse() can use them
  mutate(
    train = map(train, as_tibble),
    test = map(test, as_tibble)) %>% 
  # fit the three candidate models on each training split
  mutate(
    fit1 = map(train, ~lm(w ~ pld + d + rank, data = .x)),
    fit2 = map(train, ~lm(w ~ pld + d + rank + gf, data = .x)),
    fit3 = map(train, ~lm(w ~ pld + d + rank + gf + ga, data = .x))) %>% 
  # compute each model's RMSE on the corresponding test split
  mutate(
    rmse_fit1 = map2_dbl(fit1, test, ~rmse(model = .x, data = .y)),
    rmse_fit2 = map2_dbl(fit2, test, ~rmse(model = .x, data = .y)),
    rmse_fit3 = map2_dbl(fit3, test, ~rmse(model = .x, data = .y)))

cv_df %>% 
  select(starts_with("rmse")) %>% 
  # reshape to long format: one row per (model, rmse) pair
  pivot_longer(
    everything(),
    names_to = "model", 
    values_to = "rmse",
    names_prefix = "rmse_") %>% 
  mutate(model = fct_inorder(model)) %>% 
  ggplot(aes(x = model, y = rmse)) + 
  geom_violin() +
  ylab("RMSE Value") + 
  xlab("Models") +
  labs(title = "Violin Plot of RMSE Values for Three Models")
```
From the violin plot, we can see that model 3, the model chosen by forward selection, has the lowest RMSE. The first model has only three variables: pld, d, and rank. The second model has four variables: pld, d, rank, and gf. The third model has five variables: pld, d, rank, gf, and ga. After comparing their RMSE, we arrive at the final model: `w = -1.554369 + 0.661219*pld - 0.622216*d + 0.015920*rank + 0.153794*gf - 0.225378*ga`.
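As a quick numeric companion to the plot, the average test RMSE of each model across the 100 cross-validation splits can be computed directly from `cv_df` (an optional check, not part of the original write-up):
```{r}
# mean test RMSE per model over the 100 cross-validation splits
cv_df %>% 
  summarise(across(starts_with("rmse"), mean))
```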
<br>
# Compare the Summaries of the Models
<br>
We use the `summary` function to compare the adjusted R^2 values and the p-values of the predictors.
```{r}
# fit the three candidate models on the full dataset
model1 <- lm(w ~ pld + d + rank, data = num)
model2 <- lm(w ~ pld + d + rank + gf, data = num)
model3 <- lm(w ~ pld + d + rank + gf + ga, data = num)

summary(model1)
summary(model2)
summary(model3)
```
<br>
Looking at the summaries, we can see that model 3 has the highest adjusted R^2, 99.3%, which means our model explains over 99% of the variance in the outcome (wins). This result is consistent with what we found in the cross-validation. Therefore, our final model includes the 5 variables pld, d, rank, gf, and ga.
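For a compact side-by-side view, the adjusted R^2 values can also be pulled directly out of the fitted objects; this small table is just a convenience summary of the `summary()` output above.
```{r}
# adjusted R^2 of each candidate model, extracted from the fits above
tibble(
  model  = c("model1", "model2", "model3"),
  adj_r2 = c(summary(model1)$adj.r.squared,
             summary(model2)$adj.r.squared,
             summary(model3)$adj.r.squared)
)
```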
<br>
# Results/output from regression model
Our final model is `w = -1.554369 + 0.661219*pld - 0.622216*d + 0.015920*rank + 0.153794*gf - 0.225378*ga`.
After comparing the summaries of the three models, we found that the model with the variables pld, d, rank, gf, and ga has the lowest RMSE and the highest adjusted R^2 (about 99%), which means our model explains over 99% of the variance in the outcome (wins). The most significant variable in the model is ga, with a coefficient of -0.225378: if the number of goals scored against a country in the World Cup (since 1990) increases by 1, the expected number of wins decreases by 0.225378, holding the other variables constant.
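To show how the final model can be used, the chunk below predicts wins for a hypothetical team; the input values (20 matches played, 5 draws, FIFA rank 10, 30 goals for, 20 goals against) are made up for illustration and do not describe any real country.
```{r}
# predicted wins for a hypothetical team (made-up inputs, illustration only)
new_team <- tibble(pld = 20, d = 5, rank = 10, gf = 30, ga = 20)
predict(model3, newdata = new_team)

# the same value computed by hand from the fitted coefficients
# (order: intercept, pld, d, rank, gf, ga)
sum(coef(model3) * c(1, 20, 5, 10, 30, 20))
```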