---
title: "Model Building"
output:
  html_document:
    code_folding: hide
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
options(scipen = 999)
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  fig.width = 12
)
## load packages
library(tidyverse)
library(corrplot)
library(glmnet)
library(modelr)
```
<br>
# Overview
We summarize World Cup statistics from 1990 to 2018 for all the countries that made it to the World Cup, to help us make predictions about winning in the 2022 World Cup.
We tried several approaches to model selection, including stepwise selection, forward and backward selection, and LASSO, with different selection criteria such as AIC, BIC, and p-values. After comparing the RMSE, R-squared, and p-values, we decided to use the model chosen by forward selection based on p-values.
Because two variables fell on the borderline of our threshold (alpha = 0.05), we built three candidate models that include neither, one, or both of these variables. We then used cross-validation to check which model performs best. After comparing their RMSE, we arrived at the final model: `w = -1.554369 + 0.661219*pld - 0.622216*d + 0.015920*rank + 0.153794*gf - 0.225378*ga`.
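Since `glmnet` is loaded in the setup chunk but not used elsewhere in this file, the chunk below gives a minimal sketch of what the LASSO comparison mentioned above might look like. It is illustrative only: the predictor set and the object names (`wc_data`, `lasso_cv`) are our assumptions, not the exact specification used in the project's selection step.
```{r}
# Minimal LASSO sketch (illustrative; predictor set chosen for demonstration).
# wc_data is just a local copy of the same CSV that is loaded again below.
wc_data <- read.csv("./data/worldcup_final.csv") %>% na.omit()

x <- model.matrix(w ~ pld + d + rank + gf + ga + part + goals, data = wc_data)[, -1]
y <- wc_data$w

lasso_cv <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 gives the LASSO penalty
coef(lasso_cv, s = "lambda.min")         # coefficients at the CV-optimal lambda
```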
<br>
# Potential predictors
<br>
#### VARIABLES AND DEFINITIONS
`country`: Individual country (represents national soccer team)
`part`: Number of times each country has participated in the World Cup (all time)
`pld`: Number of games played in the World Cup (since 1990)
`w`: Number of soccer games a country has won in the World Cup (since 1990)
`d`: Number of soccer games a country has drawn in the World Cup (since 1990)
`l`: Number of soccer games a country has lost in the World Cup (since 1990)
`gf`: Number of goals a country has scored against opponents in the World Cup (since 1990)
`ga`: Number of goals scored against a country in the World Cup (since 1990)
`gd`: Goal difference of country (goals scored by country minus goals scored against country)
`pts`: Number of points accumulated by country in the World Cup (since 1990)
`rank`: Country's official FIFA ranking (2022 rankings)
`player`: Name of top record goal scorer
`goals`: Number of goals scored by top record goal scorer
`land_area_km`: Total area of the land-based portions of a country’s geography (measured in square kilometers, km²)
`confederation`: Country's FIFA confederation
<br>
# Correlation Plot
The dataset has 14 variables and 79 observations. To check for collinearity, we use a correlation plot to explore the relationship between each pair of numeric predictors. To include only the numeric variables, we use the `select` function to exclude `player` and `country`, as well as `gd`, which is simply `gf` minus `ga`. Deeper colors in the plot indicate stronger correlations between pairs of variables.
```{r}
# read the cleaned World Cup data, drop rows with missing values,
# and remove the country name column
data <- read.csv("./data/worldcup_final.csv") %>% 
  na.omit() %>% 
  select(-country)

# keep only the numeric columns (dropping gd and player) for the correlation matrix
num <- data[2:14] %>% 
  select(-gd, -player)

corrplot(cor(num), diag = FALSE)
```
<br>
# Cross-validation
We decided to use forward selection for our final model, where we start with only the intercept and add one predictor at a time. The criterion we chose is the p-value: at each step we add the predictor with the smallest p-value, stopping once no remaining predictor has a p-value below 0.05.
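The chunk below sketches one way to carry out this kind of p-value-based forward selection in base R with `add1()`. It is our reconstruction for illustration, not necessarily the code used to select the predictors reported here, and the candidate pool is an assumption.
```{r}
# Illustrative reconstruction of p-value-based forward selection.
# l and pts are dropped from the candidate pool here because they are
# (near-)deterministic functions of wins and draws, which would make
# p-value-based selection degenerate; this exclusion is our choice.
candidates <- setdiff(names(num), c("w", "l", "pts"))
current <- lm(w ~ 1, data = num)
while (length(candidates) > 0) {
  tab <- add1(current, scope = reformulate(candidates), test = "F")
  pvals <- setNames(tab[["Pr(>F)"]][-1], rownames(tab)[-1])  # drop the <none> row
  if (all(is.na(pvals)) || min(pvals, na.rm = TRUE) >= 0.05) break
  best <- names(which.min(pvals))                            # smallest p-value wins
  current <- update(current, as.formula(paste(". ~ . +", best)))
  candidates <- setdiff(candidates, best)
}
summary(current)
```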
Because two variables fell on the borderline of our threshold (alpha = 0.05), we created three candidate models that include neither, one, or both of these variables. We then used cross-validation to check which model has the best performance.
```{r}
cv_df <- 
  crossv_mc(data, 100) %>% 
  # convert the resample objects to tibbles so lm() and rmse() can use them
  mutate(
    train = map(train, as_tibble),
    test = map(test, as_tibble)) %>% 
  # fit the three candidate models on each training split
  mutate(
    fit1 = map(train, ~lm(w ~ pld + d + rank, data = .x)),
    fit2 = map(train, ~lm(w ~ pld + d + rank + gf, data = .x)),
    fit3 = map(train, ~lm(w ~ pld + d + rank + gf + ga, data = .x))) %>% 
  # compute each model's RMSE on the corresponding test split
  mutate(
    rmse_fit1 = map2_dbl(fit1, test, ~rmse(model = .x, data = .y)),
    rmse_fit2 = map2_dbl(fit2, test, ~rmse(model = .x, data = .y)),
    rmse_fit3 = map2_dbl(fit3, test, ~rmse(model = .x, data = .y)))

cv_df %>% 
  select(starts_with("rmse")) %>% 
  # reshape to long format: one row per (model, rmse) pair
  pivot_longer(
    everything(),
    names_to = "model", 
    values_to = "rmse",
    names_prefix = "rmse_") %>% 
  mutate(model = fct_inorder(model)) %>% 
  ggplot(aes(x = model, y = rmse)) + 
  geom_violin() +
  ylab("RMSE Value") + 
  xlab("Models") +
  labs(title = "Violin Plot of RMSE Values for Three Models")
```
From the violin plot, we can see that model 3, the model chosen by forward selection, has the lowest RMSE. The first model has only three variables: pld, d, and rank. The second model has four variables: pld, d, rank, and gf. The third model has five variables: pld, d, rank, gf, and ga. After comparing their RMSE, we arrive at the final model: `w = -1.554369 + 0.661219*pld - 0.622216*d + 0.015920*rank + 0.153794*gf - 0.225378*ga`.
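As a quick numeric companion to the plot, the average test RMSE of each model across the 100 cross-validation splits can be computed directly from `cv_df` (an optional check, not part of the original write-up):
```{r}
# mean test RMSE per model over the 100 cross-validation splits
cv_df %>% 
  summarise(across(starts_with("rmse"), mean))
```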
<br>
# Compare the Summaries of the Models
<br>
We use the `summary` function to compare the adjusted R^2 values and the p-values of the predictors.
```{r}
# fit the three candidate models on the full dataset
model1 <- lm(w ~ pld + d + rank, data = num)
model2 <- lm(w ~ pld + d + rank + gf, data = num)
model3 <- lm(w ~ pld + d + rank + gf + ga, data = num)

summary(model1)
summary(model2)
summary(model3)
```
<br>
Looking at the summaries, we can see that model 3 has the highest adjusted R^2, 99.3%, which means our model explains over 99% of the variance in the outcome (wins). This result is consistent with what we found in the cross-validation. Therefore, our final model includes the 5 variables pld, d, rank, gf, and ga.
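For a compact side-by-side view, the adjusted R^2 values can also be pulled directly out of the fitted objects; this small table is just a convenience summary of the `summary()` output above.
```{r}
# adjusted R^2 of each candidate model, extracted from the fits above
tibble(
  model  = c("model1", "model2", "model3"),
  adj_r2 = c(summary(model1)$adj.r.squared,
             summary(model2)$adj.r.squared,
             summary(model3)$adj.r.squared)
)
```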
<br>
# Results/output from regression model
Our final model is `w = -1.554369 + 0.661219*pld - 0.622216*d + 0.015920*rank + 0.153794*gf - 0.225378*ga`.
After comparing the summaries of the three models, we found that the model with the variables pld, d, rank, gf, and ga has the lowest RMSE and the highest adjusted R^2 (about 99%), which means our model explains over 99% of the variance in the outcome (wins). The most significant variable in the model is ga, with a coefficient of -0.225378: if the number of goals scored against a country in the World Cup (since 1990) increases by 1, the expected number of wins decreases by 0.225378, holding the other variables constant.
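To show how the final model can be used, the chunk below predicts wins for a hypothetical team; the input values (20 matches played, 5 draws, FIFA rank 10, 30 goals for, 20 goals against) are made up for illustration and do not describe any real country.
```{r}
# predicted wins for a hypothetical team (made-up inputs, illustration only)
new_team <- tibble(pld = 20, d = 5, rank = 10, gf = 30, ga = 20)
predict(model3, newdata = new_team)

# the same value computed by hand from the fitted coefficients
# (order: intercept, pld, d, rank, gf, ga)
sum(coef(model3) * c(1, 20, 5, 10, 30, 20))
```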