ARIMA-models.Rmd

---
title: "Time Series Analysis - ARIMA Models"
author: "Abdullahi Hussein"
date: "03/09/2021"
output:
  pdf_document: default
  html_document:
    df_print: paged
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Exploring the data
In this article, we are going to use the USA quarterly GDP available in **astsa** library in R. 
We will start with loading the required libraries and the data.
```{r}
# if library not installed use 
# install.packages("astsa")
# loading the library
library(astsa) 
#loading the data
data(gdp)
#plotting the data. This is a time series data so we use the appropriate plots
plot.ts(gdp, col = "darkblue")
```

The the above plot we see that Y-axis values are two large. This means,  our data is skewed. Since, the data is 
skewed and all positive, we can use log transformation.

```{r}
# Takaing the log of the data
log_gdp = log(gdp)
# plotting the logged data
plot.ts(log_gdp, xlab = "Yearl Quearters", ylab = "Logged Queartly GDP",
        main = "Us logged Quarterly GDP from 1950 to 2020", col = "darkorange")
```

We can clearly see from the above plot, that the data has an upward trend. Therefore, the data is not stationary. To confirm this, 
we will plot the Auto-correlation function of the data. 

```{r}
# plotting the ACF of logged gdp data
acf(log_gdp, lag.max = 54, main = "ACF of Logged GDP")
```

The sample ACF, $\hat \rho(h)$ does to decay to zero fast enough as h, which is the time lag increases. 
This suggest the use of differencing transformation to remove non-stationarity of the data.

```{r}
# Taking the first difference of the logged data and plotting it.
plot(diff(log_gdp), xlab = "Yearly Quarters", ylab = "differenced logged gdp",
     main = "First difference of logged DGP data", col = "darkblue")
# Adding a line 
abline(h = 0.01, col = "red")
```

From the plot of the differenced data, we can see the trend element of the data has been removed, and the data seems stationary.

We can plot the ACF and the PACF (Partial Auto-correlation function) of the differenced data to decide what models need to be fitted.

```{r}
# plots of ACF and PACF of differenced data
P_acf = acf2(diff(log_gdp), max.lag = 54, main = "First difference of logged gdp")
```

From the two plots above, it seems that PACF cuts of at lag 1 and ACF tails off at lag1. This 
suggests ARIMA(1, 1, 0) model for the logged data. We can also argue that  ACF cuts of at lag 2 and
PACF tails off at lag 2. This suggest ARIMA(0, 1, 2) for the logged data.

## Fitting ARIMA models
Our initial exploration suggested two different ARIMA models. we are going to fit both of them 
and decide which one explains the data lag the best. 

```{r,message=FALSE}
# Fitting ARIMA(1,1, 0). Note, this is an Autogressive model with lag 1.
AR1 = sarima(log_gdp, 1, 1, 0,)

```

We can see that **Standardized Residual** plot looks like the following white noise plot. This 
means, there is no indication of non constant variance.
```{r}
white_noise = rnorm(150,0, 0.1)
plot.ts(white_noise)
```

ACF residuals are within the blue lines. This suggest, there is no serial correlation of the 
residuals at different lags. We can see almost all of ACF residuals fall within 
$2/(\sqrt{n})$ where $n$ is th lenght of our dataset. This is given by 
`r 2/sqrt(length(log_gdp))`

From Normal Q-Q plot of the std Residuals, we see that most of the residuals are normally 
distributed except may be few outliers at the two ends. 

However, We can see that most of the p-value for **Ljung-Box statistic**  are within a significance 
level of 5%. Therefore, we will not reject the hypothesis that the residuals have some serial 
correlation (if most of the p-vluess are on or below the blue line, there is indication of 
autocorrelation).

```{r}
#coefficents and their p-values
AR1$ttable
```

We can see that the p-value fo the coefficiens and the constant term are both 
significant. 

Next, We start fitting ARIMA(0, 1, 2) model to the logged data and go through 
the same diagnosis as the model above.
```{r}
# Fitting ARIMA (0, 1, 2) model to the logged data
MA2 = sarima(log_gdp, 0, 1, 2)
```

The standardized Residuals plot, the ACF of Residuals does not show show any 
indication of serial correlation. The normal Q-Q Plot of Std Residuals suggest 
most of the residuals are Normally  distributed except for few outliers at the 
end of the two ends. The LJung-Box statistics plot p-values are all above the 
reasonable significant level for most lags, unlike the ARIM(1, 1, 0) above. 
```{r}
MA2$ttable
```
We can see that the coefficients and the constant term of this model are all 
significant

Using Ljung-Box statistics, we decided to proceed with ARIMA(0, 1, 2) for the 
prediction. For further Proof of the superiority of ARIMA(0, 1, 2), we can take 
a look at the AIC, AICc, BIC of each model. 
```{r}
# AIC, AICc, BIC of ARIMA(1,1,0) and ARIMA(0, 1, 2)
entries = data.frame("ARIMA(1,1,0)" = c(AR1$AIC, AR1$AICc, AR1$BIC),
                     "ARIMA(0,1,2)" = c(MA2$AIC, MA2$AICc, MA2$BIC),
                     row.names = c("AIC", "AICc","BIC"))
knitr::kable(entries)
```

We can that the AIC, and AICc of ARIMA(0, 1, 2) are less than that of 
ARIMA(1,1,0) although the difference is not that much. Therefore, we are going 
to use ARIMA(0, 1, 2) for predictions.

## Predictions
We will try and forecast the next 10 quarters of the logged GDP data using 
ARIMA(0, 1, 2) model 
```{r}
#predicting next 10 quarters of the logged GDP data
pred = sarima.for(log_gdp, 10, 0, 1, 2)

#95% confidence prediction interval and taking the anti-log
#Upper 95% CI for the prediction and taking anti-log
upper = exp(pred$pred + qnorm(0.975)*pred$se)
#Lower 95 CI for the prediction and taking anti-log
lower = exp(pred$pred - qnorm(0.975)*pred$se)
#Putting the CI interval into a table
pred_data = data.frame(Predicted = exp(pred$pred), Lower_95 = lower, 
                       Upper_95 = upper)

knitr::kable(pred_data)
```

We can see that our intervals are not that wide, which might mean our model 
might have done good job of fitting the data well. But, we can see towards the 
end as we move to values far away from the observed data, the prediction 
interval gets wider. This is expected for any prediction. Our model is likely 
to do better job of predicting values closer to the observed data. 

## Spectral Analysis:
For the rest of this document, we will perform spectral analysis to identify 
the three most dominant periods. In most of the time series data, cyclic 
dynamics are the rule rather than the exception. Spectral analysis enables us 
to estimate the periodicity of time series data. What spectral analysis is in a 
non-technical explanation is the decomposition of a time series into underlying 
sin and cosine functions of different frequencies. The aim to determine those 
frequencies that seem particularly strong.
```{r}
library(MASS)
# spectral analysis for GDP data for USA
gdp.per = mvspec(gdp, log = "yes")

# Frequencies 
P = gdp.per$details[order(gdp.per$details[, 3], decreasing = TRUE), ]
# the three most dominant Frequencies
P[1:3,]
#95% confidence intervals  for the three dominant frequencies
gdp_u1 = 2*P[1, 3]/qchisq(0.025, 2)
gdp_l1 = 2*P[1, 3]/qchisq(0.975, 2)
gdp_u2 = 2*P[2, 3]/qchisq(0.025, 2)
gdp_l2 = 2*P[2, 3]/qchisq(0.975, 2)
gdp_u3 = 2*P[3, 3]/qchisq(0.025, 2)
gdp_l3 = 2*P[3, 3]/qchisq(0.975, 2)
# creating tables for nice formatting.
knitr::kable(data.frame(series = rep("gdp", 3), Dominant.Freq = P[1:3, 1],
                        Spec = P[1:3,3], Lower = c(gdp_l1, gdp_l2, gdp_l3), 
                        Upper = c(gdp_u1, gdp_u2, gdp_u3)))
```

We see that the confidence intervals are extremely wide for all three cases, 
therefore, we can't establish the significance of the peaks.