-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathARIMA-models.Rmd
199 lines (167 loc) · 7.6 KB
/
ARIMA-models.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
---
title: "Time Series Analysis - ARIMA Models"
author: "Abdullahi Hussein"
date: "03/09/2021"
output:
pdf_document: default
html_document:
df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Exploring the data
In this article, we are going to use the USA quarterly GDP available in **astsa** library in R.
We will start with loading the required libraries and the data.
```{r}
# if library not installed use
# install.packages("astsa")
# loading the library
library(astsa)
#loading the data
data(gdp)
#plotting the data. This is a time series data so we use the appropriate plots
plot.ts(gdp, col = "darkblue")
```
The the above plot we see that Y-axis values are two large. This means, our data is skewed. Since, the data is
skewed and all positive, we can use log transformation.
```{r}
# Takaing the log of the data
log_gdp = log(gdp)
# plotting the logged data
plot.ts(log_gdp, xlab = "Yearl Quearters", ylab = "Logged Queartly GDP",
main = "Us logged Quarterly GDP from 1950 to 2020", col = "darkorange")
```
We can clearly see from the above plot, that the data has an upward trend. Therefore, the data is not stationary. To confirm this,
we will plot the Auto-correlation function of the data.
```{r}
# plotting the ACF of logged gdp data
acf(log_gdp, lag.max = 54, main = "ACF of Logged GDP")
```
The sample ACF, $\hat \rho(h)$ does to decay to zero fast enough as h, which is the time lag increases.
This suggest the use of differencing transformation to remove non-stationarity of the data.
```{r}
# Taking the first difference of the logged data and plotting it.
plot(diff(log_gdp), xlab = "Yearly Quarters", ylab = "differenced logged gdp",
main = "First difference of logged DGP data", col = "darkblue")
# Adding a line
abline(h = 0.01, col = "red")
```
From the plot of the differenced data, we can see the trend element of the data has been removed, and the data seems stationary.
We can plot the ACF and the PACF (Partial Auto-correlation function) of the differenced data to decide what models need to be fitted.
```{r}
# plots of ACF and PACF of differenced data
P_acf = acf2(diff(log_gdp), max.lag = 54, main = "First difference of logged gdp")
```
From the two plots above, it seems that PACF cuts of at lag 1 and ACF tails off at lag1. This
suggests ARIMA(1, 1, 0) model for the logged data. We can also argue that ACF cuts of at lag 2 and
PACF tails off at lag 2. This suggest ARIMA(0, 1, 2) for the logged data.
## Fitting ARIMA models
Our initial exploration suggested two different ARIMA models. we are going to fit both of them
and decide which one explains the data lag the best.
```{r,message=FALSE}
# Fitting ARIMA(1,1, 0). Note, this is an Autogressive model with lag 1.
AR1 = sarima(log_gdp, 1, 1, 0,)
```
We can see that **Standardized Residual** plot looks like the following white noise plot. This
means, there is no indication of non constant variance.
```{r}
white_noise = rnorm(150,0, 0.1)
plot.ts(white_noise)
```
ACF residuals are within the blue lines. This suggest, there is no serial correlation of the
residuals at different lags. We can see almost all of ACF residuals fall within
$2/(\sqrt{n})$ where $n$ is th lenght of our dataset. This is given by
`r 2/sqrt(length(log_gdp))`
From Normal Q-Q plot of the std Residuals, we see that most of the residuals are normally
distributed except may be few outliers at the two ends.
However, We can see that most of the p-value for **Ljung-Box statistic** are within a significance
level of 5%. Therefore, we will not reject the hypothesis that the residuals have some serial
correlation (if most of the p-vluess are on or below the blue line, there is indication of
autocorrelation).
```{r}
#coefficents and their p-values
AR1$ttable
```
We can see that the p-value fo the coefficiens and the constant term are both
significant.
Next, We start fitting ARIMA(0, 1, 2) model to the logged data and go through
the same diagnosis as the model above.
```{r}
# Fitting ARIMA (0, 1, 2) model to the logged data
MA2 = sarima(log_gdp, 0, 1, 2)
```
The standardized Residuals plot, the ACF of Residuals does not show show any
indication of serial correlation. The normal Q-Q Plot of Std Residuals suggest
most of the residuals are Normally distributed except for few outliers at the
end of the two ends. The LJung-Box statistics plot p-values are all above the
reasonable significant level for most lags, unlike the ARIM(1, 1, 0) above.
```{r}
MA2$ttable
```
We can see that the coefficients and the constant term of this model are all
significant
Using Ljung-Box statistics, we decided to proceed with ARIMA(0, 1, 2) for the
prediction. For further Proof of the superiority of ARIMA(0, 1, 2), we can take
a look at the AIC, AICc, BIC of each model.
```{r}
# AIC, AICc, BIC of ARIMA(1,1,0) and ARIMA(0, 1, 2)
entries = data.frame("ARIMA(1,1,0)" = c(AR1$AIC, AR1$AICc, AR1$BIC),
"ARIMA(0,1,2)" = c(MA2$AIC, MA2$AICc, MA2$BIC),
row.names = c("AIC", "AICc","BIC"))
knitr::kable(entries)
```
We can that the AIC, and AICc of ARIMA(0, 1, 2) are less than that of
ARIMA(1,1,0) although the difference is not that much. Therefore, we are going
to use ARIMA(0, 1, 2) for predictions.
## Predictions
We will try and forecast the next 10 quarters of the logged GDP data using
ARIMA(0, 1, 2) model
```{r}
#predicting next 10 quarters of the logged GDP data
pred = sarima.for(log_gdp, 10, 0, 1, 2)
#95% confidence prediction interval and taking the anti-log
#Upper 95% CI for the prediction and taking anti-log
upper = exp(pred$pred + qnorm(0.975)*pred$se)
#Lower 95 CI for the prediction and taking anti-log
lower = exp(pred$pred - qnorm(0.975)*pred$se)
#Putting the CI interval into a table
pred_data = data.frame(Predicted = exp(pred$pred), Lower_95 = lower,
Upper_95 = upper)
knitr::kable(pred_data)
```
We can see that our intervals are not that wide, which might mean our model
might have done good job of fitting the data well. But, we can see towards the
end as we move to values far away from the observed data, the prediction
interval gets wider. This is expected for any prediction. Our model is likely
to do better job of predicting values closer to the observed data.
## Spectral Analysis:
For the rest of this document, we will perform spectral analysis to identify
the three most dominant periods. In most of the time series data, cyclic
dynamics are the rule rather than the exception. Spectral analysis enables us
to estimate the periodicity of time series data. What spectral analysis is in a
non-technical explanation is the decomposition of a time series into underlying
sin and cosine functions of different frequencies. The aim to determine those
frequencies that seem particularly strong.
```{r}
library(MASS)
# spectral analysis for GDP data for USA
gdp.per = mvspec(gdp, log = "yes")
# Frequencies
P = gdp.per$details[order(gdp.per$details[, 3], decreasing = TRUE), ]
# the three most dominant Frequencies
P[1:3,]
#95% confidence intervals for the three dominant frequencies
gdp_u1 = 2*P[1, 3]/qchisq(0.025, 2)
gdp_l1 = 2*P[1, 3]/qchisq(0.975, 2)
gdp_u2 = 2*P[2, 3]/qchisq(0.025, 2)
gdp_l2 = 2*P[2, 3]/qchisq(0.975, 2)
gdp_u3 = 2*P[3, 3]/qchisq(0.025, 2)
gdp_l3 = 2*P[3, 3]/qchisq(0.975, 2)
# creating tables for nice formatting.
knitr::kable(data.frame(series = rep("gdp", 3), Dominant.Freq = P[1:3, 1],
Spec = P[1:3,3], Lower = c(gdp_l1, gdp_l2, gdp_l3),
Upper = c(gdp_u1, gdp_u2, gdp_u3)))
```
We see that the confidence intervals are extremely wide for all three cases,
therefore, we can't establish the significance of the peaks.