---
title: "Spectral Analysis"
output: html_notebook
author: "Luca Laringe"
---
In this notebook I will give some examples of functions and tools related to spectral analysis in the context of time-series analysis. Any time series can be viewed as a discrete signal which, thanks to the spectral representation theorem, can in turn be seen as an infinite sum of sinusoidal waves of different frequencies and intensities. Spectral analysis focuses on the frequencies that explain most of the variance of the signal and can therefore help uncover cyclical patterns.
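Loosely stated (this equation is only a reference point and is not used directly in the code below), the spectral representation theorem says that a covariance-stationary process $Y_t$ can be written as

$$
Y_t = \mu + \int_0^{\pi} \alpha(\omega)\cos(\omega t)\,d\omega + \int_0^{\pi} \delta(\omega)\sin(\omega t)\,d\omega,
$$

where $\alpha(\omega)$ and $\delta(\omega)$ are random weights attached to each frequency $\omega$. The spectrum then describes how the variance of $Y_t$ is distributed across frequencies, which is exactly what the periodogram estimated below tries to capture.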
```{r, include=FALSE}
library(ggplot2)
# ggplot theme
par(mfrow = c(2,1), mar = c(3,5,4,4), cex = 0.4)
th <- theme_minimal() +
theme(plot.title = element_text(size=11, face="bold"),
axis.title.x = element_text(size=9),
axis.title.y = element_text(size=9),
legend.text = element_text(size=7),
legend.title = element_text(size=9),
axis.text = element_text(size=7),
strip.text = element_text(size=10, face = "italic"),
panel.border = element_rect(colour = "black", fill = NA, size = 1))
```
```{r, message=FALSE, warning = FALSE, output=FALSE}
# Start by importing a dataset
library(tseries)
library(forecast)
library(ggplot2)
library(quantmod)
library(xts)
# We download the NSA (not seasonally adjusted) unemployment rate from the FRED database
getSymbols('UNRATENSA',src='FRED')
unemployement_rate <- UNRATENSA
autoplot(unemployement_rate) + th
```
The series we just downloaded and plotted is the US unemployment rate (in %, not seasonally adjusted). We are going to explore this time series in the frequency domain using spectral analysis, in order to understand its seasonalities and longer-term cycles. I will remove the 2020 values from the series, since due to the covid-19 crisis they are clearly outliers. At the end of the notebook, I will separate the year 2019 from the "training" sample and build a model to forecast the 2019 data, making use of the findings from the spectral analysis.
```{r, message=FALSE, warning = FALSE}
# transform xts object into ts one
unemployement_rate <- ts(unemployement_rate, start = c(1948,1) , end = c(2020,8), frequency = 12)
# Keep 2019 test data in a separate object
unemployement_rate_train <- window(unemployement_rate, start = c(1948,1), end = c(2018,12))
unemployement_rate_test <- window(unemployement_rate, start = c(2019,1), end = c(2019,12))
autoplot(as.xts(unemployement_rate_train)) + th
```
Before turning to spectral analysis, let's first conduct a classic autocorrelation analysis. A first step to analyse seasonality patterns is to plot the autocorrelation chart for both the actual series and the differenced series, plus a seasonal boxplot. The latter shows the distribution of the target variable conditional on the month.
```{r}
ggAcf(unemployement_rate_train, lag.max = 50) + th
ggAcf(diff(unemployement_rate_train, lag=1), lag.max = 50) + th
```
Autocorrelations peak at lag 12 because every month is related to the same month of the previous year; the same holds for multiples of 12 such as 24, 36 and so on. Autocorrelations bottom out at lag 6 (and likewise at 18, 30, ...), which follows from the previous point: since a yearly cycle is present, the bottom of the cycle sits halfway between the peaks.
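As a quick numerical check of this claim (a small sketch added here; for a monthly `ts`, element k+1 of the vector returned by `acf()` corresponds to a lag of k months):
```{r}
# Sample autocorrelations at lags 6, 12, 18 and 24 months:
# the values at 12 and 24 should be local peaks, those at 6 and 18 local dips.
acf_vals <- acf(unemployement_rate_train, lag.max = 50, plot = FALSE)
round(acf_vals$acf[c(6, 12, 18, 24) + 1], 2)
```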
```{r}
boxplot(unemployement_rate_train ~ cycle(unemployement_rate_train), xlab = "Month", ylab = "Unemployment rate (%)", main = "US unemployment rate, 1948-2018")
```
As we can see, the unemployment rate tends to be higher during winter months and lower during summer ones. This method is useful to analyse seasonalities that stem from the natural course of the year. Another step could be to use time-series decomposition: the target series is decomposed into a trend, a seasonal component (which we are now interested in) and finally a random component.
```{r}
decompose_unemployement_rate_train <- decompose(unemployement_rate_train)
plot(decompose_unemployement_rate_train)
```
Finally we come to spectral analysis. Spectral analysis studies a time series in the frequency domain rather than in the time domain. In particular, by viewing a time series as a sum of many sinusoidal functions with different frequencies, it allows us to understand which frequencies explain the majority of the series' variance, and thus to better understand its seasonalities and cycles. Since we already have a clear picture of the seasonality with a 1-year cycle (from the boxplot), we can difference the data (year on year) and focus on longer-term cycles. Doing so allows us to disregard high-frequency cycles (more than once per year) and focus on lower frequencies, i.e. cycles with longer periods.
Before analysing the real data, I will simulate a dataset with a known cyclical pattern to better explain how the spectrum estimation works. I will keep the same structure as the unemployment data so that the same functions and methodology can be applied directly later.
```{r}
# Keep the same time index as the unemployment series
period_yearly <- 7 # cycle repeats every period_yearly years
frequency_yearly <- 1/period_yearly
sampling_interval <- 1/12 # Means we measure observations each month
t <- time(unemployement_rate)
x <- seq(1, length(t), by = 1/12) # time in years; ts() below keeps only the first length(t) values
y <- 8 + 2*sin(2*frequency_yearly*pi*x)
cyclical_ts <- ts(y, start = c(1948,1) , end = c(2020,8), frequency = 12)
autoplot(cyclical_ts) + th
```
```{r}
# Let's now add a random disturbance
epsilon <- rnorm(length(t), 0, 1)
cyclical_ts <- cyclical_ts + epsilon
differenced_cyclical_ts <- diff(cyclical_ts, lag = 12)
autoplot(cyclical_ts) +th
autoplot(differenced_cyclical_ts) +th
```
Let's estimate the cycle period using the sample periodogram. The sample periodogram is an estimate of the population spectrum. The population spectrum is a function of frequency rather than time, and integrated over a certain interval of frequencies it tells us the portion of the variance explained by sinusoidal waves with those frequencies. The sample periodogram can be estimated using the R function "spectrum". Since the classic sample periodogram does not have good asymptotic properties, its estimates are often either smoothed (using different kernel structures) or made parametric (the random process usually assumed is an AR one).
The spectrum function defaults to a logarithmic scale for the spectrum values, but we can change this by setting the log parameter to "no". Moreover, in general the default frequency axis is in cycles per sampling interval (in our case a sampling interval is a month, i.e. 1/12 of a year). It is more intuitive to convert the frequency axis to cycles per unit time (in our case the unit time is a year); we can do this by extracting the frequency values that R returns and dividing by the length of the sampling interval. Nevertheless, since we are working with ts objects, R already does this automatically, so we do not need to worry about it. Finally, we should also multiply the estimated spectral density by 2 so that the area under the periodogram actually equals the variance of the time series.
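As a cross-check of these conventions (a minimal sketch, not part of the analysis below), the raw periodogram can also be computed directly from the discrete Fourier transform. Note that spectrum() detrends and tapers by default, so the two estimates will not match exactly:
```{r}
# Raw periodogram via fft(): |DFT|^2 / n, with the frequency axis converted
# from cycles per observation to cycles per year via the sampling interval.
x_num <- as.numeric(differenced_cyclical_ts)
n <- length(x_num)
raw_pgram <- Mod(fft(x_num - mean(x_num)))^2 / n
raw_freq <- (seq_len(n) - 1) / n / sampling_interval # cycles per year
keep <- 2:floor(n / 2) # drop frequency 0 and the redundant upper half
plot(raw_freq[keep], raw_pgram[keep], type = "l",
     xlab = "frequency (cycles per year)", ylab = "raw periodogram")
```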
```{r}
# Smoothed Sample Periodogram
# Smoothing reduces the number of peaks and may blur what we are looking for.
# The span argument gives the width of the modified Daniell smoother applied to the periodogram.
sp <- spectrum(differenced_cyclical_ts,log="no",span=4,plot=FALSE)
spx <- sp$freq #/sampling_interval # (adjusts frequencies to reflect the sampling interval) (this is done automatically for ts object which have already a frequency in them)
spy <- 2*sp$spec # multiplies by 2 so that integral sums to variance
plot(spy~spx,xlab="frequency",ylab="spectral density",type="l")
```
```{r}
# Let us find the frequency of our estimated cycle
cycle_freq=spx[which.max(spy)] #Extract the dominant frequency
print(cycle_freq)
```
```{r}
# The cycle frequency corresponds to a period of:
cycle_period <- 1/cycle_freq
print(cycle_period)
```
The estimated cycle period is 6.54 years, not very far from the period of 7 upon which we built the series. We can go further: let's say we want to compute the fraction of variance explained by frequencies between a = 0 and b = 1/3, corresponding to periods greater than 3 years. In order to do so, we have to integrate the spectral density between a and b. Since right now the spectral density is just a set of points, we need to interpolate them in order to create an actual function that R can integrate.
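In symbols, writing $\hat{s}(\nu)$ for the rescaled spectral density (twice the estimated spectrum) and $\nu_{\max}$ for the highest estimated frequency, the quantity we are after is

$$
\frac{\int_a^b \hat{s}(\nu)\,d\nu}{\int_0^{\nu_{\max}} \hat{s}(\nu)\,d\nu}
\approx \frac{\int_a^b \hat{s}(\nu)\,d\nu}{\widehat{\operatorname{Var}}(x_t)},
$$

since, as noted above, the rescaled spectral density integrates (approximately) to the variance of the series.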
```{r}
# Approximate spectral density function by linear interpolation
sp_density <- approxfun(spx, spy, yleft=0, yright = 0)
# You can also approximate it with a non-linear spline
sp_density_spline <- splinefun(spx, spy)
plot(sp_density, from = min(spx), to = max(spx)) # plot the interpolated density over the full frequency range
```
```{r}
# Check that the integral of the spectral density over all frequencies equals the variance of the time series
#integrate(Vectorize(sp_density), lower = -Inf, upper = Inf)
library("pracma") # For numerical integration algorithms
# Linear interpolation
quad(Vectorize(sp_density), min(spx), max(spx)) # adaptive Simpson
#quadgk(Vectorize(sp_density), min(spx), max(spx)) # adaptive Gauss-Kronrod
#quadl(Vectorize(sp_density), min(spx), max(spx)) # adaptive Lobatto
# Spline interpolation
#quad(Vectorize(sp_density_spline), min(spx), max(spx)) # adaptive Simpson
#quadgk(Vectorize(sp_density_spline), min(spx), max(spx)) # adaptive Gauss-Kronrod
#quadl(Vectorize(sp_density_spline), min(spx), max(spx)) # adaptive Lobatto
```
```{r}
var(differenced_cyclical_ts)
```
As we can see, the integral of the spectral density is very close to the variance of the series; the small error can be attributed to the interpolation and the numerical integration. We can now compute the fraction of variance explained by cycles longer than 3 years.
```{r}
quad(Vectorize(sp_density), 0, 1/3)/var(differenced_cyclical_ts)
```
It is almost 50%. Let us now turn to the real unemployment data. As discussed above, we first seasonally difference the series and then carry on with the spectral analysis.
```{r}
unemployement_rate_train_diff <- diff(unemployement_rate_train, lag = 12)
autoplot(unemployement_rate_train_diff) + th
```
```{r}
# Let's plot the auto correlation of the newly defined series
ggAcf(unemployement_rate_train_diff) + th
```
The autocorrelations up to lag 12 follow from the autocorrelations of the initial series. Let's now estimate the spectrum.
```{r}
# Sample Periodogram
#spectrum(x=unemployement_rate_train_diff, log="no")
# Smoothed Sample Periodogram
sp <- spectrum(unemployement_rate_train_diff, span=4, log="no", plot=FALSE)
# Parametric spectrum estimation assuming an AR process
sp_ar <- spectrum(unemployement_rate_train_diff, method="ar",log="no", plot=FALSE)
plot(sp$freq[1:100], sp$spec[1:100], type="l")
plot(sp_ar$freq[1:100], sp_ar$spec[1:100], type="l")
```
```{r}
# Let us find the frequency of our estimated cycle
spx <- sp$freq
spy <- 2*sp$spec
cycle_frequency <- spx[which.max(spy)]
print(cycle_frequency)
```
```{r}
# The cycle frequency corresponds to a period of:
cycle_period <- (1/sp$freq[which.max(sp$spec)])
print(cycle_period)
```
Hence, according to this spectral analysis, an unemployment cycle lasts approximately 6 years, which is in line with the literature estimating the mean duration of a business cycle at about 5.5 years.
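As a side exercise (a sketch mirroring the variance decomposition done above on the simulated series; the exact share depends on the smoothing span), we can compute how much of the variance of the differenced unemployment series is explained by cycles longer than 3 years:
```{r}
# Share of variance of the seasonally differenced unemployment series
# attributable to frequencies between 0 and 1/3 cycles per year.
sp_density_unemp <- approxfun(spx, spy, yleft = 0, yright = 0)
quad(Vectorize(sp_density_unemp), 0, 1/3) / var(unemployement_rate_train_diff)
```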
To conclude, let us fit a time series model and test our predictions against 2019 data.
```{r, message=FALSE, warning=FALSE}
model <- auto.arima(unemployement_rate_train, seasonal = TRUE)
f <- forecast(model, 12)
y_hat <- f$mean
lower<- f$lower[,2]
upper<- f$upper[,2]
actual <- unemployement_rate_test
library(lubridate)
time <- as.numeric(time(actual))
# date_decimal() converts decimal years into a 'POSIXct, POSIXt' object
time <- date_decimal(time)
data <- data.frame(time, y_hat, lower, upper, actual)
ggplot(data, aes(x = time, y = actual)) + th +
scale_y_continuous(expand = c(0,0)) +
geom_line(aes(y = actual)) +
geom_line(aes(y = y_hat), col="red") +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha=0.1)
```
```{r}
# Accuracy measures
accuracy(y_hat, actual)
```
Let's see if we can improve the model by adding a regressor that captures the 6-year cycle found with the spectral analysis.
```{r}
# Generate additional regressors
freq = 1/6
period = 1/freq
sampling_interval <- 1/12 # Means we measure observations each month
x <- seq(1,length(t), by=1/12)
X <- sin(2*pi*freq*x) # sinusoid with a period of 6 years (the cycle found above)
X <- ts(X, c(1948,1), c(2019,12), frequency = 12)
X_train <- window(X, start = c(1948,1), end = c(2018,12))
X_test <- window(X, start = c(2019,1), end = c(2019,12))
plot(X_train)
```
```{r}
model <- auto.arima(unemployement_rate_train, seasonal = TRUE, xreg = X_train)
f <- forecast(model, 12, xreg = X_test)
y_hat <- f$mean
lower<- f$lower[,2]
upper<- f$upper[,2]
actual <- unemployement_rate_test
library(lubridate)
time <- as.numeric(time(actual))
# date_decimal() converts decimal years into a 'POSIXct, POSIXt' object
time <- date_decimal(time)
data <- data.frame(time, y_hat, lower, upper, actual)
ggplot(data, aes(x = time, y = actual)) + th +
scale_y_continuous(expand = c(0,0)) +
geom_line(aes(y = actual)) +
geom_line(aes(y = y_hat), col="red") +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha=0.1)
```
```{r}
# Accuracy measures, to compare with the previous model
accuracy(y_hat, actual)
```
It seems that including the additional regressor that captures the 6-year cycle increased the accuracy of the model on the test set. For instance, the RMSE went from 0.63 to 0.60.