This repository contains research on inflation forecasting with different Machine Learning methods in Peru. We predict headline and core inflation for two periods: 2019 and 2023.
The repository is divided as follows:
- `code`: This folder contains two subfolders, `headline_inflation` and `core_inflation`. Each one contains the corresponding code for the prediction task for two periods, `2019` and `2023`. Each of these folders is divided into the following notebooks:
  - `1_DataExtraction_###.ipynb`: In this notebook we use the API of the Central Reserve Bank of Peru (BCRP) to extract our data. We apply the corresponding transformations to each series and append them into two dataframes: `df_raw_###.csv`, which contains contemporaneous variables and is used for visualization, and `df_lags_###.csv`, which additionally contains lagged variables and is used for the prediction tasks.
  - `2_DataVisualization_###.ipynb`: In this notebook we load the files created in the first notebook and apply different visualization techniques to analyze our data and better understand the relationships between the variables. We produce a pairplot and several heatmaps, as well as a plot of our input variables.
  - `3_Regression_###.ipynb`: This notebook contains the different regression and prediction tasks for all models.
- `input`: CSV files created in the `1_DataExtraction_###.ipynb` notebook and later used in other notebooks are saved here.
- `output`: The results from each Jupyter notebook are saved here in a corresponding folder.
- `modules`: Our functions are defined here.
- `report`: The PDF of the research can be found here.
The `###` at the end of each Jupyter notebook indicates in which subfolder the notebook is located:

- `C##`: `core_inflation`
  - `C19`: `core_inflation/2019`
  - `C23`: `core_inflation/2023`
- `H##`: `headline_inflation`
  - `H19`: `headline_inflation/2019`
  - `H23`: `headline_inflation/2023`

The same interpretation applies to the `###` at the end of the `.csv` and `.png` files in the `output` folder, which indicates the corresponding variable and year.
Given the monthly price level $P_{t}$, the year-over-year inflation rate is computed as

$$\pi_{t} = 100 \times \left( \frac{P_{t}}{P_{t-12}} - 1 \right),$$

where $P_{t}$ denotes the consumer price index in month $t$ and $\pi_{t}$ is the corresponding 12-month inflation rate.
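As an illustration, the following is a minimal pandas sketch of this transformation; the `cpi` column name and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical example: 'cpi' is the monthly price level series indexed by date.
df = pd.DataFrame(
    {"cpi": [100.0, 100.4, 100.9, 101.2, 101.8, 102.1,
             102.5, 102.9, 103.4, 103.8, 104.1, 104.6, 105.0]},
    index=pd.date_range("2018-01-01", periods=13, freq="MS"),
)

# Year-over-year inflation: percentage change of the price level over 12 months.
df["inflation_yoy"] = df["cpi"].pct_change(periods=12) * 100

print(df["inflation_yoy"].dropna())
```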
To assess the predictive power of the different ML models, we first standardize the data to ensure all features are on the same scale. We then divide our data into two consecutive sub-samples: training and testing. In the training sample we both fit the models and calibrate the hyperparameters. To do that, we implement a time series cross-validation which, unlike other forms of cross-validation, respects the temporal structure of the data. The training sample is split into consecutive folds, so that each model is always validated on observations that come after the ones used to fit it. For every fold, the tuning procedure is as follows:
- Given the hyperparameter space $\Theta$, with $\theta \in \Theta$, we define a grid containing a set of $l$ hyperparameters to evaluate, $G = \{\theta_{1}, \theta_{2}, \ldots, \theta_{l}\}$.
- We train the model using the training data and the hyperparameters $\theta_{i}$: $M_{i} = M(\theta_{i}, D_{\text{training}})$.
- We evaluate the performance of the model on the validation set using the performance metric $L_{i} = L(M_{i}, D_{\text{training}}, D_{\text{validation}})$.
- The hyperparameter is chosen to minimize the performance metric (e.g., MSE): $\theta^{*} = \arg\min_{\theta_{i} \in G} L_{i}$.
The process is repeated until all folds have been used to calibrate the tuning parameters. The final hyperparameters are those that, on average, minimize the metric across the different folds. This means that each ML model is trained a total of $l \times k$ times, where $k$ is the number of folds.
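The snippet below is a minimal sketch of this tuning loop, using `TimeSeriesSplit` from Scikit-Learn and a Ridge regression as an illustrative estimator; the data and the grid values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(2024)
X, y = rng.normal(size=(120, 8)), rng.normal(size=120)  # hypothetical monthly data

grid = [0.01, 0.1, 1.0, 10.0]        # G = {theta_1, ..., theta_l}
tscv = TimeSeriesSplit(n_splits=5)   # folds that respect the time ordering

avg_loss = {}
for theta in grid:
    fold_losses = []
    for train_idx, val_idx in tscv.split(X):
        model = Ridge(alpha=theta).fit(X[train_idx], y[train_idx])  # M_i = M(theta_i, D_training)
        pred = model.predict(X[val_idx])
        fold_losses.append(mean_squared_error(y[val_idx], pred))    # L_i on the validation fold
    avg_loss[theta] = np.mean(fold_losses)

best_theta = min(avg_loss, key=avg_loss.get)  # hyperparameter minimizing the average MSE
print(best_theta, avg_loss[best_theta])
```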
This section provides an overview of the Machine Learning methods employed in the research, including their implementation and evaluation strategies. We consider three econometric models (RW, VAR, ARIMA) and four machine learning models (LASSO, Ridge, EN, and RF). For model comparison, we use the root mean squared error (RMSE) and the mean absolute percentage error (MAPE) of the predicted values against the realized values.
The least absolute shrinkage and selection operator (LASSO) was first developed as a frequentist shrinkage method by Tibshirani (1996). In machine learning, it is used as a method for feature selection and regularization. The LASSO regression adds a penalty term that depends on the absolute value of the regression coefficients.
Given the following multivariate linear regression model

$$y = X\beta + \varepsilon,$$

the LASSO coefficients are chosen by minimizing the penalized sum of squared residuals

$$\text{LASSO} = \min_{\beta} \left\{ \|y - X\beta\|_{2}^{2} + \lambda \|\beta\|_{1} \right\},$$

or equivalently, as a constrained problem:

$$\min_{\beta} \|y - X\beta\|_{2}^{2} \quad \text{subject to} \quad \|\beta\|_{1} \leq t,$$

where the term $\lambda\|\beta\|_{1}$ is a regularization of type $\ell_{1}$ and $\lambda \geq 0$ controls the strength of the penalty.
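As a quick illustration of the feature-selection effect of the $\ell_{1}$ penalty, the following sketch (with simulated data) counts how many coefficients the Scikit-Learn `Lasso` estimator sets exactly to zero as the penalty grows.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2024)
X = StandardScaler().fit_transform(rng.normal(size=(100, 20)))  # standardized predictors
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Larger penalties shrink more coefficients exactly to zero (feature selection).
for lam in [0.01, 0.1, 0.5]:
    lasso = Lasso(alpha=lam).fit(X, y)
    n_zero = np.sum(lasso.coef_ == 0.0)
    print(f"lambda={lam}: {n_zero} of {X.shape[1]} coefficients set to zero")
```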
Consider the same linear regression model as in LASSO. The ridge coefficients are chosen by imposing a penalty on the squared estimates:

$$\text{Ridge} = \min_{\beta} \left\{ \|y - X\beta\|_{2}^{2} + \lambda \|\beta\|_{2}^{2} \right\},$$

where the term $\lambda\|\beta\|_{2}^{2}$ is a regularization of type $\ell_{2}$ and $\lambda \geq 0$ is the regularization parameter.

As $\lambda$ increases, the coefficients shrink toward zero but, unlike LASSO, they are not set exactly to zero, so ridge does not perform variable selection.
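The following sketch (again with simulated data) illustrates this behavior: the $\ell_{2}$ norm of the ridge coefficients shrinks as $\lambda$ grows, but no coefficient becomes exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2024)
X, y = rng.normal(size=(100, 20)), rng.normal(size=100)

# As lambda grows, the l2 norm of the coefficients shrinks toward zero,
# but (unlike LASSO) individual coefficients are not set exactly to zero.
for lam in [0.1, 10.0, 1000.0]:
    ridge = Ridge(alpha=lam).fit(X, y)
    print(f"lambda={lam}: ||beta||_2 = {np.linalg.norm(ridge.coef_):.4f}, "
          f"exact zeros = {np.sum(ridge.coef_ == 0.0)}")
```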
Least Angle Regression (LARS) is an alternative method for feature selection and regularization introduced by Efron et al. (2004). Like other regularization methods, it is particularly useful with high-dimensional data, where the number of predictors is large relative to the number of observations. The algorithm is also more computationally efficient than LASSO.
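A minimal illustration with Scikit-Learn's `lars_path`, which computes the whole sequence of LARS solutions in a single pass; the data are simulated.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(2024)
X, y = rng.normal(size=(60, 30)), rng.normal(size=60)  # more predictors relative to observations

# lars_path returns the penalties, the order in which predictors enter the
# active set, and the full coefficient path.
alphas, active, coefs = lars_path(X, y, method="lar")
print("Order in which predictors enter the model:", active[:5])
```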
Elastic Net is another regularization technique used in linear regressions. It combines both the $\ell_{1}$ penalty of LASSO and the $\ell_{2}$ penalty of ridge:

$$\text{Elastic Net} = \min_{\beta} \left\{ \|y - X\beta\|_{2}^{2} + \lambda \left( \rho \|\beta\|_{1} + (1 - \rho)\|\beta\|_{2}^{2} \right) \right\},$$

where $\|\beta\|_{1}$ and $\|\beta\|_{2}^{2}$ correspond to their specific regularizations, and $\rho \in [0, 1]$ controls the relative weight of the two penalties: $\rho = 1$ recovers LASSO and $\rho = 0$ recovers ridge.
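In Scikit-Learn, the penalty $\lambda$ corresponds to the `alpha` argument and $\rho$ to `l1_ratio` (up to the package's internal scaling constants); a minimal sketch with simulated data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2024)
X, y = rng.normal(size=(100, 20)), rng.normal(size=100)

# l1_ratio=1 reduces to LASSO, l1_ratio=0 reduces to Ridge (up to scaling).
en = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("non-zero coefficients:", np.sum(en.coef_ != 0.0))
```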
Let us assume a regression tree model of the form

$$f(x) = \sum_{j=1}^{J} c_{j} \, \mathbf{1}\{x \in R_{j}\},$$

where $R_{1}, \ldots, R_{J}$ are the terminal regions into which the tree partitions the predictor space and $c_{j}$ is the constant prediction assigned to region $R_{j}$.
In the context of Random Forest (RF), each time a split in a tree is made, a random sample of $m$ predictors is chosen as split candidates from the full set of $p$ predictors. This decorrelates the individual trees, and the forest forecast is obtained by averaging the predictions of all trees.
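A minimal sketch with Scikit-Learn's `RandomForestRegressor`, where `max_features` controls the number of predictors sampled at each split; the data and settings are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2024)
X, y = rng.normal(size=(200, 12)), rng.normal(size=200)

# max_features="sqrt" samples roughly sqrt(p) predictors at each split;
# the forest prediction averages over all n_estimators trees.
rf = RandomForestRegressor(n_estimators=300, max_features="sqrt", random_state=2024)
rf.fit(X, y)
print(rf.predict(X[:3]))
```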
Support Vector Regression (SVR) is a type of Support Vector Machine (SVM) adapted for regression tasks. Unlike traditional regression models, which aim to minimize the error between predicted and actual values, SVR seeks a function that fits the data within a specified margin of tolerance, balancing model complexity against prediction accuracy. SVR attempts to find a linear function $f(\mathbf{x}_{i}) = \mathbf{w} \cdot \mathbf{x}_{i} + b$ that is as flat as possible:

$$\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^{2}$$

subject to:

$$y_{i} - (\mathbf{w} \cdot \mathbf{x}_{i} + b) \leq \alpha$$

$$(\mathbf{w} \cdot \mathbf{x}_{i} + b) - y_{i} \leq \alpha$$

where $\mathbf{w}$ is the vector of weights, $b$ is the intercept, and $\alpha$ is the width of the margin of tolerance.
To handle data points outside this margin, slack variables $\xi_{i}$ and $\xi_{i}^{*}$ are introduced, and the problem becomes:

$$\min_{\mathbf{w}, b, \xi, \xi^{*}} \frac{1}{2} \|\mathbf{w}\|^{2} + C \sum_{i=1}^{n} (\xi_{i} + \xi_{i}^{*})$$

subject to:

$$y_{i} - (\mathbf{w} \cdot \mathbf{x}_{i} + b) \leq \alpha + \xi_{i}$$

$$(\mathbf{w} \cdot \mathbf{x}_{i} + b) - y_{i} \leq \alpha + \xi_{i}^{*}$$

$$\xi_{i}, \xi_{i}^{*} \geq 0$$

where $C > 0$ is a regularization parameter that controls the trade-off between the flatness of the function and the tolerance for deviations larger than $\alpha$.
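A minimal sketch with Scikit-Learn's `SVR`, where the `epsilon` argument plays the role of the margin $\alpha$ above and `C` penalizes the slack variables; the data are simulated.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2024)
X, y = rng.normal(size=(150, 10)), rng.normal(size=150)

# epsilon sets the width of the tolerance margin; C weights the slack terms.
svr = SVR(kernel="linear", C=1.0, epsilon=0.1)
svr.fit(X, y)
print(svr.predict(X[:3]))
```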
All models are implemented using the Scikit-Learn and XGBoost packages in Python. The linear models are imported as the `Lasso`, `Ridge`, and `ElasticNet` classes, respectively, while the non-linear models use `RandomForestRegressor` and `XGBRegressor`. All models are implemented with `random_state = 2024`. Cross-validation followed by a grid search is implemented using the `TimeSeriesSplit` and `GridSearchCV` modules from Scikit-Learn.
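The sketch below shows how these pieces can fit together in a single pipeline that also standardizes the features, as described above; the data and the parameter grid are hypothetical and only illustrate the mechanics.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2024)
X, y = rng.normal(size=(120, 15)), rng.normal(size=120)  # hypothetical lagged features and inflation

pipe = Pipeline([
    ("scaler", StandardScaler()),                          # standardize the features
    ("model", RandomForestRegressor(random_state=2024)),   # estimator to be tuned
])
param_grid = {"model__n_estimators": [100, 300], "model__max_depth": [3, 5, None]}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),          # time series cross-validation
    scoring="neg_root_mean_squared_error",   # select hyperparameters by RMSE
)
search.fit(X, y)
print(search.best_params_)
```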
Given a testing sample with $T$ observations, let $y_{t}$ denote the observed inflation rate and $\hat{y}_{t}$ the value predicted by a given model. We can define the forecast error as

$$e_{t} = y_{t} - \hat{y}_{t}.$$

Therefore, the RMSFE is defined by

$$\text{RMSFE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} e_{t}^{2}},$$

and the MAPE is defined as

$$\text{MAPE} = \frac{100}{T} \sum_{t=1}^{T} \left| \frac{e_{t}}{y_{t}} \right|.$$
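A small sketch of how these two metrics can be computed in Python; the forecast values are hypothetical.

```python
import numpy as np

def rmsfe(y_true, y_pred):
    """Root mean squared forecast error."""
    e = np.asarray(y_true) - np.asarray(y_pred)
    return np.sqrt(np.mean(e ** 2))

def mape(y_true, y_pred):
    """Mean absolute percentage error (in percent)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Hypothetical example with three forecasts
print(rmsfe([2.1, 2.4, 2.0], [2.0, 2.5, 2.2]))
print(mape([2.1, 2.4, 2.0], [2.0, 2.5, 2.2]))
```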
This repository is maintained by Esteban Cabrera. We thank QLAB-PUCP for the support provided during this research.