Machine Learning Stock Prediction/Forecasting in Python Using LSTM and SVR
About:
- This project aims to develop a high-accuracy model for predicting stock prices and to support statistically informed stock trading decisions.
- Model predictions provide insight into which companies' stocks are worth investing in, so that the risk of losing money can be reduced and individuals or companies can make more informed investment decisions.
Dataset:
- Two datasets are selected from Kaggle: Johnson & Johnson (JNJ) and Exxon Mobil Corporation (XOM) [https://www.kaggle.com/datasets/rprkh15/sp500-stock-prices?select=MSI.csv]
- Each dataset consists of 8 columns (date, open, high, low, close, volume, dividends and stock splits) and 15236 rows, covering the date range 1962-01-02 to 2022-07-12.
- Data Features:
- Date: The date, in the format yyyy-mm-dd
- Open: Price of the stock when the market opens
- High: Highest price reached in the day
- Low: Lowest price reached in the day
- Close: Price of the stock when the market closes
- Volume: Number of shares traded in a day
- Dividends: The dividends of the stock
- Stock Splits: The stock splits of the company. In a stock split, a company divides its existing stock into multiple shares to boost liquidity.
Data Preparation:
- EDA
- Average stock price movement for JNJ (Johnson & Johnson) and XOM (Exxon Mobil Corp) over a roughly 60-year period from 1962 to 2022. To observe the overall trend of each stock, we first plotted a line chart of the average column for each. Both stock prices have been rising since 1990. JNJ's average price was initially higher than XOM's, but this changed after XOM's abrupt, dramatic growth between 2000 and 2010. JNJ's price began to rise sharply after hitting a low point in the late 1990s. The volume of both stocks, by contrast, is extremely erratic and shows no discernible pattern, so no useful insights could be drawn from the volume chart.
- 10-day, 100-day, and 365-day moving averages of each stock were computed to identify trends.
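The moving averages mentioned here can be illustrated with a minimal, dependency-free sketch. The function name `moving_average` and the toy prices are hypothetical and not taken from the project; the project itself may well have used pandas, where `Series.rolling(window).mean()` gives the same result.

```python
from typing import List, Optional


def moving_average(prices: List[float], window: int) -> List[Optional[float]]:
    """Simple moving average; positions without a full window are None."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages: List[Optional[float]] = []
    running_sum = 0.0
    for i, price in enumerate(prices):
        running_sum += price
        if i >= window:
            # Drop the value that has just left the window.
            running_sum -= prices[i - window]
        averages.append(running_sum / window if i >= window - 1 else None)
    return averages


# Toy prices (illustrative values, not taken from the JNJ or XOM data).
toy_prices = [10.0, 11.0, 12.0, 13.0, 14.0]
ma_3 = moving_average(toy_prices, window=3)
print(ma_3)
```

For the windows listed above, the same function would be called with `window=10`, `window=100` and `window=365` on the average price series.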
- Based on the last four historical data points, JNJ shows the lower risk (1.2%) together with the higher expected return (about 0.048%). In contrast, XOM shows a higher risk (1.8%) but a lower expected return (about 0.041%). With lower risk and higher return, JNJ is the more attractive of the two (i.e., worth investing in).
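The text does not state how risk and return were computed. A common convention, assumed here purely for illustration, is to take the mean of the daily percentage returns as the expected return and their sample standard deviation as the risk. The function names and toy prices below are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Tuple


def daily_returns(prices: List[float]) -> List[float]:
    """Fractional change from each day to the next."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


def risk_and_return(prices: List[float]) -> Tuple[float, float]:
    """Return (risk, expected_return) under the assumed convention:
    risk = sample standard deviation of daily returns,
    expected return = mean of daily returns."""
    returns = daily_returns(prices)
    return stdev(returns), mean(returns)


# Toy prices (illustrative values, not taken from the JNJ or XOM data).
toy_prices = [100.0, 110.0, 99.0, 108.9]
risk, expected_return = risk_and_return(toy_prices)
print(f"risk={risk:.4f}, expected return={expected_return:.4f}")
```

Under this convention, a stock with a smaller `risk` and a larger `expected_return` is the preferable one, which is the comparison made between JNJ and XOM.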
- The association between XOM and JNJ was evaluated on a heatmap, which shows a very high correlation of 0.88. In other words, the two stock prices tend to move together over this period; this co-movement alone does not establish that the two companies have a similar business nature.
- Check missing values to ensure data quality and accuracy.
- Feature Selection (Using Pearson Correlation)
- Previous research shows that, for trend detection of stock prices, the Open, High, Low and Close (OHLC) levels have high predictive potential and are easier to predict than the traditional Close price.
- Pearson's correlation coefficients for the JNJ and XOM datasets show that, at a threshold of 0.9, the open, high, low and close attributes are highly correlated with each other. These four columns are therefore combined into a single target, their average, in the data transformation step. The remaining variables (volume, stock splits and dividends) are dropped from the dataset because they are not highly correlated with the target.
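The selection step above can be sketched as follows. This is a minimal illustration rather than the project's code: the helper names, the choice of `Close` as the reference column, and the toy values are all assumptions. It computes Pearson's r against the reference column, keeps the columns with |r| of at least 0.9, and averages the kept columns row by row.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def select_features(columns: Dict[str, List[float]], reference: str,
                    threshold: float = 0.9) -> List[str]:
    """Keep columns whose |r| with the reference column meets the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, columns[reference])) >= threshold]


# Toy columns (illustrative values, not taken from the JNJ or XOM data).
toy_columns = {
    "Open": [10.0, 11.0, 12.0, 13.0],
    "High": [10.5, 11.6, 12.4, 13.5],
    "Low": [9.5, 10.4, 11.6, 12.5],
    "Close": [10.2, 11.1, 12.2, 13.1],
    "Volume": [500.0, 100.0, 400.0, 200.0],
}
selected = select_features(toy_columns, reference="Close")

# Row-wise average of the selected columns, used as the single target.
n_rows = len(toy_columns["Close"])
average = [sum(toy_columns[name][i] for name in selected) / len(selected)
           for i in range(n_rows)]
print(selected, average)
```

On the toy data the four price columns pass the threshold and `Volume` does not, which mirrors the outcome described for the real datasets.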
- Feature Scaling
- Each variable is scaled to a value between 0 and 1 to achieve higher precision. Precision improves because the scaled values are no longer spread over a wide range.
- Computational cost and memory consumption may also decrease when large values are scaled down.
- Before scaling:
- After scaling:
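Scaling to the range 0 to 1 is usually done with min-max scaling, sketched below in plain Python. The function names and toy values are hypothetical; scikit-learn's `MinMaxScaler` provides the same transform and may be what the project used.

```python
from typing import List, Tuple


def fit_min_max(values: List[float]) -> Tuple[float, float]:
    """Record the minimum and maximum that define the scaling."""
    return min(values), max(values)


def scale(values: List[float], lo: float, hi: float) -> List[float]:
    """Map values so that lo becomes 0 and hi becomes 1."""
    span = hi - lo
    if span == 0:
        raise ValueError("cannot scale a constant series")
    return [(v - lo) / span for v in values]


def inverse_scale(scaled: List[float], lo: float, hi: float) -> List[float]:
    """Map scaled values back to the original price range."""
    return [s * (hi - lo) + lo for s in scaled]


# Toy prices (illustrative values, not taken from the JNJ or XOM data).
toy_prices = [20.0, 30.0, 40.0, 60.0]
lo, hi = fit_min_max(toy_prices)
scaled = scale(toy_prices, lo, hi)
restored = inverse_scale(scaled, lo, hi)
print(scaled, restored)
```

The inverse transform matters when reporting errors such as RMSE in price units, since model outputs are produced on the scaled range.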
- Model Splitting
- Before performing the train-test split, copies of the JNJ and XOM datasets are made so that the full dataset is retained alongside a partial dataset containing only the stock prices of the latest 4 years. This results in 2 datasets per company: 1 full and 1 partial.
- The last 30 days of data are held out from each dataset so that they can be treated as extrapolation data, against which the extrapolated forecasts can be validated.
- The 4 datasets are then split into train and test sets. Because the data is a time series, order matters: historical data is used to forecast future data. The train and test sets are therefore not split randomly but by sequence, with the first 80% of the data forming the train set and the last 20% forming the test set.
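The hold-out and chronological split described above can be sketched as follows. The function name `holdout_and_split` is hypothetical, and integer row indices stand in for the actual price rows.

```python
from typing import List, Sequence, Tuple


def holdout_and_split(rows: Sequence, holdout: int = 30,
                      train_fraction: float = 0.8) -> Tuple[List, List, List]:
    """Set aside the last `holdout` rows, then split the rest in time order."""
    if not 0 <= holdout < len(rows):
        raise ValueError("holdout must be smaller than the number of rows")
    n_model = len(rows) - holdout
    model_rows = list(rows[:n_model])   # used for training and testing
    held_out = list(rows[n_model:])     # reserved for extrapolation checks
    cut = int(n_model * train_fraction)
    # No shuffling: every training row precedes every test row.
    return model_rows[:cut], model_rows[cut:], held_out


# Toy data: 130 consecutive "days" identified by their index.
toy_rows = list(range(130))
train, test, held_out = holdout_and_split(toy_rows)
print(len(train), len(test), len(held_out))
```

Keeping the split sequential avoids training on observations that come after the ones being predicted.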
Predictive Models:
- Stacked LSTM:
- 4 layers, 100 hidden neurons, and the Rectified Linear Unit (ReLU) activation function.
- Adam optimizer
- The model is trained for 100 epochs with a batch size of 64.
- This model handles sequential data: it can retain and take into account previous inputs and outputs when making a prediction.
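A possible shape for such a model is sketched below with Keras. Several details are assumptions, since the text does not give them: "4 layers, 100 hidden neurons" is read as 4 LSTM layers of 100 units each, the output is a single `Dense(1)` value, the loss is mean squared error, and the input is a sliding window of past values (the window length is hypothetical). TensorFlow is imported inside the builder, so the windowing helper runs without it.

```python
from typing import List, Sequence, Tuple


def make_windows(series: Sequence[float],
                 window: int) -> Tuple[List[List[float]], List[float]]:
    """Turn a series into (previous `window` values -> next value) samples."""
    n_samples = len(series) - window
    inputs = [list(series[i:i + window]) for i in range(n_samples)]
    targets = [series[i + window] for i in range(n_samples)]
    return inputs, targets


def build_stacked_lstm(window: int, units: int = 100, n_layers: int = 4):
    """Stacked LSTM with ReLU activations, compiled with the Adam optimizer."""
    from tensorflow import keras  # lazy import; requires TensorFlow

    model = keras.Sequential()
    model.add(keras.Input(shape=(window, 1)))
    for layer_index in range(n_layers):
        # Every LSTM layer except the last passes its full sequence onward.
        model.add(keras.layers.LSTM(
            units,
            activation="relu",
            return_sequences=layer_index < n_layers - 1,
        ))
    model.add(keras.layers.Dense(1))  # assumed single-value output
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model


# Toy scaled series (illustrative values only).
toy_inputs, toy_targets = make_windows([0.1, 0.2, 0.3, 0.4, 0.5], window=3)
print(toy_inputs, toy_targets)
```

Training with the settings listed above would then be `model.fit(x, y, epochs=100, batch_size=64)`, with `x` reshaped to `(samples, window, 1)`.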
- SVR:
- The best parameter values are found using GridSearchCV.
- C: XOM full = 5, XOM partial = 1000, JNJ = 20
- Gamma: 0.001 for all datasets, except the JNJ partial dataset, which uses 0.01.
- Kernel: XOM = linear, JNJ = RBF
- Degree = 2 (in scikit-learn's SVR, this parameter only affects the polynomial kernel)
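A grid search of this kind might look like the sketch below. The grid is an assumption: it contains the values reported above, but the text does not list the full set of candidates that were searched. The use of `TimeSeriesSplit` for cross-validation and RMSE as the scoring metric are also assumptions, chosen because the data is a time series. scikit-learn is imported inside the function.

```python
from typing import Any, Dict, List

# Candidate values include those reported above; the grid itself is assumed.
PARAM_GRID: List[Dict[str, Any]] = [
    {"kernel": ["linear"], "C": [1, 5, 20, 1000]},
    {"kernel": ["rbf"], "C": [1, 5, 20, 1000], "gamma": [0.001, 0.01]},
]


def tune_svr(features, targets, n_splits: int = 5):
    """Search PARAM_GRID for the SVR settings with the lowest RMSE."""
    from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
    from sklearn.svm import SVR

    search = GridSearchCV(
        SVR(),
        PARAM_GRID,
        scoring="neg_root_mean_squared_error",
        # Each fold validates on data that comes after its training data.
        cv=TimeSeriesSplit(n_splits=n_splits),
    )
    search.fit(features, targets)
    return search.best_params_, search.best_estimator_
```

`gamma` appears only in the RBF part of the grid because the linear kernel does not use it.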
Model Evaluation:
- The models are evaluated using Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE).
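Both metrics follow their standard definitions: RMSE is the square root of the mean squared error, and MAPE is the mean absolute error relative to the actual value, expressed as a percentage. A minimal sketch, with hypothetical function names and toy values:

```python
from math import sqrt
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error, in the same units as the data."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error; actual values must be non-zero."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Toy values (illustrative only, not model output).
toy_actual = [100.0, 200.0, 400.0]
toy_predicted = [110.0, 190.0, 400.0]
toy_rmse = rmse(toy_actual, toy_predicted)
toy_mape = mape(toy_actual, toy_predicted)
print(f"RMSE={toy_rmse:.4f}, MAPE={toy_mape:.2f}%")
```

RMSE depends on the price scale, while MAPE is scale-free, which makes MAPE the easier of the two to compare across the full and partial datasets.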
Advanced Analysis
- Prediction for the next 30 days using LSTM.
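The text does not say how the 30-day forecast is produced. One common approach, assumed here for illustration only, is recursive one-step-ahead forecasting: predict the next value from the latest window, append the prediction to the history, and repeat. In the sketch below, `predict_next` is a hypothetical stand-in for a trained model's one-step prediction.

```python
from typing import Callable, List, Sequence


def forecast_recursively(history: Sequence[float],
                         predict_next: Callable[[List[float]], float],
                         window: int, horizon: int = 30) -> List[float]:
    """Forecast `horizon` steps by feeding each prediction back as input."""
    if len(history) < window:
        raise ValueError("history is shorter than the window")
    values = list(history)
    forecasts: List[float] = []
    for _ in range(horizon):
        next_value = predict_next(values[-window:])
        forecasts.append(next_value)
        values.append(next_value)  # the forecast becomes part of the input
    return forecasts


def mean_of_window(window_values: List[float]) -> float:
    """Stand-in predictor used only to make the example runnable."""
    return sum(window_values) / len(window_values)


toy_forecast = forecast_recursively([1.0, 2.0, 3.0], mean_of_window,
                                    window=2, horizon=3)
print(toy_forecast)
```

With this approach errors can accumulate over the horizon, because later predictions are based on earlier predictions rather than observed prices.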
Results:
JNJ Dataset
- Stacked LSTM, full JNJ dataset: RMSE of 0.2167 (training set) and 2.3747 (testing set); MAPE of 1.7414% (training set) and 1.2100% (testing set).
- SVR, full JNJ dataset: RMSE of 14.3198 (training set) and 19.8299 (testing set); MAPE of 66.8251% (training set) and 27.4985% (testing set).
- Stacked LSTM, partial JNJ dataset: RMSE of 1.6375 (training set) and 1.8993 (testing set); MAPE of 0.8517% (training set) and 0.8849% (testing set).
- SVR, partial JNJ dataset: RMSE of 3.4937 (training set) and 3.7485 (testing set); MAPE of 2.0837% (training set) and 1.7861% (testing set).
XOM Dataset
- The results of the two models show that both perform better when only the partial datasets are used than when the full datasets are used for forecasting. Thus, only the results obtained on the partial datasets are used to compare the models and to determine the best model for each dataset.
Plots:
- Actual and forecasted values by Stacked LSTM model for the partial JNJ dataset
- Actual and forecasted values by SVR model for the partial JNJ dataset
- Actual and forecasted values by Stacked LSTM model for the partial XOM dataset
- Actual and forecasted values by SVR model for the partial XOM dataset
- Actual and forecasted 30 days of extrapolated data by Stacked LSTM model for the partial JNJ dataset
- Actual and forecasted 30 days of extrapolated data by Stacked LSTM model for the partial XOM dataset
Conclusion:
- The Stacked LSTM model performs better and is more suitable for stock prediction than the SVR model.
- Scaling was very useful for improving accuracy for both models.
- In terms of accuracy, the Stacked LSTM produced smaller RMSE and MAPE values than the SVR model on both the partial and full datasets. Hence, the Stacked LSTM is the better estimator of the two, as it has the smaller error.
- For the forecasting of the extrapolated 30 days, the performance of the Stacked LSTM model is better.