Author: Mohammad Yousha
Predicting Kanoria Chemicals stock price using Long Short-Term Memory and Sentiment Analysis.
Progress:
- Study LSTM and sentiment analysis (SA), and learn their applications.
- Prepare data.
- Build model and make predictions.
- Documentation.
Multi-Variate LSTM model: https://www.kaggle.com/code/amarsharma768/stock-price-prediction-using-lstm/notebook
The news data had over 3 million data points, while the stock data had only a little over 3,500.
Here is what I did in this step:
- Dropped news data that was from before the company's origin.
- Removed data from days when the market was closed or the stock wasn't traded.
- There were multiple news headlines from different papers for each day, including ones useless for this purpose (entertainment, horoscopes, sports, etc.). I kept only the useful headlines and dropped the rest.
- I randomly selected one headline for each day (since there were still multiple), and finally merged the news and stock data into one dataset.
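The preparation steps above can be sketched with pandas roughly as follows. This is a minimal illustration and not the exact code used here: the `category` column, the `keep_categories` set, the `build_dataset` helper, and the use of the first stock record as a stand-in for the company's origin are all assumptions made for the example.

```python
import pandas as pd

def build_dataset(news: pd.DataFrame, stock: pd.DataFrame,
                  keep_categories: set, seed: int = 42) -> pd.DataFrame:
    """Filter headlines, keep one per trading day, and merge with prices."""
    news = news.copy()
    stock = stock.copy()
    news["date"] = pd.to_datetime(news["date"])
    stock["date"] = pd.to_datetime(stock["date"])

    # Drop days with no trading activity (missing prices or zero volume).
    stock = stock.dropna(subset=["close"])
    stock = stock[stock["volume"] > 0]

    # Assumption: the first stock record stands in for the company's origin.
    news = news[news["date"] >= stock["date"].min()]

    # Keep only the headline categories considered useful (hypothetical column).
    news = news[news["category"].isin(keep_categories)]

    # Randomly pick one headline per day; a fixed seed keeps it reproducible.
    news = news.groupby("date").sample(n=1, random_state=seed)

    # The inner join drops headlines from non-trading days.
    merged = news[["date", "headline_text"]].merge(stock, on="date", how="inner")
    return merged.sort_values("date").reset_index(drop=True)

# Tiny made-up example
news = pd.DataFrame({
    "date": ["2007-01-05", "2007-01-08", "2007-01-08", "2007-01-09", "2007-01-13"],
    "category": ["business", "business", "sports", "india", "business"],
    "headline_text": ["a", "b", "c", "d", "e"],
})
stock = pd.DataFrame({
    "date": ["2007-01-08", "2007-01-09", "2007-01-10"],
    "close": [28.67, 28.08, None],
    "volume": [3600.0, 2490.0, 0.0],
})
dataset = build_dataset(news, stock, keep_categories={"business", "india"})
print(dataset)
```

Doing the merge as an inner join on `date` means the "market closed" filter on the news side comes for free, since any headline without a matching trading day is dropped.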
Final dataset sample:
| | date | headline_text | open | high | low | close | adj close | volume |
---|---|---|---|---|---|---|---|---|
0 | 2007-01-08 | ULFA strikes again in Assam; kills nine people | 28.666666 | 28.666666 | 28.666666 | 28.666666 | 16.249998 | 3600.0 |
1 | 2007-01-09 | Marry-and-dump NRIs may face Indian law | 28.100000 | 28.600000 | 28.000000 | 28.083332 | 15.919325 | 2490.0 |
2 | 2007-01-10 | Kalam sets tone for engagement of global Indians | 27.566666 | 29.033333 | 27.333332 | 27.566666 | 15.626451 | 32694.0 |
3 | 2007-01-11 | Plan panel may cut SSA budget | 27.700001 | 28.416666 | 27.666666 | 28.000000 | 15.872088 | 4800.0 |
4 | 2007-01-12 | Bangladesh president resigns as chief advisor | 28.299999 | 28.600000 | 28.116667 | 28.433332 | 16.117727 | 13122.0 |
3754 rows × 8 columns
I found a notebook explaining how to use an LSTM on stock data and modified its code to fit my use case. The original code can be found here.
Here are the changes I made:
- Changed the code to fit four features instead of two.
- Fixed the inverse transform parts.
- Reduced the number of epochs.
- Converted it into a function and made it reproducible.
- Train RMSE: 9.31
- Test RMSE: 4.76
- In the plot, the data before the break around 2019 is the train set, and the data after it is the test set.
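To make the four-feature input and the inverse-transform step more concrete, here is a minimal NumPy-only sketch. It is not the notebook's code: the helper names (`make_windows`, `inverse_target`, `rmse`), the lookback length, and the choice of target column are assumptions for illustration. The point it shows is a common pitfall: when all four features are scaled together, a one-column prediction has to be un-scaled with the target column's own minimum and maximum.

```python
import numpy as np

def make_windows(scaled: np.ndarray, lookback: int, target_col: int):
    """Turn a (time, features) array into LSTM inputs and next-step targets."""
    X, y = [], []
    for i in range(lookback, len(scaled)):
        X.append(scaled[i - lookback:i, :])  # all features for the past `lookback` days
        y.append(scaled[i, target_col])      # next-day value of the target feature
    return np.array(X), np.array(y)

def inverse_target(pred_scaled, data_min, data_max, target_col):
    """Undo min-max scaling for the target column only."""
    span = data_max[target_col] - data_min[target_col]
    return np.asarray(pred_scaled) * span + data_min[target_col]

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(actual) - np.asarray(predicted)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy data: 6 days x 4 features (made-up numbers, not the real dataset)
raw = np.arange(24, dtype=float).reshape(6, 4)
data_min, data_max = raw.min(axis=0), raw.max(axis=0)
scaled = (raw - data_min) / (data_max - data_min)

X, y = make_windows(scaled, lookback=3, target_col=3)
print(X.shape, y.shape)  # input shape is (samples, lookback, features)

# Pretend `y` is the model output and map it back to price units.
restored = inverse_target(y, data_min, data_max, target_col=3)
print(restored, rmse(raw[3:, 3], restored))
```

Computing RMSE on the restored values rather than the scaled ones keeps the error in the same units as the stock price.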
Next 10 days prediction:
The dataset ends on 30 March 2022 because that is where the news headline data stops. Since I had access to the actual stock prices for the days after that, I decided to compare my predictions with the actual prices.
Next 10 days - Predicted vs Actual:
- From these results, it can be concluded that you should not use my model for actual investment.
Since my predictions also have to use the sentiment scores that I computed, I also built a Random Forest Regressor model. I evaluated it with 3-fold cross-validation and tuned it using randomized search.
I have used the features `open`, `close`, `low`, `adj close`, and `volume` of the stock data, and `neg`, `neu`, and `pos` from the sentiment scoring as the independent variables, with `high` as the target variable.
- RMSE: 30.8
- R2 Score: 0.69
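The Random Forest setup described above (eight input features, `high` as the target, 3-fold cross-validation, randomized search) could look roughly like this with scikit-learn. It is a sketch under assumptions, not the tuned model itself: the search space, `n_iter`, the seed, the `tune_random_forest` helper, and the synthetic demo data are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

FEATURES = ["open", "close", "low", "adj close", "volume", "neg", "neu", "pos"]
TARGET = "high"

def tune_random_forest(df: pd.DataFrame, n_iter: int = 5, seed: int = 42):
    """Randomized hyperparameter search with 3-fold cross-validation."""
    # Hypothetical search space; the one actually used is not documented here.
    param_distributions = {
        "n_estimators": [20, 50, 100],
        "max_depth": [None, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    }
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=seed),
        param_distributions=param_distributions,
        n_iter=n_iter,
        cv=3,                                   # 3-fold cross-validation
        scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
        random_state=seed,
    )
    search.fit(df[FEATURES], df[TARGET])
    return search

# Synthetic stand-in for the merged dataset (random numbers, not real prices)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((60, len(FEATURES))), columns=FEATURES)
demo[TARGET] = demo["open"] * 1.05 + rng.normal(0, 0.01, 60)

search = tune_random_forest(demo, n_iter=3)
print(search.best_params_)
print("CV RMSE:", -search.best_score_)
```

Note that `cv=3` uses a plain k-fold split that ignores time order. For stock data, scikit-learn's `TimeSeriesSplit` is an alternative worth considering, since it only validates on data that comes after the training fold.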
Test set's Actual vs Predicted:
- The model's predictions seem to flatline around March-April 2021.
The Random Forest model does not seem to perform as well as the LSTM model. That may make sense, as I read a scientific article stating that LSTM is currently one of the best models for stock prediction.
Still, it is best to try out different methods and find the best for yourself (as long as you have the time, of course).
I have made two tools to predict stock price:
- One that uses time-series data and an LSTM model.
- Another that uses sentiment scores from news headlines combined with the stock data and a Random Forest model.