Disclosure
The content produced by this application is for informational purposes only; you should not construe any such information or other material as legal, tax, investment, financial, or other advice. Nothing contained in this article, Git repo, or within the output produced by this application constitutes a solicitation, recommendation, endorsement, or offer by any member working on this project, any company they represent, or any third-party service provider to buy or sell any securities or other financial instruments in this or in any other jurisdiction in which such solicitation or offer would be unlawful under the securities laws of such jurisdiction.
The use of the words "opinion" or "recommendation", or any other word with a similar meaning, in this article, within the Technitrade application, or within information produced by the application is for demonstration purposes only, and is not a recommendation to buy or sell any securities or other financial instruments!
This application was created solely to satisfy the requirements of Columbia University FinTech Bootcamp Project #2 Homework, and the results produced by this application may be incorrect.
- Overview
- Application Logic
- Libraries
- Flask API
- SQL Database
- Interface
- Technical Analysis
- Machine Learning Model
- Sentiment Analysis
- Team
Technitrade lets users track a portfolio of stocks, periodically getting News Sentiment, Twitter Sentiment, and a Machine Learning AI Stock Opinion. The machine learning model calculates the "opinion" based on market data and technical analysis, while the investor sentiment is calculated by natural language processing analysis of recent news articles and Tweets.
The user interacts with the program via an Amazon Lex chatbot. The machine learning analysis is performed using an LSTM (Long Short-Term Memory) model, trained on technical analysis indicators. Sentiment analysis is performed by Google Cloud Natural Language, using NewsAPI and the Twitter API as data sources.
Demo Jupyter Notebooks
- Technical Analysis Demo: technicals_demo.ipynb
- Machine Learning Demo: lstm_demo.ipynb
- Sentiment Analysis Demo: nlp_demo.ipynb
Production Code
- Flask API
- Application (Production Machine Learning LSTM model, Sentiment Analysis, etc.)
- Infrastructure
- Docker container
All of the above can be found here: code/api/
- The Lambda file can be viewed here: lambda.py
The following libraries are used:
- Numpy - "The fundamental package for scientific computing with Python".
- Pandas - data analysis and manipulation tool.
- Matplotlib - comprehensive library for creating static, animated, and interactive visualizations in Python.
- boto3 - AWS SDK for Python to create, configure, and manage AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). The SDK provides an object-oriented API as well as low-level access to AWS services.
- psycopg2 - database adapter for the Python programming language.
- Dotenv - python-dotenv reads key-value pairs from a .env file and can set them as environment variables.
- Alpaca Trade API - Internet brokerage and market data connection service.
- NewsAPI - NewsAPI locates articles and breaking news headlines from news sources and blogs across the web and returns them as JSON.
- Twitter API - Twitter API enables programmatic access to Twitter.
- tweepy - An easy-to-use Python library for accessing the Twitter API.
- Scikit-Learn - machine learning library for Python.
- TensorFlow - end-to-end open source platform for machine learning.
- Keras - a Python API used to interact with TensorFlow.
- NLTK - leading platform for building Python programs to work with human language data.
- Google Cloud language_v1 - API that connects to Google Cloud Natural Language.
- Flask - micro web framework written in Python.
- AWS Lex Bot - service for building conversational interfaces into any application using voice and text.
- Twilio - service to programmatically send and receive SMS messages via a Python API.
- Twilio SendGrid - communication platform for transactional and marketing email.
The user interfaces with the application using SMS, enabled by the Twilio service. Twilio connects to the Amazon Lex bot, which handles all of the conversation logic.
Amazon Lex Bot gathers the following user info:
- Name
- n portfolio stock tickers
The user gets the News Sentiment, Twitter Sentiment, and Machine Learning AI Stock Opinion via periodic emails. The first email is received right after the machine learning model finishes training and is fitted with data to predict future stock prices.
The emails are distributed via Twilio's SendGrid service.
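For illustration, a minimal sketch of sending such an email through SendGrid's Python library. The addresses, subject, and API-key environment variable name are assumptions, not the project's actual values:

```python
import os
from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail

# Hypothetical email dispatch; addresses and env var name are assumptions.
message = Mail(
    from_email='bot@technitrade.example',
    to_emails='user@example.com',
    subject='Your Technitrade portfolio update',
    html_content='<p>Machine learning opinion: buy KO.</p>')

sg = SendGridAPIClient(os.environ.get('SENDGRID_API_KEY'))
response = sg.send(message)
print(response.status_code)  # 202 on success
```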
The resulting email looks something like this:
A Flask API was built to handle all tasks between the following components:
- Amazon Lex Bot via Lambda
- Data sources: Market Data Connection (see [code/marketdata/] folder), NewsAPI, Twitter API
- Technical Analysis module: technicals.py
- Machine Learning module: lstm_model.py
- Sentiment Analysis service
- Amazon RDS PostgreSQL server
All events are triggered by AWS CloudWatch. AWS Lambda functions handle all of the production Python code.
- Flask API services can be found here: Project2API
- Project Application code can be found here: Project2Application
- Project Infrastructure code can be found here: Project2Infrastructure
The steps by which the Flask API executes the application workflow are outlined in the table below.
| Step | Objective | Action | Trigger |
|---|---|---|---|
| 1 | User Data | User & Portfolio Creation | Amazon Lex |
| 2 | Model - Training | Trigger the API to run the training | Lambda / CloudWatch |
| 3 | Model - Training | Save the model in Amazon S3 | API |
| 4 | Model - Forecast | Forecast the tickers | Lambda / CloudWatch / API |
| 5 | User Data | Update the user portfolio | Lambda / CloudWatch / API |
| 6 | User Data | Send email to the users | Lambda / CloudWatch / API |
A PostgreSQL database hosted on Amazon RDS is utilized to store all the user data and machine learning models.
All database code can be viewed here: code/src/
Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale a relational database in the cloud. It provides cost-efficient and resizable capacity while automating time-consuming administration tasks such as hardware provisioning, database setup, patching and backups.
PostgreSQL is a powerful, open source object-relational database system with over 30 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.
psycopg2 was used to interface Python with the PostgreSQL database. pgAdmin was used for testing and debugging.
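As an illustration, a minimal psycopg2 connection sketch. The environment-variable names are assumptions; the real credentials live in the project's .env file:

```python
import os
import psycopg2
from dotenv import load_dotenv

load_dotenv()  # read credentials from the .env file

# Hypothetical connection; the env var names are assumptions.
conn = psycopg2.connect(
    host=os.getenv("DB_HOST"),
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)

with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()
```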
Technical analysis is performed via the technicals module. A demonstration of the module can be seen in technicals_demo.ipynb.
RSI is a momentum indicator which measures the magnitude of recent price changes to evaluate overbought or oversold conditions in the price of a stock. [Investopedia]
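RSI is computed inside the technicals module; purely as an illustration, a minimal pandas sketch of the standard 14-day RSI (simple-average variant) might look like this:

```python
import pandas as pd

def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    # Hypothetical sketch; the production implementation lives in technicals.py.
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()   # average gains
    loss = -delta.clip(upper=0).rolling(window).mean()  # average losses
    rs = gain / loss                                    # relative strength
    return 100 - 100 / (1 + rs)
```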
Williams %R is a momentum indicator which measures overbought and oversold levels. It has a domain between 0 and -100. The Williams %R may be used to find entry and exit points in the market. [Investopedia]
$$\%R = \frac{\text{Highest High} - \text{Close}}{\text{Highest High} - \text{Lowest Low}} \times (-100)$$
where:
Highest High = Highest price in the lookback period.
Close = Most recent closing price.
Lowest Low = Lowest price in the lookback period.
The money flow index (MFI) is an oscillator that ranges from 0 to 100. It is used to show the money flow (an approximation of the dollar value of a day's trading) over several days. [Wikipedia]
$$MFI = 100 - \frac{100}{1 + \dfrac{\text{Positive Money Flow}}{\text{Negative Money Flow}}}$$

- The money flow is divided into positive and negative money flow.
- Positive money flow is calculated by adding the money flow of all the days where the typical price is higher than the previous day's typical price.
- Negative money flow is calculated by adding the money flow of all the days where the typical price is lower than the previous day's typical price.
- If the typical price is unchanged, then that day is discarded.
The stochastic oscillator is a momentum indicator comparing a particular closing price of a security to a range of its prices over a certain period of time. The sensitivity of the oscillator to market movements is reducible by adjusting that time period or by taking a moving average of the result. It is used to generate overbought and oversold trading signals, utilizing a 0–100 bounded range of values. [Investopedia]
$$\%K = \left(\frac{C - L_n}{H_n - L_n}\right) \times 100$$
where:
C = the most recent closing price
L_n = the lowest price traded of the n previous trading sessions
H_n = the highest price traded during the same n-day period
%K = the current value of the stochastic indicator
MACD is a trend-following momentum indicator that shows the relationship between two moving averages of a security’s price. The MACD is calculated by subtracting the 26-period exponential moving average (EMA) from the 12-period EMA. [Investopedia]
$$MACD = EMA_{12} - EMA_{26}$$
An exponential moving average is a moving average that places greater weight on the most recent data points and less on older ones. In finance, the EMA reacts more significantly to recent price changes than a simple moving average (SMA), which applies an equal weight to all observations in the period. In statistics, a moving average is a calculation used to analyze data points by creating a series of averages of different subsets of the full data set.
The moving average is a calculation used to smooth data and in finance used as a stock indicator. [Investopedia]
The exponential moving average is a type of moving average that gives more weight to recent prices in an attempt to make it more responsive to new information. [Investopedia]
$$EMA_t = V_t \cdot \frac{s}{1+d} + EMA_y \cdot \left(1 - \frac{s}{1+d}\right)$$
where:
EMA_t = EMA today
EMA_y = EMA yesterday
V_t = value today
s = smoothing (commonly 2)
d = number of days
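For reference, pandas implements the same recursion: with the common choice s = 2, the decay factor s / (1 + d) corresponds to span=d. A minimal sketch with illustrative values:

```python
import pandas as pd

# Hypothetical sketch: 12-day EMA of a closing-price series.
close = pd.Series([44.3, 44.1, 44.9, 45.2, 45.6, 45.4, 46.0, 46.2])
ema_12 = close.ewm(span=12, adjust=False).mean()  # alpha = 2 / (12 + 1)
```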
The high-low and close-open indicators are the differences between the high and low prices of the day and the close and open prices of the day, respectively.
A Bollinger Band® is a technical analysis tool defined by a set of trendlines plotted two standard deviations (positively and negatively) away from a simple moving average (SMA) of a security's price. Bollinger Bands® were developed and copyrighted by famous technical trader John Bollinger, designed to discover opportunities that give investors a higher probability of properly identifying when an asset is oversold or overbought. [Bollinger Bands],[Investopedia]
$$BOLU = SMA(TP, n) + m\,\sigma[TP, n]$$
$$BOLD = SMA(TP, n) - m\,\sigma[TP, n]$$
where:
BOLU = upper Bollinger Band
BOLD = lower Bollinger Band
TP (typical price) = (High + Low + Close) / 3
σ = standard deviation over the smoothing period
m = number of standard deviations
n = number of days in the smoothing period
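As an illustration (not the project's technicals implementation), a minimal pandas sketch of the bands following the equations above:

```python
import pandas as pd

def bollinger_bands(high: pd.Series, low: pd.Series, close: pd.Series,
                    n: int = 20, m: int = 2):
    # Hypothetical sketch following the Bollinger Band equations above.
    tp = (high + low + close) / 3            # typical price
    sma = tp.rolling(n).mean()               # n-day simple moving average
    sigma = tp.rolling(n).std()              # n-day standard deviation
    return sma + m * sigma, sma - m * sigma  # upper band, lower band
```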
An LSTM (Long Short-Term Memory) model built with TensorFlow and Keras is used. An example of the machine learning model code is provided in the lstm_demo.ipynb notebook.
This application utilizes an LSTM (Long Short-Term Memory) machine learning model. The LSTM model was developed by Sepp Hochreiter and Jürgen Schmidhuber and published in Neural Computation in 1997 [Hochreiter 1997]. A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell [Wikipedia].
TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets developers easily build and deploy ML powered applications.
Keras is an open-source software library that provides a Python interface for artificial neural networks. Keras acts as an interface for the TensorFlow library. Keras allows for easy implementation of TensorFlow methods without the need to build out complex machine learning infrastructure.
Data is acquired from the Alpaca Trade API and processed using the technicals module. The resulting DataFrame contains the Closing price and all of the technical indicators.
The market data is obtained by calling the ohlcv() method within the alpaca module. The method takes a list of tickers, as well as the start_date and end_date, and returns a pd.DataFrame.
from datetime import datetime, timedelta

today = datetime.now()  # keep a datetime object so date arithmetic works
start_date = (today - timedelta(days=1000)).strftime('%Y-%m-%d')
end_date = today.strftime('%Y-%m-%d')
ohlcv_df = alpaca.ohlcv(['tickers'], start_date=start_date, end_date=end_date)
The TechnicalAnalysis class must first be instantiated with the pd.DataFrame containing market data.
tech_ind = technicals.TechnicalAnalysis(ohlcv_df)
tech_ind_df = tech_ind.get_all_technicals('ticker')
The LSTM model is contained within the MachineLearningModel class located in the lstm_model module. The class must first be instantiated with a pd.DataFrame containing the technical analysis data.
my_model = lstm_model.MachineLearningModel(tech_ind_df)
Building and fitting the model is done by calling the build_model() class method.
hist = my_model.build_model()
The model is then saved as an .h5 file.
my_model.save_model('model.h5')
The MachineLearningModel class handles all machine learning methods. The build_model() class method builds and fits the model, implementing the following methodology:
The LSTM model is programmed to look back 100 days to predict 14 days. The number of features is set by the shape of the DataFrame.
n_steps_in = 100
n_steps_out = 14
n_features = tech_ind_df.shape[1]
A RobustScaler is used to scale the technical analysis data [ScikitLearn].
sklearn.preprocessing.RobustScaler()
Scale features using statistics that are robust to outliers.
This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile). Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform method.
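A minimal usage sketch, reusing the tech_ind_df variable from the surrounding examples:

```python
from sklearn.preprocessing import RobustScaler

# fit the scaler on the technical-analysis features and transform them
scaler = RobustScaler()
scaled_data = scaler.fit_transform(tech_ind_df.to_numpy())
```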
The DataFrame is then parsed to an np.array and split into X and y subsets.
X, y = split_sequence(tech_ind_df.to_numpy(), n_steps_in, n_steps_out)
where split_sequence() is a helper method that splits the multivariate time sequences, as sketched below.
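The helper is not reproduced in this section; a minimal sketch of such a splitter, assuming the target (closing price) sits in column 0 of the array, could look like this:

```python
import numpy as np

def split_sequence(sequences, n_steps_in, n_steps_out):
    # Hypothetical sketch; assumes the target (closing price) is column 0.
    X, y = [], []
    for i in range(len(sequences)):
        end_ix = i + n_steps_in            # end of the input window
        out_end_ix = end_ix + n_steps_out  # end of the forecast window
        if out_end_ix > len(sequences):    # stop once the window overruns the data
            break
        X.append(sequences[i:end_ix, :])           # all features as input
        y.append(sequences[end_ix:out_end_ix, 0])  # target column as output
    return np.array(X), np.array(y)
```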
A Sequential() model is utilized, as it groups a linear stack of layers into a tf.keras.Model [TensorFlow].
model = tf.keras.Sequential()
A hyperbolic tangent activation function (tanh) is used [TensorFlow].
activation_function = tf.keras.activations.tanh
Input and hidden layers
LSTM input and hidden layers are utilized. [TensorFlow]
The input layer contains 60 nodes, while the hidden layers contain 30 nodes by default; this can be set by the administrator to an arbitrary amount via the n_nodes variable. The number of hidden layers defaults to 1 but can also be modified by the administrator.
Hidden layers are added with an add_hidden_layers() helper function (a sketch follows the code below).
n_nodes = 30
# input layer
model.add(LSTM(60,
activation=activation_function,
return_sequences=True,
input_shape=(n_steps_in, n_features)))
# hidden layers ...
model.add(LSTM(n_nodes, activation=activation_function, return_sequences=True))
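The helper itself is not reproduced here; a minimal sketch of what add_hidden_layers() might do, given the defaults described above (the signature and loop are assumptions):

```python
from tensorflow.keras.layers import LSTM

def add_hidden_layers(model, n_layers=1, n_nodes=30):
    # Hypothetical sketch: append n_layers LSTM layers of n_nodes each,
    # reusing the tanh activation_function defined earlier.
    for _ in range(n_layers):
        model.add(LSTM(n_nodes,
                       activation=activation_function,
                       return_sequences=True))
```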
Two dense layers are used in the model. Dense layers are added using the add_dense_layers class method.
model.add(Dense(30))
The model uses the Adam optimizer (short for Adaptive Moment Estimation) [TensorFlow]. Adam is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments. The Adam optimizer was developed by Diederik Kingma and Jimmy Ba and published in 2014 [Kingma et al. 2014]. Adam is defined by its creators as "an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments."
optimizer = tf.keras.optimizers.Adam
The model uses the Mean Squared Error loss function, which computes the mean of squares of errors between labels and predictions [TensorFlow].
loss = tf.keras.losses.MeanSquaredError
The model is trained for 16 epochs using a 128-unit batch size. The validation split is 0.1.
The model is then compiled and fit.
model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
hist = model.fit(X, y, epochs=16, batch_size=128, validation_split=0.1)
An example of model training results, conducted with The Coca-Cola Company stock: KO.
Predictions are calculated with a validator() helper method.
To forecast stock prices using the saved model, the application uses the ForecastPrice class located within the lstm_model module.
The module pre-processes the data using the aforementioned methods and then utilizes the model.predict() TensorFlow method.
The application accomplishes this by:

- Getting stock prices for the past 200 days using the alpaca module
- Getting technical indicators using the get_all_technicals() method within the technicals.TechnicalAnalysis class
- Instantiating the ForecastPrice class with the technical data:

forecast_model = lstm_model.ForecastPrice(tech_ind_df)

- Calling the forecast() method within the ForecastPrice class:
forecast = forecast_model.forecast()
The ForecastPrice class handles all of the forecasting functions. The forecast() class method implements the following methodology:
- Loading the model using the load_model Keras method:
from tensorflow.keras.models import load_model
forecast_model = load_model("model.h5")
- Pre-processing the data following the same methodology as the MachineLearningModel class.
- Predicting the prices:
forecasted_price = forecast_model.predict(tech_ind_df)
- Inverse scaling the prices:
forecasted_price = scaler.inverse_transform(forecasted_price)[0]
If the predicted price 14 days from now is higher than the current price, the application will issue a buy "opinion"; if the price is lower than the current price, it will issue a sell "opinion" on the date of the highest predicted price.
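A minimal sketch of that decision rule (the function and variable names are illustrative, not the project's actual code):

```python
import numpy as np

def opinion(current_price: float, forecasted_price: np.ndarray) -> dict:
    # Hypothetical sketch of the buy/sell "opinion" rule described above.
    if forecasted_price[-1] > current_price:
        return {'opinion': 'buy'}
    # otherwise, sell on the day of the highest predicted price
    return {'opinion': 'sell', 'day': int(np.argmax(forecasted_price))}
```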
Sentiment analysis is performed using the Google Cloud Natural Language service.
The data utilized in sentiment analysis is obtained from two sources: NewsAPI and the Twitter API (via Tweepy).
Implementation of NewsAPI and Tweepy can be found in the demo notebook: nlp_demo.ipynb
The sentiment analysis implementation:
import os
from google.cloud import language_v1

def GetSentimentAnalysisGoogle(text_content):
    # point the client at the service-account credentials file
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = '../your_credentials_file.json'
    client = language_v1.LanguageServiceClient()

    # build a plain-text document from the input
    type_ = language_v1.Document.Type.PLAIN_TEXT
    document = {'content': text_content, 'type_': type_}
    encoding_type = language_v1.EncodingType.UTF8

    # request document-level sentiment
    response = client.analyze_sentiment(request={'document': document,
                                                 'encoding_type': encoding_type})
    return {'score': response.document_sentiment.score,
            'magnitude': response.document_sentiment.magnitude}
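A hypothetical usage example (the input text and output values are illustrative only):

```python
result = GetSentimentAnalysisGoogle("Shares rallied after a strong earnings report.")
print(result)  # e.g. {'score': 0.8, 'magnitude': 0.8}
```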