The goal of this project is to build a sentiment-analysis microservice that takes a new EDGAR file in JSON format and generates a sentiment for each statement in the referenced EDGAR file.
To build this service, we train a sentiment-analysis model on labeled EDGAR datasets and deploy the model behind a Flask app.
- We built an annotation pipeline by ingesting 44 earnings-call files from various companies.
- We pre-processed each file to remove extra whitespace and special characters.
- Each earnings-call file is then represented as a list of sentences.
- We used IBM Watson to label each sentence with a sentiment score.
- We normalized the scores to a scale from -1 (negative) to 1 (positive).
- We wrote the sentences, their scores, and their positive/negative labels to a CSV file, and pushed the data to an S3 bucket.
- We used Airflow to automate the workflow, creating tasks that install the required libraries and run our Python file.
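The pre-processing and normalization steps above can be sketched as follows; the regular expressions and the assumption that raw scores fall in [0, 1] are illustrative, not taken from the repo:

```python
import re

def preprocess(text):
    """Strip special characters, collapse whitespace, and split into sentences."""
    text = re.sub(r"[^A-Za-z0-9.!?'\s]", " ", text)  # drop special characters
    text = re.sub(r"\s+", " ", text).strip()         # collapse runs of whitespace
    # split after sentence-ending punctuation
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def normalize(score, lo=0.0, hi=1.0):
    """Rescale a raw sentiment score from [lo, hi] to [-1, 1]."""
    return 2 * (score - lo) / (hi - lo) - 1

print(preprocess("Hello,   world! Great  results."))  # → ['Hello world!', 'Great results.']
print(normalize(0.5))                                 # → 0.0
```

Sentences that normalize above 0 are labelled positive and the rest negative before the CSV is written.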
- We fetched the labelled CSV from the S3 bucket, as generated by the annotation pipeline.
- We fine-tuned a BERT model on the labelled data to perform sentiment analysis, using it for both training and validation.
- The model reached an accuracy of 92%.
- We saved the model to our BERT folder so it can be loaded by the Flask app.
- We used Airflow to automate the workflow, creating tasks that install the required libraries and run our Python file.
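A minimal sketch of how the Airflow automation described above might look, assuming Airflow 2.x; the DAG id, task ids, paths, and schedule are illustrative, not taken from the repo:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Two tasks: install dependencies, then run the training script.
with DAG(
    dag_id="training_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered manually
    catchup=False,
) as dag:
    install_libraries = BashOperator(
        task_id="install_libraries",
        bash_command="pip install -r /opt/airflow/requirements.txt",
    )
    run_training = BashOperator(
        task_id="run_training",
        bash_command="python /opt/airflow/dags/bert.py",
    )
    install_libraries >> run_training  # install first, then train
```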
Your development and production environments are built with Docker. Install Docker Desktop for your OS. To verify that Docker is installed, run `docker --version`.
In this directory, we have a `Dockerfile`, a blueprint for our development environment, and a `requirements.txt` that lists the Python dependencies.
We used the following `Dockerfile` to create our Docker image. It pulls a TensorFlow base image that satisfies our TensorFlow version:

```dockerfile
ARG BASE_IMG=tensorflow/tensorflow:2.1.0-py3-jupyter
FROM $BASE_IMG
ARG PROJECT_ROOT="."
ARG PROJECT_MOUNT_DIR="/"
ADD $PROJECT_ROOT $PROJECT_MOUNT_DIR
WORKDIR $PROJECT_MOUNT_DIR
RUN pip install --upgrade pip && \
    pip install -r requirements.txt
ENTRYPOINT [ "python" ]
CMD [ "/app.py" ]
```
To serve the provided pre-trained model, follow these steps:

- `git clone` this repo, then `cd assignment_2/microservices/app/`
- `docker build -t assign:latest .` -- this references the `Dockerfile` at `.` (the current directory) to build our Docker image and tags the image with `assign:latest`
- `docker run -it --rm -p 5000:5000 assign` -- this runs a Docker container from the image we built
If everything worked properly, you should now have a container running, which:

- Spins up a Flask server that accepts POST requests at http://0.0.0.0:5000/predict
- Runs our BERT sentiment classifier on the `"data"` field of the request (which should be a list of text strings, e.g. `'{"data": ["this is the best!", "this is the worst!"]}'`)
- Returns a response with the model's prediction (1 = positive sentiment, 0 = negative sentiment)
To test this, you can:

- Write your own POST request (e.g. using Postman or `curl`); here is an example response:

```json
{
  "input": {
    "data": [
      "this is the best!",
      "this is the worst!"
    ]
  },
  "pred": [
    [0.9935178756713867],
    [0.6359626054763794]
  ]
}
```
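The request and response shapes can be handled with the standard library alone; in this sketch the scores are copied from the example response, and actually sending the payload (e.g. with `requests.post` against the running container) is assumed:

```python
import json

# Build the request body for POST http://0.0.0.0:5000/predict
payload = json.dumps({"data": ["this is the best!", "this is the worst!"]})

# Example response body, copied from the sample response above
body = ('{"input": {"data": ["this is the best!", "this is the worst!"]},'
        ' "pred": [[0.9935178756713867], [0.6359626054763794]]}')
response = json.loads(body)

# Each prediction arrives as a one-element list; flatten to per-sentence scores
scores = [p[0] for p in response["pred"]]
print(scores)  # → [0.9935178756713867, 0.6359626054763794]
```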
- We used FastAPI to run the inference-pipeline test.
- Running the `fastapi.py` file fetches the transcript for whichever of the 8 companies we supply as input.
- We pre-process the transcript into a list of sentences to send to our Flask app.
- We then collect the output of our BERT model for every sentence and write each sentence with its prediction to a CSV file.
- We used Airflow to automate the workflow, creating tasks that install the required libraries and run our Python file.
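The final CSV-writing step above can be sketched as follows; the file name and column headers are illustrative, not taken from the repo:

```python
import csv

def write_predictions(sentences, predictions, path="predictions.csv"):
    """Write each sentence alongside its model prediction to a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sentence", "prediction"])  # header row
        for sentence, pred in zip(sentences, predictions):
            writer.writerow([sentence, pred])

# Example: two sentences with their 1 = positive / 0 = negative predictions
write_predictions(["this is the best!", "this is the worst!"], [1, 0])
```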
```
Assignment_2/
├── Annotation_pipeline/
│   └── dags/
│       ├── annotation_pipeline.py
│       └── preprocessing.py
├── Inference_pipeline/
│   ├── app.py
│   ├── CompanyList.csv
│   ├── dags/
│   │   └── inference_pipeline.py
│   ├── fastapi.py
│   ├── inference-data/
│   │   ├── ACFN
│   │   ├── BLFS
│   │   ├── BMMJ
│   │   ├── CELTF
│   │   ├── GHLD
│   │   ├── IRIX
│   │   ├── KGFHF
│   │   └── TME
│   ├── main.py
│   └── requirements.txt
├── Microservices/
│   └── app/
│       ├── __init__.py
│       ├── app.py
│       ├── bert/
│       ├── Dockerfile
│       └── requirements.txt
├── README.md
├── requirements.txt
├── sec-edgar/
│   └── call_transcripts/
└── Training_pipeline/
    └── dags/
        ├── bert.py
        └── ml_pipeline.py
```
- Nidhi Goyal
- Kanika Damodarsingh Negi
- Rishvita Reddy Bhumireddy
- https://github.com/Harvard-IACS/2020-ComputeFest/tree/master/notebook_to_cloud/ml_deploy_demo
- https://github.com/holladileep/CSYE7245-Spring2021-Labs/tree/main/transcript-simulated-api
- https://www.docker.com/sites/default/files/d8/2019-09/docker-cheat-sheet.pdf
- https://www.tensorflow.org/tutorials/text/classify_text_with_bert