Ankitasareen/My proposal for WoC DSC IEM project-TextSentimentAnalysis #6

66 changes: 66 additions & 0 deletions DSC IEM/Ankita_Sareen- TextSentiment_Analysis
@@ -0,0 +1,66 @@


# Developer - Winter of Code 2020

# Ankita Sareen

# Organisation Name - DSC IEM - TextSentimentAnalysis
* **Project:** [TextSentimentAnalysis](https://github.com/khanfarhan10/TextSentimentAnalysis)

# Overview
Text Sentiment Analysis in Python using Natural Language Processing (NLP) for negative/positive/neutral sentiment detection.
My work this winter improved the accuracy of the model and added new features that make it more useful to the community in everyday use.

I am Ankita Sareen, an undergraduate student at NIT Rourkela, and I worked on this project as part of Winter of Code.
The project was mentored by Farhan Hai Khan and Tannistha Pal, who both provided constant guidance.

## Contributions
* ### Selection of proper Dataset
Getting data that suits our requirements is important for training models. I therefore picked the well-known "Amazon Fine Food Reviews" dataset from Kaggle.

* ### Data preprocessing
I dropped the rows where the score equals 3, because neutral reviews add little value to a binary positive/negative prediction. Next, I created a column called positivity, in which any score above 3 is encoded as 1 and any other score as 0. Other applications might call for further techniques to reduce memory usage; here I simply dropped the columns that are not required and consume a lot of memory.
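The preprocessing described above could be sketched roughly as follows. This is a minimal illustration rather than the actual training script: the tiny DataFrame is made up, and the column names `Score`, `Text` and `ProductId` are assumed to match the Kaggle CSV.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Kaggle CSV; the column names
# "ProductId", "Score" and "Text" are assumptions about the dataset.
df = pd.DataFrame({
    "ProductId": ["A", "B", "C", "D"],
    "Score": [5, 3, 1, 4],
    "Text": ["Great taste", "It is okay", "Stale and bland", "Would buy again"],
})

# Drop neutral reviews (score exactly 3).
df = df[df["Score"] != 3].copy()

# Encode scores above 3 as 1 (positive) and everything else as 0.
df["Positivity"] = (df["Score"] > 3).astype(int)

# Keep only the columns needed for training to save memory.
df = df[["Text", "Positivity"]].reset_index(drop=True)
```

Only the review text and the binary label survive this step, which keeps the frame small before vectorisation.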

* ### Featurisation, EDA and Tf-idf
Using the seaborn and wordcloud libraries, I analysed the prediction column and the most frequent words for each sentiment.
* Tokenization
In order to perform machine learning on text documents, I first need to turn the text content into numerical feature vectors that Scikit-Learn can use.
The simplest way to do so is the *bag-of-words* model. First I converted the text documents into a matrix of token counts. The default configuration tokenizes each string by extracting words of at least 2 letters or digits separated by word boundaries, converts everything to lowercase, and builds a vocabulary from these tokens.
* Sparse Matrix
I then transformed the documents into a bag-of-words representation, i.e. a matrix. The result is stored in a sparse matrix, i.e. one with very few non-zero elements. Rows represent the documents (reviews), while columns represent the words in our training vocabulary.
* Used TF-IDF weighting instead of raw bag-of-words counts.
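The featurisation steps above can be illustrated with a small sketch using Scikit-Learn's `CountVectorizer` and `TfidfVectorizer` with their default settings. The three toy reviews are hypothetical and not taken from the dataset, and the exact vectoriser settings used in the training script may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy reviews, not taken from the real dataset.
reviews = [
    "I love this tea",
    "This tea is a bad buy",
    "Love it, love it",
]

# Bag of words: the default tokenizer lowercases the text and keeps
# tokens of two or more alphanumeric characters, so "I" and "a" are dropped.
count_vec = CountVectorizer()
bow = count_vec.fit_transform(reviews)  # sparse matrix: documents x vocabulary

# Column order follows the alphabetically sorted vocabulary.
vocab = sorted(count_vec.vocabulary_)

# TF-IDF: same vocabulary, but counts are reweighted by inverse document
# frequency and each row is L2-normalised by default.
tfidf = TfidfVectorizer().fit_transform(reviews)
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

print(vocab)
print(bow.toarray())
```

Each row of `bow` corresponds to one review and each column to one vocabulary word, so most entries are zero once the vocabulary grows, which is why a sparse matrix is used.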


* ### Model selection and n gram modelling
Of all the models tried, Logistic Regression worked best after vectorisation and n-gram modelling, giving an accuracy of 93% and a ROC AUC score of around 90%, which is a pretty good result.

Training Script: https://colab.research.google.com/drive/14ekbzUx0dcEkzMdwBAhbdf4fflz6lL4H?usp=sharing
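As a rough sketch of the model-selection step, a TF-IDF vectoriser with n-grams can be chained with Logistic Regression in a Scikit-Learn pipeline. The toy texts, the `ngram_range=(1, 2)` setting and the split parameters below are assumptions for illustration; the linked training script may use different values, and the scores from this toy data say nothing about the reported results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = positive, 0 = negative.
texts = [
    "really good coffee", "not good at all", "love this snack",
    "would not buy again", "tastes great", "tastes awful",
    "very good value", "bad smell and bad taste",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Hold out a stratified test set so both classes appear in it.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Unigrams and bigrams (assumed range) let the model see phrases
# such as "not good" as a single feature.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
# ROC AUC is computed from the predicted probability of the positive class.
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(accuracy, roc_auc)
```

Keeping the vectoriser inside the pipeline means the vocabulary is learned only from the training split, which avoids leaking test data into the features.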

* ### Deploy the model in flask
The model was then integrated with Flask and deployed on Heroku, along with additional features such as summarisation.

#3-Improve, Design & Optimize the NLP Models

https://github.com/khanfarhan10/TextSentimentAnalysis/issues/3

* ### Built a Utility Tool for Zip File Upload
Added a zip file upload feature: CSV files are extracted from the uploaded archive and the model makes predictions on them. Also added a link for downloading the generated prediction dataframe as a CSV file.

#4-Improve Utility Tools - Zip File Upload

https://github.com/khanfarhan10/TextSentimentAnalysis/issues/4
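The zip upload flow described above might look roughly like the following standard-library sketch. It is not the code from the project's `app.py`: `predict_sentiment`, `predict_from_zip` and the `Text` column name are hypothetical, and the keyword rule only stands in for the trained model.

```python
import csv
import io
import zipfile


def predict_sentiment(text):
    """Hypothetical stand-in for the trained model, not the project's classifier."""
    return 1 if "good" in text.lower() else 0


def predict_from_zip(zip_bytes, text_column="Text"):
    """Read every CSV in the archive and return the predictions as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["file", text_column, "prediction"])
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # ignore members that are not CSV files
            with archive.open(name) as raw:
                wrapped = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.DictReader(wrapped):
                    text = row[text_column]
                    writer.writerow([name, text, predict_sentiment(text)])
    return out.getvalue()


# Build a small in-memory archive to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("reviews.csv", "Text\nGood coffee\nToo bitter\n")
    archive.writestr("notes.txt", "not a csv")

result_csv = predict_from_zip(buffer.getvalue())
print(result_csv)
```

In a web app, the returned CSV text could be served as a file download, which corresponds to the download link mentioned above.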

# Pull Requests
* **Model Improvised:** https://github.com/khanfarhan10/TextSentimentAnalysis/pull/18
* **Zip file upload #28:** https://github.com/khanfarhan10/TextSentimentAnalysis/pull/28
* **Small changes to app.py #34:** https://github.com/khanfarhan10/TextSentimentAnalysis/pull/34


## Future scope
While working on the TextSentimentAnalysis project, we not only became aware of more applications of NLP but also got ideas about how to make the project more meaningful for the community. With the new features added during this period, I feel this app will certainly be useful.

## Acknowledgements
I had an enriching and exciting winter of 2020, working on this project under the Winter of Code program. I was glad to be mentored by Farhan (thanks for going through all my PRs!) and Tannistha (who helped us with her vast knowledge in and around NLP). I express my sincere thanks to all my peers working as part of WoC.

WoC was a warm and unforgettable experience for me. Every time I talk about open source, I will remember TextSentimentAnalysis fondly.