diff --git a/DSC IEM/Ankita-Sareen b/DSC IEM/Ankita-Sareen new file mode 100644 index 0000000..7213f1d --- /dev/null +++ b/DSC IEM/Ankita-Sareen @@ -0,0 +1,66 @@ + + +# Developer - Winter of Code 2020 + +# Ankita Sareen + +# Organisation Name - DSC IEM - TextSentimentAnalysis +* **Project:** [TextSentimentAnalysis](https://github.com/khanfarhan10/TextSentimentAnalysis) + +# Overview +Text Sentiment Analysis in Python using Natural Language Processing (NLP) for Negative/Positive/Neutral sentimentDetection. +My work in this winter somehow made the accuracy of the model better , adding more of new features to it making it more beneficial for the communtiy on a regular purpose. + +I am Ankita Sareen, an undergraduate student from NIT Rourkela, as part of Winter of Code. +This project was mentored by – Farhan Hai Khan and Tannistha Pal. The project had constant guidance from both of them. + +## Contributions +* ### Selection of proper Dataset +Getting proper data for training models suitable to our requirements is important.Therefore, I picked up the well known "Amazon fine food reviews" dataset from the kaggle. + +* ### Data preprocessing +I dropped the rows where score = 3 because neutral reviews don't provide value to the prediction.Next I created a column called positivity where any score above 3 is encoded as 1 otherwise 0. For other applications I could have applied various techniques to reduce memory usage, here we are going to just drop columns which we don't require and consume deep memory. + +* ### Featurisation, EDA and Tf-idf +Using the seaborn library and wordcloud,tried to analyse the prediction column and the major words for the respective sentiment. + * Tokenization + In order to perform machine learning on text documents, I first need to turn these text content into numerical feature vectors that Scikit-Learn can use. + The simplest way to do so is to use *bags-of-words*. First I converted the text document into a matrix of tokens. The default configuration tokenizes the string, by extracting words of at least 2 letters or numbers, separated by word boundaries, converts everything to lowercase and builds a vocabulary using these tokens + * Sparse Matrix + I then transformed the document into a bag-of-words representation i.e matrix form. The result is stored in a sparse matrix i.e it has very few non zero elements.Rows represent the words in the document while columns represent the words in our training vocabulary. + * Used TF IDF over bag of words. + + +* ### Model selection and n gram modelling + Of all the models tried, Logistic Regression works the best after vectorisation and n gram modelling giving the accuracy of 93% and ROC_AUC score of 90% around which is pretty good model. + + Training Script: https://colab.research.google.com/drive/14ekbzUx0dcEkzMdwBAhbdf4fflz6lL4H?usp=sharing + +* ### Deploy the model in flask +Model was further integrated with flask and then deployed on heroku with further implementation of features such as summarisation. + + #3-Improve, Design & Optimize the NLP Models + + https://github.com/khanfarhan10/TextSentimentAnalysis/issues/3 + +* ### Built a Utility Tool for Zip File Upload +Added the feature of zip file upload, from which then csv files are extracted and predictions are made from the model. Also added one more feature, a link from where you can download the csv of the prediction dataframe generated. + + #4-Improve Utility Tools - Zip File Upload + +https://github.com/khanfarhan10/TextSentimentAnalysis/issues/4 + +# here are some of my PRs : +*Pull Requests:* + * **Model Improvised:** https://github.com/khanfarhan10/TextSentimentAnalysis/pull/18 + * **Zip file upload #28:** https://github.com/khanfarhan10/TextSentimentAnalysis/pull/28 + * **Small changes to app.py #34:** https://github.com/khanfarhan10/TextSentimentAnalysis/pull/34 + + +## Future scope +While working in the project of TextSentiment Analysis not only we got aware about more applications of NLP but we also got ideas about how to make the project more meaningful for the community around. With the new inculcated features during this period, I feel this app will certainly be of some purpose. + +## Acknowledgements +I had an enriching and exciting winter of 2020, working on this project under the Winter of Code program. I was glad to be mentored by Farhan (thanks for going through all my PRs!) and Tanishtha (who helped us with her vast knowledge in and around NLP). I express my sincere thanks to all my peers working as part of WoC. + +Woc was a warm and unforgettable experience for me. Everytime I talk of Open Source, TextSentiment Analysis will be remembered fondly.