CSCI 59000 BIG DATA ANALYTICS PROJECT
- Programmed an XML parser in Python using the xml.etree.ElementTree package.
- Text data is pre-processed by removing special characters.
- The text is tokenized, and word embeddings are created using the Word2Vec tool.
- A deep learning model is created with the TensorFlow framework by implementing a recurrent neural network (RNN) based on long short-term memory (LSTM).
- Bigrams are created from the text after further pre-processing, which includes removing stop words and stemming.
- A Naive Bayes classification model is built using the bigrams and the nltk package.
- The performance of both models is analyzed by drawing ROC curves and by comparing accuracy and area under the curve (AUC).
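
The parsing and pre-processing steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the tag names (`review`, `review_text`, `rating`) and the sample XML are hypothetical, since the real dataset's schema is not shown here, and stop-word removal and stemming are left out.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample; the real Amazon XML dataset may use different tags.
SAMPLE_XML = """<reviews>
  <review><rating>5</rating><review_text>Great product!!! Works well.</review_text></review>
  <review><rating>1</rating><review_text>Broke after 2 days :(</review_text></review>
</reviews>"""


def parse_reviews(xml_string):
    """Return a list of (text, rating) pairs, one per <review> element."""
    root = ET.fromstring(xml_string)
    return [
        (r.findtext("review_text", default=""), int(r.findtext("rating", default="0")))
        for r in root.iter("review")
    ]


def clean_text(text):
    """Replace every character that is not a letter, digit, or whitespace
    with a space, then lowercase the result."""
    return re.sub(r"[^A-Za-z0-9\s]", " ", text).lower()


def tokenize(text):
    """Split cleaned text on whitespace."""
    return clean_text(text).split()


def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))


reviews = parse_reviews(SAMPLE_XML)
tokens = tokenize(reviews[0][0])
print(tokens)
print(bigrams(tokens))
```

In the project itself, the bigram step runs after stop-word removal and stemming, which would slot in between `tokenize` and `bigrams`.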
Python packages: numpy, tensorflow, matplotlib, nltk, sklearn, itertools
Dataset: sample Amazon XML dataset
used tensorflow
used nltk
Predictive Model | Accuracy (%) |
---|---|
Naive Bayes Classification | 68.5 |
RNN using LSTM | 70.83 |
Naive Bayes Classification with Bigrams | 74.4 |
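
Besides accuracy, the models are compared by ROC curve and AUC. The sketch below shows how those quantities are computed from predicted scores. It is a dependency-free stand-in for illustration only (the project lists sklearn and matplotlib, which provide equivalent functionality), and the labels and scores are made-up example values, not project results.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points obtained by
    sweeping the decision threshold from the highest score to the lowest."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Handle tied scores together so they produce a single ROC point.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)


# Made-up example: 1 = positive class, scores = predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print("AUC:", auc(roc_points(labels, scores)))
print("Accuracy:", accuracy(labels, scores))
```

Plotting the points returned by `roc_points` for each model on the same axes gives the ROC comparison described in the project summary.
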
Naive Bayes classification with bigrams, built using nltk, showed the highest accuracy (74.4).
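
As a rough illustration of the best-performing approach, a Naive Bayes classifier over bigram features could be trained with nltk along the following lines. This is a sketch under assumptions: the tiny training set, the `pos`/`neg` labels, and the `bigram_features` helper are hypothetical and are not taken from the project.

```python
import nltk
from nltk.classify import NaiveBayesClassifier


def bigram_features(tokens):
    """Map a token list to a feature dict with one boolean per bigram.
    (Hypothetical helper, not from the original project.)"""
    return {" ".join(bg): True for bg in nltk.bigrams(tokens)}


# Made-up training examples; the project trains on the Amazon review data.
train = [
    ("works really well".split(), "pos"),
    ("really well made".split(), "pos"),
    ("stopped working quickly".split(), "neg"),
    ("broke very quickly".split(), "neg"),
]
train_set = [(bigram_features(tokens), label) for tokens, label in train]

classifier = NaiveBayesClassifier.train(train_set)
prediction = classifier.classify(bigram_features("it works really well".split()))
print(prediction)
```

Bigrams that never appeared in training (such as "it works" here) are ignored at prediction time, so the decision rests on the bigrams the classifier has already seen.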