This application classifies comments by subbredit using a dataset belonging to a Kaggle competition. There are 20 subreddits and running the classifier will produce a file containing the class predictions.
Libra
Before running, please create a folder called 'resources' at the root of this project and
then add the data_test.pkl
and data_train.pkl
files to that folder.
The code for the models in the report is all included. To run the best classifier against the data_test.pkl
file from Kaggle that produces the best accuracy,
please run the python file called mnb_cnb_sgd.py
located in the root directory of this project:
python mnb_cnb_sgd.py
After completion, the generated file will be called output.csv
and will be located in the resources folder.
Note: Python 3 is required to run this project
- Scikit-learn
- Numpy
- NLTK
Jessica Gauvin
Maziar Mohammad-Shahi