Skip to content

Latest commit

 

History

History
22 lines (13 loc) · 2.21 KB

README.md

File metadata and controls

22 lines (13 loc) · 2.21 KB

kNN Text Classifier

Introduction

This is a simple text classifier that uses the k-Nearest Neighbors algorithm to classify documents into one of many categories found in the training corpus. The classifier is trained on a set of JSON documents, and then can be used on other, unclassified JSON objects to assign them to a . The classifier is implemented in using only the Python standard libarires and NLTK for tokenization; no ML libraries were used as the algorithms are all hand-written.

The classifier is implemented in the file knn_create_vectors.py. The file knn_prediction.py allows you to input a corpus of JSON objects that will then be classified using the vectors generated by knn_create_vectors.py. The files in the data folder can be used to train and test the classifier, though you can use your own data as well so long as all the JSON objects in your training corpus have a category field.

Usage

  • Before running the program, run the command pip install -r requirements.txt from the main directory to install the required packages.
  • The following command: python3 ./knn/knn_create_vectors.py ./data/train.json ./knn_bbc_vectors.tsv perform the necessary preprocessing on the data from the /data/train.json file and output the generated vectors to the file knn_bbc_vectors.tsv in the main folder.
  • The command: python3 ./knn/knn_prediction.py ./knn_bbc_vectors.tsv ./data/test.json 11 will run the program on the data from the /data/test.json file using the predictions stored in the knn_bbc_vectors.tsv file in the main folder.
    • This last argument can be any integer, and represents the k value to use for the kNN algorithm.
    • This will print statistics about the predictions generated by the model to the console in a formatted table.
  • To see the usage of the program, you can run the program with the -h flag, which will print the usage of the program.

Analysis

This algorithm is not the most efficient implementation of kNN since it uses no Python ML libraries and has almost no feature-selection. However, for higher values of k (>10), the prediction program performs quite well. It is both fast and both micro and macro F1 scores are > 0.9, which is quite good for a simple classifier.