This is a simple text classifier that uses the k-Nearest Neighbors (kNN) algorithm to assign documents to one of the categories found in the training corpus. The classifier is trained on a set of JSON documents and can then be used to classify other, unlabeled JSON objects. It is implemented using only the Python standard library and NLTK for tokenization; no ML libraries were used, as all the algorithms are hand-written.
The classifier is implemented in the file `knn_create_vectors.py`. The file `knn_prediction.py` allows you to input a corpus of JSON objects that will then be classified using the vectors generated by `knn_create_vectors.py`. The files in the `data` folder can be used to train and test the classifier, though you can use your own data as well, so long as every JSON object in your training corpus has a `category` field.
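The exact schema of a training document is not fixed beyond the required `category` field. A minimal corpus might look like the following sketch (the `text` field name and the values are illustrative, not taken from the bundled data):

```python
import json

# Hypothetical training documents: the classifier only requires that each
# object carry a "category" field; "text" stands in for whatever content
# field your corpus actually uses.
docs = [
    {"category": "sport", "text": "The match ended in a late winner."},
    {"category": "tech", "text": "The new chip doubles battery life."},
]

# A training corpus would be a single JSON array of such objects.
corpus = json.dumps(docs)
categories = sorted({d["category"] for d in json.loads(corpus)})
print(categories)  # -> ['sport', 'tech']
```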
- Before running the program, run `pip install -r requirements.txt` from the main directory to install the required packages.
- The following command will perform the necessary preprocessing on the data from the `./data/train.json` file and output the generated vectors to the file `knn_bbc_vectors.tsv` in the main folder:

  `python3 ./knn/knn_create_vectors.py ./data/train.json ./knn_bbc_vectors.tsv`
- The following command will run the program on the data from the `./data/test.json` file using the vectors stored in the `knn_bbc_vectors.tsv` file in the main folder:

  `python3 ./knn/knn_prediction.py ./knn_bbc_vectors.tsv ./data/test.json 11`

  The last argument can be any positive integer and represents the `k` value to use for the kNN algorithm. The program will print statistics about the predictions generated by the model to the console in a formatted table.
- To see the program's usage, run it with the `-h` flag.
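The two scripts divide the work in the usual kNN way: one turns documents into term vectors, the other measures the distance from a query vector to every stored vector, takes the k nearest, and votes. A minimal hand-rolled sketch of both steps follows; a whitespace tokenizer stands in for NLTK's, and cosine similarity is an assumption, so the real scripts' internals may differ:

```python
import math
from collections import Counter

def vectorize(text):
    # Stand-in for NLTK tokenization: lowercase and split on whitespace.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train, query, k):
    # train: list of (category, vector) pairs; majority vote over
    # the k training vectors most similar to the query.
    nearest = sorted(train, key=lambda cv: cosine(query, cv[1]), reverse=True)[:k]
    return Counter(cat for cat, _ in nearest).most_common(1)[0][0]

train = [
    ("sport", vectorize("late goal wins the match")),
    ("sport", vectorize("the team won the match")),
    ("tech", vectorize("new chip doubles battery life")),
]
print(knn_predict(train, vectorize("a dramatic match and a late goal"), k=3))  # -> sport
```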
This is not the most efficient implementation of kNN, since it uses no Python ML libraries and performs almost no feature selection. However, for higher values of k (> 10), the prediction program performs quite well: it is fast, and both its micro and macro F1 scores are > 0.9, which is quite good for such a simple classifier.
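For reference, micro F1 pools correct predictions across all classes (for single-label classification it reduces to plain accuracy), while macro F1 averages the per-class F1 scores equally, so rare categories weigh as much as common ones. Both can be computed by hand with the standard library; the sketch below is illustrative and is not the project's actual scoring code:

```python
from collections import Counter

def f1_scores(gold, pred):
    # Tally per-class true positives, false positives, false negatives.
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    per_class = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro = sum(per_class) / len(labels)
    # Single-label case: micro F1 equals overall accuracy.
    micro = sum(tp.values()) / len(gold)
    return micro, macro

gold = ["sport", "sport", "tech", "politics"]
pred = ["sport", "tech", "tech", "politics"]
micro, macro = f1_scores(gold, pred)
print(round(micro, 3), round(macro, 3))  # -> 0.75 0.778
```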