The first week's programming assignment for the UC San Diego online course centres on Nearest Neighbour classifiers.
We provide three types of classifiers:
- A simple (naive) 1-NN classifier with no preprocessing, using either the L1 or the L2 distance function
- A 1-NN BallTree classifier
- A 1-NN KDTree classifier
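As an illustration of the naive variant, here is a minimal sketch of brute-force 1-NN prediction with L1 or L2 distance. It is not the project's actual code: the function names `nearest_neighbour_predict` and `error_rate`, the `metric` argument, and the toy data are all assumptions made for this example, and NumPy is assumed to be available.

```python
import numpy as np


def nearest_neighbour_predict(train_x, train_y, test_x, metric="l2"):
    """Predict each test label from its single nearest training point."""
    train_x = np.asarray(train_x, dtype=float)
    test_x = np.asarray(test_x, dtype=float)
    train_y = np.asarray(train_y)
    predictions = []
    for point in test_x:
        diff = train_x - point  # broadcast: one row per training point
        if metric == "l1":
            dists = np.abs(diff).sum(axis=1)  # Manhattan distance
        elif metric == "l2":
            # Squared Euclidean distance; it has the same argmin as Euclidean.
            dists = (diff ** 2).sum(axis=1)
        else:
            raise ValueError(f"unknown metric: {metric!r}")
        predictions.append(train_y[np.argmin(dists)])
    return np.array(predictions)


def error_rate(predicted, actual):
    """Fraction of test points whose predicted label is wrong."""
    return float(np.mean(np.asarray(predicted) != np.asarray(actual)))


# Tiny made-up example: two clusters in the plane (hypothetical data).
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = ["a", "a", "b", "b"]
test_x = [[0.2, 0.4], [5.5, 4.0]]
pred_l2 = nearest_neighbour_predict(train_x, train_y, test_x, metric="l2")
pred_l1 = nearest_neighbour_predict(train_x, train_y, test_x, metric="l1")
```

Every test point is compared against every training point, so the cost grows with the product of the two set sizes. The tree-based classifiers avoid much of this work by indexing the training set once.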
So far, we have focused our tests on two datasets:
- MNIST
- Spine
The MNIST data we use is a subset of the well-known MNIST dataset, consisting of 7500 training cases and 1000 test cases.
The Spine dataset contains information on 310 patients. For each patient there are six measurements (the x) and a label (the y). The label takes one of three values: 'NO' (normal), 'DH' (herniated disk), or 'SL' (spondylolisthesis).
For the MNIST dataset, the training and test data are already separated. For the Spine dataset, we instead evaluate our models with 5-fold cross-validation.
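The cross-validation step can be sketched as follows. This is an illustrative outline under stated assumptions rather than the project's implementation: the helper names `k_fold_indices` and `cross_validated_error` are hypothetical, the shuffling and fixed seed are choices made for this example, and `predict` stands for any function that maps training data and test points to predicted labels.

```python
import numpy as np


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)


def cross_validated_error(x, y, predict, k=5, seed=0):
    """Average the test error over k folds.

    `predict(train_x, train_y, test_x)` must return one label per test point.
    """
    x, y = np.asarray(x), np.asarray(y)
    folds = k_fold_indices(len(x), k, seed)
    errors = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        predicted = predict(x[train_idx], y[train_idx], x[test_idx])
        errors.append(np.mean(np.asarray(predicted) != y[test_idx]))
    return float(np.mean(errors))


# With 310 patients and k=5, each fold holds 62 patients.
folds = k_fold_indices(310, k=5)
fold_sizes = [len(f) for f in folds]
```

Each patient appears in exactly one test fold, so every sample contributes to the reported average error once.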
This project runs on Python 3.10.12; the required packages are listed in requirements.txt.
To run the project, configure conf.yaml with the preprocessing method and dataset features, then run the entry point, classification.py.
The following table shows the average error and cumulative running time of each classifier on the two datasets.
Classifier | MNIST time (s) | MNIST avg. error | Spine time (s) | Spine avg. error |
---|---|---|---|---|
Naive | 33.88 | 0.046 | 0.2946 | 0.3613 |
BallTree | 6.56 | 0.046 | 0.0026 | 0.3613 |
KDTree | 8.31 | 0.046 | 0.0021 | 0.3613 |
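
The identical error rates across the three rows are what one would expect: all three methods perform an exact nearest-neighbour search, so they differ only in speed. The sketch below illustrates this on synthetic data. It assumes scikit-learn's `BallTree` and `KDTree` from `sklearn.neighbors`; the source does not state which tree implementation the project uses. The helpers `brute_force_nn` and `tree_nn` are hypothetical names, and the timings it prints will not match the table.

```python
import time

import numpy as np
from sklearn.neighbors import BallTree, KDTree

rng = np.random.default_rng(0)
train_x = rng.random((1000, 8))  # synthetic stand-in for training features
test_x = rng.random((100, 8))    # synthetic stand-in for test features


def brute_force_nn(train_x, test_x):
    """Index of the nearest training point (Euclidean) for each test point."""
    # Pairwise squared distances via broadcasting: shape (n_test, n_train).
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def tree_nn(tree_cls, train_x, test_x):
    """Build the tree once, then query the single nearest neighbour of each test point."""
    tree = tree_cls(train_x)          # the default metric is Euclidean
    _, idx = tree.query(test_x, k=1)  # idx has shape (n_test, 1)
    return idx[:, 0]


results = {}
for name, search in [
    ("Naive", brute_force_nn),
    ("BallTree", lambda a, b: tree_nn(BallTree, a, b)),
    ("KDTree", lambda a, b: tree_nn(KDTree, a, b)),
]:
    start = time.perf_counter()
    results[name] = search(train_x, test_x)
    print(f"{name}: {time.perf_counter() - start:.4f} s")
```

Because every method returns the same neighbour indices, the predicted labels, and therefore the error rates, coincide.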