Commit

Fix links to docs
andrewdalpino committed Jan 25, 2021
1 parent b45839d commit 6593327
Showing 1 changed file with 5 additions and 5 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -16,12 +16,12 @@ $ composer create-project rubix/iris
## Tutorial

### Introduction
The Iris dataset consists of 50 samples for each of three species of Iris flower - Iris setosa, Iris virginica, and Iris versicolor (pictured below). Each sample is comprised of 4 measurements or *features* - sepal length, sepal width, petal length, and petal width. Our objective is to train a [K Nearest Neighbors](https://docs.rubixml.com/en/latest/classifiers/k-nearest-neighbors.html) (KNN) classifier to determine the species of Iris flower from a set of unknown test samples using the Iris dataset. Let's get started!
The Iris dataset consists of 50 samples for each of three species of Iris flower - Iris setosa, Iris virginica, and Iris versicolor (pictured below). Each sample is comprised of 4 measurements or *features* - sepal length, sepal width, petal length, and petal width. Our objective is to train a [K Nearest Neighbors](https://docs.rubixml.com/classifiers/k-nearest-neighbors.html) (KNN) classifier to determine the species of Iris flower from a set of unknown test samples using the Iris dataset. Let's get started!

![Iris Flower Species](https://raw.githubusercontent.com/RubixML/Iris/master/docs/images/iris-species.png)

### Extracting the Data
The first step is to extract the Iris dataset from the `dataset.ndjson` file in our project folder into our training script. You'll notice that we've provided the Iris dataset in CSV (Comma-separated Values) format as well. This is strictly for convenience in case you wanted to view the dataset in your favorite spreadsheet software. To instantiate a new [Labeled](https://docs.rubixml.com/en/latest/datasets/labeled.html) dataset object we'll pass an [NDJSON](https://docs.rubixml.com/en/latest/extractors/ndjson.html) extractor pointing to the dataset file in our project folder to the `fromIterator()` factory method. The factory uses the last column of the data table for the labels and the rest of the columns for the values of the sample features. We'll call this our *training* set.
The first step is to extract the Iris dataset from the `dataset.ndjson` file in our project folder into our training script. You'll notice that we've provided the Iris dataset in CSV (Comma-separated Values) format as well. This is strictly for convenience in case you wanted to view the dataset in your favorite spreadsheet software. To instantiate a new [Labeled](https://docs.rubixml.com/datasets/labeled.html) dataset object we'll pass an [NDJSON](https://docs.rubixml.com/extractors/ndjson.html) extractor pointing to the dataset file in our project folder to the `fromIterator()` factory method. The factory uses the last column of the data table for the labels and the rest of the columns for the values of the sample features. We'll call this our *training* set.

> **Note:** The source code for this example can be found in the [train.php](https://github.com/RubixML/Iris/blob/master/train.php) file in project root.
@@ -39,7 +39,7 @@ $testing = $dataset->randomize()->take(10);
```

### Instantiating the Learner
Next, we'll instantiate the [K Nearest Neighbors](https://docs.rubixml.com/en/latest/classifiers/k-nearest-neighbors.html) classifier and choose the value of the `k` hyper-parameter. Hyper-parameters are constructor parameters that effect the behavior of the learner during training and inference. KNN is a distance-based algorithm that finds the *k* closest samples from the training set and predicts the label that is most common. For example, if we choose `k` equal to 5, then we may get 4 labels that are `Iris setosa` and 1 that is `Iris virginica`. In this case, the estimator would predict Iris-setosa because that is the most common label. To instantiate the learner, pass the value of hyper-parameter `k` to the constructor of the learner. Refer to the docs for more info on KNN's additional hyper-parameters.
Next, we'll instantiate the [K Nearest Neighbors](https://docs.rubixml.com/classifiers/k-nearest-neighbors.html) classifier and choose the value of the `k` hyper-parameter. Hyper-parameters are constructor parameters that effect the behavior of the learner during training and inference. KNN is a distance-based algorithm that finds the *k* closest samples from the training set and predicts the label that is most common. For example, if we choose `k` equal to 5, then we may get 4 labels that are `Iris setosa` and 1 that is `Iris virginica`. In this case, the estimator would predict Iris-setosa because that is the most common label. To instantiate the learner, pass the value of hyper-parameter `k` to the constructor of the learner. Refer to the docs for more info on KNN's additional hyper-parameters.
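
The majority vote described above can be sketched in plain PHP. This is only an illustration of the idea, not how Rubix ML implements KNN: `knnPredict()` is a hypothetical helper, the two-feature measurements are made up for the example, and Euclidean distance is assumed.

```php
// Hypothetical helper, not part of Rubix ML: predicts a label by majority
// vote among the k training samples closest to $query (Euclidean distance).
function knnPredict(array $samples, array $labels, array $query, int $k): string
{
    $distances = [];

    foreach ($samples as $i => $sample) {
        $sum = 0.0;

        foreach ($sample as $j => $value) {
            $sum += ($value - $query[$j]) ** 2;
        }

        $distances[$i] = sqrt($sum);
    }

    // Sort ascending by distance while preserving the sample indices.
    asort($distances);

    $nearest = array_slice(array_keys($distances), 0, $k);

    // Tally the labels of the k nearest neighbors.
    $votes = array_count_values(array_map(fn ($i) => $labels[$i], $nearest));

    // Put the most common label first.
    arsort($votes);

    return (string) array_key_first($votes);
}

// Made-up (petal length, petal width) pairs, for illustration only.
$samples = [
    [1.4, 0.2], [1.3, 0.2], [1.5, 0.3], [1.6, 0.2],
    [5.1, 1.9], [5.9, 2.1], [5.6, 1.8],
];

$labels = [
    'Iris setosa', 'Iris setosa', 'Iris setosa', 'Iris setosa',
    'Iris virginica', 'Iris virginica', 'Iris virginica',
];

// With k = 5 the neighbors are 4 setosa and 1 virginica, so setosa wins.
echo knnPredict($samples, $labels, [1.9, 0.4], 5), PHP_EOL;
```

The arrow function and `array_key_first()` assume PHP 7.4 or later. Choosing an odd `k` makes a tied vote less likely when there are only two competing labels.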

```php
use Rubix\ML\Classifiers\KNearestNeighbors;
@@ -66,7 +66,7 @@ During inference, the KNN algorithm interprets the features of the samples as sp
![Iris Dataset 3D Plot](https://raw.githubusercontent.com/RubixML/Iris/master/docs/images/iris-dataset-3d-plot.png)

### Validation Score
We can test the model generated during training by comparing the predictions it makes to the ground-truth labels from the testing set. We'll need to choose a cross validation [Metric](https://docs.rubixml.com/en/latest/cross-validation/metrics/api.html) to output a score that we'll interpret as the generalization ability of our newly trained estimator. The [Accuracy](https://docs.rubixml.com/en/latest/cross-validation/metrics/accuracy.html) is a simple classification metric that ranges from 0 to 1 and is calculated as the number of correct predictions to the total number of predictions. To obtain the accuracy score, pass the predictions we generated from the model earlier along with the labels from the testing set to the `score` method on the metric instance.
We can test the model generated during training by comparing the predictions it makes to the ground-truth labels from the testing set. We'll need to choose a cross validation [Metric](https://docs.rubixml.com/cross-validation/metrics/api.html) to output a score that we'll interpret as the generalization ability of our newly trained estimator. The [Accuracy](https://docs.rubixml.com/cross-validation/metrics/accuracy.html) is a simple classification metric that ranges from 0 to 1 and is calculated as the number of correct predictions to the total number of predictions. To obtain the accuracy score, pass the predictions we generated from the model earlier along with the labels from the testing set to the `score` method on the metric instance.
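
As a rough sketch of the calculation, accuracy is the number of correct predictions divided by the total number of predictions. The `accuracy()` function below is a hypothetical stand-in written for illustration, not the Rubix ML `Accuracy` class, and the predictions and labels are made up.

```php
// Hypothetical helper, not the Rubix ML Accuracy metric: returns the
// fraction of predictions that match the ground-truth labels.
function accuracy(array $predictions, array $labels): float
{
    if (count($labels) === 0 || count($predictions) !== count($labels)) {
        throw new InvalidArgumentException(
            'Predictions and labels must be non-empty and equal in length.'
        );
    }

    $correct = 0;

    foreach ($predictions as $i => $prediction) {
        if ($prediction === $labels[$i]) {
            $correct++;
        }
    }

    return $correct / count($labels);
}

// Made-up example: 10 test samples, 1 of which is misclassified.
$labels = [
    'Iris setosa', 'Iris setosa', 'Iris setosa', 'Iris versicolor',
    'Iris versicolor', 'Iris versicolor', 'Iris virginica',
    'Iris virginica', 'Iris virginica', 'Iris virginica',
];

$predictions = $labels;
$predictions[3] = 'Iris virginica'; // the single wrong prediction

$score = accuracy($predictions, $labels);

echo 'Accuracy is ', $score * 100, '%', PHP_EOL;
```

With 9 of 10 predictions correct, the score is 9 / 10 = 0.9.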

```php
use Rubix\ML\CrossValidation\Metrics\Accuracy;
@@ -90,7 +90,7 @@ Accuracy is 90%
```

### Next Steps
Congratulations on completing the introduction to machine learning in PHP with Rubix ML using the Iris dataset. Now you're ready to experiment on your own. For example, you may want to try different values of `k` or swap out the default [Euclidean](https://docs.rubixml.com/en/latest/kernels/distance/euclidean.html) distance kernel for another one such as [Manhattan](https://docs.rubixml.com/en/latest/kernels/distance/manhattan.html) or [Minkowski](https://docs.rubixml.com/en/latest/kernels/distance/minkowski.html).
Congratulations on completing the introduction to machine learning in PHP with Rubix ML using the Iris dataset. Now you're ready to experiment on your own. For example, you may want to try different values of `k` or swap out the default [Euclidean](https://docs.rubixml.com/kernels/distance/euclidean.html) distance kernel for another one such as [Manhattan](https://docs.rubixml.com/kernels/distance/manhattan.html) or [Minkowski](https://docs.rubixml.com/kernels/distance/minkowski.html).
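
To get a feel for how these kernels differ, here is a minimal plain-PHP sketch of the Minkowski distance, which reduces to Manhattan when `p` is 1 and to Euclidean when `p` is 2. The `minkowski()` function is a hypothetical helper for illustration and is not the Rubix ML kernel class.

```php
// Hypothetical helper, not a Rubix ML kernel: Minkowski distance of order $p
// between two equal-length numeric vectors.
function minkowski(array $a, array $b, float $p): float
{
    if ($p < 1.0 || count($a) !== count($b)) {
        throw new InvalidArgumentException(
            'p must be at least 1 and both vectors must be the same length.'
        );
    }

    $sum = 0.0;

    foreach ($a as $i => $value) {
        $sum += abs($value - $b[$i]) ** $p;
    }

    return $sum ** (1.0 / $p);
}

$a = [0.0, 0.0];
$b = [3.0, 4.0];

$manhattan = minkowski($a, $b, 1.0); // |3| + |4|
$euclidean = minkowski($a, $b, 2.0); // sqrt(3^2 + 4^2)
$cubic = minkowski($a, $b, 3.0);     // (3^3 + 4^3)^(1/3)

printf("p=1: %.4f, p=2: %.4f, p=3: %.4f\n", $manhattan, $euclidean, $cubic);
```

Larger values of `p` give more weight to the feature with the biggest difference, which can change which training samples count as nearest.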

## Original Dataset
Creator: Ronald Fisher
