CON-2310 add missing links to datasets in pipeline subject #2320

Merged (3 commits, Nov 28, 2023)
32 changes: 16 additions & 16 deletions subjects/ai/pipeline/README.md
Today we will focus on the data preprocessing and discover the Pipeline object from Scikit-learn.
- **Step 1** is always necessary. Models work with numbers; string data, for instance, can't be processed raw.
- **Step 2** is always necessary. Machine learning models work with numbers, and missing values have no mathematical representation, which is why they have to be imputed.
- **Step 3** is required when the dimension of the data set is high. Dimension reduction algorithms reduce the dimensionality of the data either by selecting the variables that contain most of the information (`SelectKBest`) or by transforming the data. Depending on the signal in the data and the size of the data set, dimension reduction is not always required. This step is not covered here because of its complexity, but understanding the theory behind it is important, and I suggest giving it a try during the projects.

- **Step 4** is required by some types of Machine Learning algorithms, mostly KNN (K-Nearest Neighbors), Neural Networks, Linear Regression, and Logistic Regression. These algorithms work better with feature scaling because minimizing the loss function can be harder when the features have completely different ranges.
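As a quick illustration of step 4 (a sketch with made-up data, not part of the exercises), `StandardScaler` brings features with very different ranges onto a comparable scale:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: two features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean 0 and unit variance,
# so no feature dominates the loss function because of its range
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```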

These steps are sequential: the output of step 1 is the input of step 2 and so on, and the output of step 4 is the input of the Machine Learning model.
Scikit-learn provides an object for this: the `Pipeline`.

As we know, the model evaluation methodology requires splitting the data set into a train set and a test set. **The preprocessing is learned/fitted on the training set and applied to the test set**.

This object takes as input the preprocessing transforms and a Machine Learning model, and it can then be called the same way a Machine Learning model is called. This is very practical because we no longer need to carry many objects around.
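As a sketch of how the chaining works (the data, step choices and model here are illustrative, not the exercises' exact setup):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Illustrative data: two numeric features, one missing value
X_train = np.array([[1.0, 200.0],
                    [2.0, np.nan],
                    [3.0, 600.0],
                    [4.0, 800.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[2.5, np.nan]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # step 2: missing values
    ("scaler", StandardScaler()),                 # step 4: feature scaling
    ("model", LogisticRegression()),
])

# fit() fits each preprocessing step on the train set only;
# predict() applies the fitted transforms to the test set automatically
pipe.fit(X_train, y_train)
print(pipe.predict(X_test))
```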

- Scikit Learn
- Jupyter or JupyterLab

_Version of Scikit Learn I used to do the exercises: 0.22_. I suggest using the most recent one. Scikit Learn 1.0 is finally available after ... 14 years.

### **Resources**


The goal of this exercise is to set up the Python work environment with the required libraries.

**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.

I recommend using:

- the **latest stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required

1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.


# Exercise 1: Imputer 1

The goal of this exercise is to learn how to use an `Imputer` to fill missing values on a basic example.

```python
train_data = [[7, 6, 5],
              ...]
```

1. Fit the `SimpleImputer` on the data. Print the `statistics_`. Check that the statistics match `np.nanmean(train_data, axis=0)`.

2. Fill the missing values in `train_data` using the fitted `imputer` and `transform`.

3. Fill the missing values in `test_data` using the fitted `imputer` and `transform`.

```python
test_data = [[np.nan, 1, 2],
             ...]
```
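A sketch of the three questions above on small arrays (only the first row of `train_data` and the first test row come from the exercise; the other values are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data: the exercise's full arrays are not reproduced here
train = [[7.0, 6.0, 5.0],
         [4.0, np.nan, 5.0],
         [1.0, 20.0, 8.0]]
test = [[np.nan, 1.0, 2.0]]

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)

# 1. statistics_ holds the column means, ignoring NaNs
print(imputer.statistics_)  # column means: 4, 13, 6
assert np.allclose(imputer.statistics_, np.nanmean(train, axis=0))

# 2. fill the missing values of the train set
print(imputer.transform(train))

# 3. fill the missing values of the test set with the TRAIN means
print(imputer.transform(test))
```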

# Exercise 3: One Hot Encoder

The goal of this exercise is to learn how to deal with categorical variables using the `OneHotEncoder`.

```python
X_train = [['Python'], ['Java'], ['Java'], ['C++']]
```
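A sketch of how `OneHotEncoder` behaves on this `X_train` (the `handle_unknown='ignore'` option and the `'Go'` test value are assumptions for illustration, not part of the exercise):

```python
from sklearn.preprocessing import OneHotEncoder

X_train = [['Python'], ['Java'], ['Java'], ['C++']]

# handle_unknown='ignore' maps unseen categories to an all-zero row
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(X_train)

# Categories are sorted alphabetically: ['C++', 'Java', 'Python']
print(encoder.categories_)

# The default output is a sparse matrix; .toarray() makes it dense
print(encoder.transform([['Java'], ['Go']]).toarray())
# 'Java' -> [0, 1, 0]; unknown 'Go' -> [0, 0, 0]
```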

# Exercise 5: Categorical variables

The goal of this exercise is to learn how to deal with categorical variables with the Ordinal Encoder, Label Encoder and One Hot Encoder. For this exercise I strongly suggest using a recent version of `sklearn >= 0.24.1` to avoid issues with the Ordinal Encoder.

Preliminary:

- Load the [breast-cancer.csv](./data/breast-cancer.csv) file
- Drop `Class` column
- Drop NaN values
- Split the data in a train set and test set (test set size = 20% of the total size) with `random_state=43`.

3. Create one Ordinal encoder for all Ordinal features in the following order `["menopause", "age", "tumor-size","inv-nodes", "deg-malig"]` on the test set. The documentation of Scikit-learn is not clear on how to perform this on many columns at the same time. Here's a **hint**:

If the ordinal data set is (subset of two columns, but I keep all rows for this example):

| | menopause | deg-malig |
|---:|:--------------|------------:|
|  … | …             | …           |
| 3 | premeno | 3 |
| 4 | premeno | 2 |

The first step is to create a dictionary or a list (recent versions of sklearn take lists as input):

```console
dict_ = {0: ['lt40', 'premeno' , 'ge40'], 1:[1,2,3]}
```

Now that you have enough information:
- Fit on the train set
- Transform the test set
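On the toy subset from the hint table, those two steps might look like this (the pandas frames and the tiny test split are illustrative, not the exercise's real data):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data based on the hint table; the test rows are illustrative
train = pd.DataFrame({'menopause': ['premeno', 'ge40', 'premeno'],
                      'deg-malig': [3, 1, 2]})
test = pd.DataFrame({'menopause': ['lt40', 'ge40'],
                     'deg-malig': [2, 3]})

# One ordered category list per column, in column order
categories = [['lt40', 'premeno', 'ge40'], [1, 2, 3]]

encoder = OrdinalEncoder(categories=categories)
encoder.fit(train)              # fit on the train set
print(encoder.transform(test))  # transform the test set
```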

_Hint: Check the first resource_

**Note: Version 0.22 of Scikit-learn can't handle `get_feature_names` on `OrdinalEncoder`. If the column transformer contains an `OrdinalEncoder`, the method returns this error**:

AttributeError: Transformer ordinalencoder (type OrdinalEncoder) does not provide get_feature_names.

**It means that if you want to use the Ordinal Encoder, you will have to create a variable that contains the column names in the right order. This step is not required in this exercise.**
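A sketch of combining both encoders in a column transformer (the columns and data are illustrative, reusing two of the exercise's column names; `sparse_threshold=0` is an assumption that forces a dense output):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Illustrative data with two of the exercise's columns
df = pd.DataFrame({'menopause': ['premeno', 'ge40', 'lt40'],
                   'breast': ['left', 'right', 'left']})

ct = ColumnTransformer(
    [('ordinal', OrdinalEncoder(categories=[['lt40', 'premeno', 'ge40']]), ['menopause']),
     ('onehot', OneHotEncoder(), ['breast'])],
    sparse_threshold=0,  # always return a dense array
)

# Output columns: menopause (ordinal), breast_left, breast_right (one-hot)
X = ct.fit_transform(df)
print(X)
```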

Resources:

- https://towardsdatascience.com/guide-to-encoding-categorical-features-using-scikit-learn-for-machine-learning-5048997a5c79

The pipeline you will implement has to contain 3 steps:
1. Train the pipeline on the train set and predict on the test set. Give the score of the model on the test set.

---