This exercise is an introduction to principal component analysis in Python. The classical iris data set is used as an example. Check here for an explanation.
Thanks to Jonas Attrup Rasmussen for providing the markdown version of the exercise.
After completing this exercise, the student should be able to do the following:
- Use NumPy to read text files
- Create a data matrix from a text file
- Organise measured data into a data matrix
- Compute the variance of a set of measurements
- Compute the covariance between two sets of measurements
- Use the function pairplot from the seaborn package to visualise the covariance between multiple sets of measurements
- Compute the covariance matrix from multiple sets of measurements using matrix multiplications
- Compute the covariance matrix from multiple sets of measurements using the NumPy cov function
- Compute the principal components using eigenvector analysis (the NumPy function eig)
- Visualise how much of the total variation each principal component explains
- Project original measurements into principal component space
- Use the function pairplot to visualise the covariance structure after projecting to principal component space
- Compute the PCA using the PCA function from the scikit-learn decomposition package
Start by creating an exercise folder where you keep your Python data. Download the data from this repository and place it in this folder.
We start by reading the data. In this example, we only use the first 50 measurements, which all come from the same type of flower. Read the data and create a data matrix x:
import numpy as np
# in_dir and txt_name should point to the folder and the file name of the downloaded data
iris_data = np.loadtxt(in_dir + txt_name, comments="%")
# x is a matrix with 50 rows and 4 columns
x = iris_data[0:50, 0:4]
Then check the data dimensions by writing:
n_feat = x.shape[1]
n_obs = x.shape[0]
print(f"Number of features: {n_feat} and number of observations: {n_obs}")
We have 50 flowers and for each flower there are 4 measurements (features): sepal length, sepal width, petal length and petal width. Are your matrix dimensions correct?
To explore the data, we can create vectors of the individual features:
sep_l = x[:, 0]
sep_w = x[:, 1]
pet_l = x[:, 2]
pet_w = x[:, 3]
Compute the variance of each feature like this:
# Use ddof = 1 to make an unbiased estimate
var_sep_l = sep_l.var(ddof=1)
Do that for all the features.
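For reference, a possible solution for the remaining features could look like this:
var_sep_w = sep_w.var(ddof=1)
var_pet_l = pet_l.var(ddof=1)
var_pet_w = pet_w.var(ddof=1)
print(f"Variances: {var_sep_l}, {var_sep_w}, {var_pet_l}, {var_pet_w}")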
Now compute the covariance between the sepal length and the sepal width. Please note that we compute the covariance and not the normalised correlation. It can, though, still tell us something about the data.
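A possible way to compute it, using the same unbiased estimate as for the variance (cov_sep_lw is just an illustrative variable name):
# Subtract the means and divide by n_obs - 1 (unbiased estimate)
cov_sep_lw = np.sum((sep_l - sep_l.mean()) * (sep_w - sep_w.mean())) / (n_obs - 1)
print(cov_sep_lw)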
Compute the covariance between the sepal length and the petal length and compare it to the covariance between the sepal length and width. What do you observe?
As with image analysis, it is very useful to get a graphical understanding of the data and what structures are hidden in them. For this, we will use the seaborn Python package, which also uses the pandas package. Take a look at Appendix A, at the end of the document, for installation instructions.
Import seaborn, pandas, and matplotlib:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
Now let us get a closer look at the data structure using a pairplot:
plt.figure()  # Added to make sure that the figure appears
# Transform the data into a pandas dataframe
d = pd.DataFrame(x, columns=['Sepal length', 'Sepal width',
                             'Petal length', 'Petal width'])
sns.pairplot(d)
plt.show()
Which measurements are related and which ones are not? Can you recognise the results you found when you computed the variance and covariance?
We will do the principal component analysis (PCA) in two ways: first with a step-by-step method combining several computations, and finally with a dedicated PCA function that does it all at once.
In the first approach, we do the analysis step-wise. Start by subtracting the mean from the data:
mn = np.mean(x, axis=0)
data = x - mn
Now compute the covariance matrix using matmul to multiply two matrices together. Remember to transpose data in one of the arguments to the function, and to divide by n_obs - 1 to get the unbiased estimate.
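A minimal sketch of this step could look like the following (c_x matches the variable name used in the eigenvector analysis below):
# Covariance matrix: (data^T data) / (n_obs - 1), since the mean is already subtracted
c_x = np.matmul(data.T, data) / (n_obs - 1)
print(c_x.shape)  # should be (4, 4)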
You can also use the NumPy function cov to compute the covariance matrix. Verify that the two approaches give the same result.
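One possible check, assuming c_x from the step above:
c_x_np = np.cov(x, rowvar=False)  # with rowvar=False the columns are treated as variables
print(np.allclose(c_x, c_x_np))  # should print True if the two approaches agree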
Now you can compute the principal components using eigenvector analysis:
values, vectors = np.linalg.eig(c_x) # Here c_x is your covariance matrix.
The values are the eigenvalues and the vectors are the eigenvectors (the principal components).
Let's examine some properties of the principal components. First, try to find out how much of the total variation the first principal component explains.
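A possible way to compute this (note that np.linalg.eig does not guarantee a sorted order, so we take the largest eigenvalue):
frac_first = values.max() / values.sum()
print(f"The first principal component explains {frac_first * 100:.1f}% of the total variation")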
You can also plot the amount of explained variation for each component:
v_norm = values / values.sum() * 100
plt.plot(v_norm)
plt.xlabel('Principal component')
plt.ylabel('Percent explained variance')
plt.ylim([0, 100])
plt.show()
The data can be projected onto the PCA space by using the dot-product:
pc_proj = vectors.T.dot(data.T)
Try to use seaborn's pairplot with the projected data. How does the covariance structure look?
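A possible sketch, assuming pc_proj from above (its columns are the projected observations, so it is transposed back before building the dataframe; the column names PC1-PC4 are just illustrative):
proj_df = pd.DataFrame(pc_proj.T, columns=['PC1', 'PC2', 'PC3', 'PC4'])
sns.pairplot(proj_df)
plt.show()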
The Python machine learning package scikit-learn (sklearn) has several functions for data decomposition, and PCA is one of them.
Let us explore if we get the same results using this function.
Start by installing sklearn (see Appendix A) and importing the package:
from sklearn import decomposition
Read the data matrix as before, but do not subtract the mean. The procedure subtracts the mean for you.
The PCA can be computed using:
pca = decomposition.PCA()
pca.fit(x)
values_pca = pca.explained_variance_
exp_var_ratio = pca.explained_variance_ratio_
vectors_pca = pca.components_
data_transform = pca.transform(x)
Compare these results to the results you found using the step-by-step procedure. Some of the results are transposed - which ones?
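One possible way to compare, assuming the variables from the step-by-step part. Note that eigenvectors can differ in sign and that np.linalg.eig does not guarantee the same ordering as sklearn, so the last two checks only hold when the ordering happens to match:
print(np.allclose(np.sort(values)[::-1], values_pca))  # eigenvalues vs explained variances
print(np.allclose(np.abs(vectors.T), np.abs(vectors_pca)))  # components_ holds the eigenvectors as rows
print(np.allclose(np.abs(pc_proj.T), np.abs(data_transform)))  # transform returns observations as rows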
Appendix A: Installing Python packages
sklearn (scikit-learn) (Further info here and here)
activate course02502
conda install -c anaconda scikit-learn
seaborn (Further info here and here)
activate course02502
conda install -c anaconda seaborn