In this notebook we will explore the impact of applying Principal Component Analysis (PCA) to an image dataset. For this, we will use the benchmark Fashion-MNIST dataset (the link to the dataset can be found here). Each image in the dataset is 784-dimensional, and in this small tutorial we explore whether we can compress these images further without losing valuable information.
We will first load all necessary libraries as well as some helper functions that will allow us to prepare the data.
import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
from helpers import *
import pandas as pd
import sklearn
import sklearn.metrics
from sklearn.decomposition import PCA
As described here, the dataset contains 60k training examples and 10k test examples. Each example is accompanied by a label, which can be one of:
- 0 T-shirt/top
- 1 Trouser
- 2 Pullover
- 3 Dress
- 4 Coat
- 5 Sandal
- 6 Shirt
- 7 Sneaker
- 8 Bag
- 9 Ankle boot
We will visualize these labels in due time, but first let's load the data using the load_FashionMNIST method provided by the data providers.
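If you don't have the helpers module at hand, here is a rough sketch of what a loader like load_FashionMNIST could do, assuming the raw gzipped IDX files distributed by Zalando Research sit in the data folder; the actual helper used in this notebook may be organised differently.
# Hypothetical loader sketch -- the real load_FashionMNIST in helpers.py may differ
import gzip
import os

def load_idx_images(path):
    # IDX image files start with a 16-byte header (magic number, count, rows, cols)
    with gzip.open(path, 'rb') as f:
        pixels = np.frombuffer(f.read(), dtype=np.uint8, offset=16)
    return pixels.reshape(-1, 784)  # flatten each 28x28 image into 784 features

def load_idx_labels(path):
    # IDX label files start with an 8-byte header (magic number, count)
    with gzip.open(path, 'rb') as f:
        return np.frombuffer(f.read(), dtype=np.uint8, offset=8)

# Example usage (file names as distributed by Zalando Research):
# Xtrn = load_idx_images(os.path.join('data', 'train-images-idx3-ubyte.gz'))
# Ytrn = load_idx_labels(os.path.join('data', 'train-labels-idx1-ubyte.gz'))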
# Load Data
DataPath = 'data'
Xtrn, Ytrn, Xtst, Ytst = load_FashionMNIST(DataPath)
Let's make a copy of the original data, since we'll make further transformations to Xtrn, Ytrn, Xtst, and Ytst.
# Make backup
Xtrn_orig, Xtst_orig = Xtrn, Xtst
Let's prepare our data for PCA by dividing all elements in the dataset by 255, so that pixel intensities lie between 0 and 1, and let's center the data by subtracting the mean of the training set. This is called normalizing the data, and it is a key pre-processing step before PCA is applied to a dataset.
# Divide elements
Xtrn = np.divide(Xtrn, 255)
Xtst = np.divide(Xtst, 255)
# Calculate means
Xmean = Xtrn.mean(axis=0)
# Subtract means
Xtrn_nm = Xtrn - Xmean
Xtst_nm = Xtst - Xmean
Let's now focus our attention on visualizing the mean representation of each class. We can do this by computing the mean vector of each class and plotting it with matplotlib's ax.imshow
method. We can further our analysis by looking at the 2 closest and 2 furthest samples from each mean. This can be done by computing the Euclidean distance of every sample to its class mean, sorting the samples, and taking the two with the smallest distance and the two with the largest. A rough sketch of how such a helper could work is shown below, followed by the actual plotting code, which uses the EuclideanCalculator helper from helpers.py.
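As a guide only (the actual EuclideanCalculator lives in helpers.py and may be organised differently), the helper could look roughly like this: for each class it returns a DataFrame of that class's samples sorted by their Euclidean distance to the class mean, together with a dictionary of class mean vectors.
# Hypothetical sketch of EuclideanCalculator -- the real helper may differ
def euclidean_calculator_sketch(data_ds):
    class_dfs, class_means = {}, {}
    for c in sorted(data_ds['class'].unique()):
        class_df = data_ds[data_ds['class'] == c]
        features = class_df.drop(columns=['class'])
        mean_vec = features.values.mean(axis=0)
        # Euclidean distance of every sample to the class mean
        dists = np.linalg.norm(features.values - mean_vec, axis=1)
        order = np.argsort(dists)
        # Keep the class label in the first column (the plotting code below slices
        # the pixels with .iloc[:, 1:]), sorted from closest to furthest from the mean
        class_dfs[c] = pd.concat([class_df[['class']], features], axis=1).iloc[order]
        class_means[c] = mean_vec
    return class_dfs, class_means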
sns.set_style("whitegrid", {'axes.grid' : False})
# Create dataframe that joins the classes and features and sorts them by class
data_ds = pd.concat([pd.DataFrame(data=Xtrn), pd.DataFrame(data=Ytrn, columns=['class'])]
, axis=1).sort_values(by=['class'])
# Get dfs of each class with instances sorted by Euclidean Distance to the mean of the class
class_dfs_ED, class_mean_dictionnaries = EuclideanCalculator(data_ds)
# Plot the grid of images
fig, axes = plt.subplots(10,5, figsize = (10,18), sharex=True, sharey = True, constrained_layout = True )
for i in range(len(axes)):
    axes[i,0].set_ylabel("Class "+ str(i), fontsize=14)
    axes[i,0].imshow(class_mean_dictionnaries[i].reshape(28,28), aspect='auto', cmap = "gray_r")
    axes[i,1].set_title("Sample "+str(class_dfs_ED[i].index[0]))
    axes[i,1].imshow(class_dfs_ED[i].iloc[0, 1:].values.reshape(28,28), aspect='auto', cmap = "gray_r")
    axes[i,2].set_title("Sample "+str(class_dfs_ED[i].index[1]))
    axes[i,2].imshow(class_dfs_ED[i].iloc[1, 1:].values.reshape(28,28), aspect='auto', cmap = "gray_r")
    axes[i,3].set_title("Sample "+str(class_dfs_ED[i].index[-2]))
    axes[i,3].imshow(class_dfs_ED[i].iloc[-2, 1:].values.reshape(28,28), aspect='auto', cmap = "gray_r")
    axes[i,4].set_title("Sample "+str(class_dfs_ED[i].index[-1]))
    axes[i,4].imshow(class_dfs_ED[i].iloc[-1, 1:].values.reshape(28,28), aspect='auto', cmap = "gray_r")
    axes[i,0].tick_params(axis='both', which='both', length=0)
axes[9,0].set_xlabel("Mean", fontsize=14)
axes[9,1].set_xlabel("1st Closest", fontsize=14)
axes[9,2].set_xlabel("2nd Closest", fontsize=14)
axes[9,3].set_xlabel("2nd Furthest", fontsize=14)
axes[9,4].set_xlabel("1st Furthest", fontsize=14)
fig.suptitle("2 closest and 2 furthest samples from the mean", y=1.005, fontweight = 'bold', fontsize=15)
plt.tight_layout()
plt.show()
From the plot we can see that the mean is the prototypical representation of the class. The samples with the smallest Euclidean distance are the ones whose shape varies the least from the shape of the mean, while the samples with the largest Euclidean distance are the ones whose shape differs the most from it. With our Principal Component (PC) analysis we hope to find PCs that allow us to represent most samples faithfully with far fewer dimensions.
PCA will try to extract vector representations that describe our data best, and this should include the samples that lie furthest from the mean. Let's start this analysis by performing some initial PCA on our data.
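As a brief aside (a sketch based on the standard formulation of PCA, not part of the pipeline below): the principal components are the eigenvectors of the data's covariance matrix, ordered by decreasing eigenvalue, and the eigenvalues are the explained variances. This can be checked directly against the values scikit-learn reports in the next cell.
# Sanity-check sketch: eigenvalues of the covariance matrix of the centered
# data should match pca.explained_variance_ below (up to numerical error)
cov = np.cov(Xtrn_nm, rowvar=False)      # 784 x 784 sample covariance matrix
eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues, largest first
print(eigvals[:5])                       # should be close to pca.explained_variance_[:5]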
Let's perform PCA on our data using the PCA class from sklearn, and let's look at the explained variances of the first 5 components. We should find some interesting insights, since variance is the key decision factor in PCA. More on this here.
# Initialize PCA object
pca = PCA()
# Fit PCA to the normalized training data, keeping all components
fitted = pca.fit_transform(Xtrn_nm)
# Cumulative explained variance
cum_exp_variance = np.cumsum(pca.explained_variance_)
# Print the explained variance of the first 5 PCs
print(f"First 5 PCs: {pca.explained_variance_[:5]}")
First 5 PCs: [19.80980567 12.11221047 4.10615661 3.38182839 2.62477022]
From the results we can see that the first principal component accounts for a large part of the variance of the data, and the variance explained by each PC decreases as we move further away from the 1st PC.
Let's look at this at a much larger scale and see what happens to the cumulative variance when we extend the analysis to all 784 PCs.
# Set plot specifications
sns.set_style('darkgrid')
fig = plt.figure(figsize=(12,8))
lw = 2.5
# Get pca object and explained variance ratio
pca = PCA()
# Fit PCA to the normalized training data, keeping all components
fitted = pca.fit_transform(Xtrn_nm)
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
first_evec_90 = np.where(cumulative_variance_ratio >= 0.9)[0][0]
# Plot cumulative explained variance ratio
PCs = list(range(1, len(explained_variance_ratio)+1))
plt.plot(PCs, np.cumsum(explained_variance_ratio), color= 'teal', linewidth = lw)
plt.annotate("PC = "+str(first_evec_90+1)+", CVR = 90%", (90, 0.9)).set_fontsize(15)
plt.title('Cumulative Variance Ratio per Principal Component',fontsize=15, fontweight="bold")
plt.xticks(np.arange(0,len(PCs), 50))
plt.xlabel('Kth Principal Component', fontsize = 15)
plt.ylabel('Cumulative Variance Ratio', fontsize = 15)
plt.show()
From the figure we can see that the first 84 principal components explain 90% of the variance in the Fashion-MNIST data. We could then decide to reduce the dimensionality of the data to 84 dimensions and most of the variance would be kept. Of course, we can't perform PCA on a dataset and arbitrarily choose which dimensionality to compress it to; we have to perform a thoughtful analysis so that we keep as much information as possible while using the fewest dimensions.
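As a quick cross-check of the figure above, scikit-learn can pick the number of components for us: passing a float between 0 and 1 as n_components keeps the smallest number of components whose cumulative explained variance ratio exceeds that threshold.
# Let scikit-learn choose the number of components for a 90% variance target
pca_90 = PCA(n_components=0.9)
Xtrn_90 = pca_90.fit_transform(Xtrn_nm)
print(f"Components kept for 90% of the variance: {pca_90.n_components_}")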
We can perform a little exercise by reducing our data to two dimensions and seeing whether there is any discernible information to extract from this reduction. Let's look at this with a simple scatterplot.
sns.set_style("darkgrid")
# Initialize PCA object
pca = PCA(n_components=2)
# Fit PCA with 2 components to the normalized data
fitted_2D = pca.fit_transform(Xtrn_nm)
plt.figure(figsize=(15,10))
lw = 2
sc = plt.scatter(fitted_2D.T[0], fitted_2D.T[1], c=Ytrn, cmap=plt.cm.coolwarm, alpha=1, lw=lw)
clb = plt.colorbar(sc)
clb.ax.set_title('Class', fontsize=15)
plt.xlabel("1st PC", fontsize=12)
plt.ylabel("2nd PC", fontsize=12)
plt.title('2D PCA of Fashion-MNIST dataset', fontweight = 'bold', fontsize=15)
plt.show()
From the figure we can see that a compression from 784 dimensions down to 2 compresses the data too much for there to be a clear separation between the classes. It is perhaps arguable that there are 3 high-density areas in the data, and that classes 7, 8 and 9 are clustered together at the top of the plot. However, the 2D representation of the data is too poor to tell whether there are any clear clusters. To better understand these high-density areas we could create a 3D plot (see the sketch below) and check whether there are clusters that contain only one class, but this wouldn't guarantee any improvement over the 2D visualization.
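For reference, here is one way such a 3D view could be drawn; treat this as an exploratory sketch rather than part of the main analysis.
# Sketch of a 3D PCA scatterplot (exploratory)
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 -- registers the '3d' projection

pca_3d = PCA(n_components=3)
fitted_3D = pca_3d.fit_transform(Xtrn_nm)

fig = plt.figure(figsize=(12, 9))
ax = fig.add_subplot(111, projection='3d')
sc = ax.scatter(fitted_3D[:, 0], fitted_3D[:, 1], fitted_3D[:, 2],
                c=Ytrn, cmap=plt.cm.coolwarm, s=5, alpha=0.5)
ax.set_xlabel("1st PC")
ax.set_ylabel("2nd PC")
ax.set_zlabel("3rd PC")
fig.colorbar(sc).ax.set_title('Class', fontsize=15)
ax.set_title('3D PCA of the Fashion-MNIST dataset', fontweight='bold', fontsize=15)
plt.show()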
PCA is at times difficult to understand when we think of it in vector space, and all the linear algebra terminology can be daunting. Let's look at what these PCs mean for our image data by visualizing them as images.
sns.set_style("whitegrid", {'axes.grid' : False})
# Initialize PCA object
pca = PCA()
# Fit PCA to the normalized training data, keeping all components
fitted = pca.fit_transform(Xtrn_nm)
# Plot the grid of images
fig, axes = plt.subplots(2,5, figsize = (12,5), sharex=True, sharey = True, constrained_layout = True )
# First 10 principal components
f_10_PCS = [pca.components_[i].reshape(28,28) for i in range(0,10)]
for pc in range(0,5):
    axes[0,pc].imshow(f_10_PCS[pc], aspect='auto', cmap = "gray_r")
    axes[1,pc].imshow(f_10_PCS[pc+5], aspect='auto', cmap = "gray_r")
fig.suptitle("First 10 Principal Components", y=1.01, fontweight = 'bold', fontsize='15')
plt.tight_layout()
plt.show()
From the figure we can see that PCA is capable of creating adaptive filters that fit the trends in the data. In this case, the first 10 principal components capture most of the shapes that we will see in the dataset, from long shirts to shoes. Furthermore, since the first PC explains the most variance in the data, it suggests that the largest source of variance lies between the pixel values of a dark shirt and those of a white shoe.
Let's now look at the impact of using different numbers of principal components (K) on the quality of the reconstructed images. We will compress the data down to K dimensions, reconstruct it back into the original 784-dimensional space with inverse_transform, and measure the reconstruction error of the first sample of each class with the RMSE, computed via the sklearn.metrics.mean_squared_error
method.
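For reference, the root mean squared error between an original image $x$ and its reconstruction $\hat{x}$ over the 784 pixels is

$$\mathrm{RMSE}(x, \hat{x}) = \sqrt{\frac{1}{784}\sum_{j=1}^{784}\left(x_j - \hat{x}_j\right)^2},$$

which is what np.sqrt(sklearn.metrics.mean_squared_error(x, x_hat)) computes in the cell below.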
# Classes and K components to loop over
classes = list(range(0,10))
k_components = [5, 20, 50, 200]
k_general_dict = {}
for k_component in k_components:
    # PCA instantiation
    pca_k = PCA(n_components = k_component, random_state=1000)
    # Fit PCA on all training samples and project them onto K components
    reduced = pca_k.fit_transform(Xtrn)
    # Reconstruct the reduced samples back into the original 784-D space
    recovered = pca_k.inverse_transform(reduced)
    # Store the reconstructions obtained with this value of K
    k_general_dict[k_component] = recovered
# Initialize dictionary to store the recovered first sample of each class
k_pca_dict = {}
# Initialize dictionary to store the RMSE between each first original/recovered sample per class
k_RMSE_dict = {}
for class_ in classes:
    # Initialize dictionaries to store the recovered sample and RMSE for this class
    class_components = {}
    rmse_components = {}
    # Get all original samples belonging to this class
    original_class = Xtrn[np.where(Ytrn==class_)]
    for k_component in k_components:
        # Get the reconstructions obtained with K components
        k_recovered = k_general_dict[k_component]
        recovered_class = k_recovered[np.where(Ytrn==class_)]
        # Store the first recovered sample of the class
        class_components[str(k_component)] = recovered_class[0]
        # Root mean squared error between the first original and recovered sample
        rmse = np.sqrt(sklearn.metrics.mean_squared_error(original_class[0], recovered_class[0]))
        key = "K="+str(k_component)
        rmse_components[key] = rmse
    # Store the RMSE values and recovered samples for all K for this class
    k_RMSE_dict[class_] = rmse_components
    k_pca_dict[class_] = class_components
pd.DataFrame.from_dict(k_RMSE_dict, orient='index')
| Class | K=5 | K=20 | K=50 | K=200 |
| --- | --- | --- | --- | --- |
| 0 | 0.001005 | 0.000589 | 0.000500 | 0.000241 |
| 1 | 0.000777 | 0.000551 | 0.000375 | 0.000136 |
| 2 | 0.000779 | 0.000570 | 0.000485 | 0.000322 |
| 3 | 0.000571 | 0.000421 | 0.000326 | 0.000223 |
| 4 | 0.000464 | 0.000403 | 0.000344 | 0.000184 |
| 5 | 0.000710 | 0.000624 | 0.000558 | 0.000353 |
| 6 | 0.000508 | 0.000375 | 0.000283 | 0.000180 |
| 7 | 0.000650 | 0.000502 | 0.000419 | 0.000250 |
| 8 | 0.000876 | 0.000569 | 0.000484 | 0.000362 |
| 9 | 0.000720 | 0.000595 | 0.000479 | 0.000279 |
We can see that, as expected, the more components we use to represent our data, the closer the RMSE gets to 0. However, classes such as 0 (T-shirt/top) and 8 (Bag), which show the largest errors at K=5 in the table above, remain harder to reconstruct with few components than classes such as 4 (Coat) or 6 (Shirt).
Finally, let's compress the data ourselves and visualize some of the reconstructed samples to see what PCA does to our images.
sns.set_style("whitegrid", {'axes.grid' : False})
# Plot the grid of reconstructed first samples: one row per class, one column per K
fig, axes = plt.subplots(10,4, figsize = (10,18), sharex=True, sharey = True, constrained_layout = True )
classes = list(range(0,10))
k_components = ['5', '20', '50', '200']
# Class means of the normalized data
data_ds = pd.concat([pd.DataFrame(data=Xtrn_nm), pd.DataFrame(data=Ytrn, columns=['class'])]
                    , axis=1).sort_values(by=['class'])
class_mean_dict = NormMeanCalculator(data_ds)
# k_pca_dict (reconstructed first sample per class per K) was built in the previous cell
for i in range(len(classes)):
    # Add the class mean back to the reconstructed samples
    k5_rec_s = k_pca_dict[i]['5'] + class_mean_dict[i]
    k20_rec_s = k_pca_dict[i]['20'] + class_mean_dict[i]
    k50_rec_s = k_pca_dict[i]['50'] + class_mean_dict[i]
    k200_rec_s = k_pca_dict[i]['200'] + class_mean_dict[i]
    axes[i,0].set_ylabel("Class "+ str(i), fontsize=16)
    axes[i,0].imshow(k5_rec_s.reshape(28,28), aspect='auto', cmap = "gray_r")
    axes[i,1].imshow(k20_rec_s.reshape(28,28), aspect='auto', cmap = "gray_r")
    axes[i,2].imshow(k50_rec_s.reshape(28,28), aspect='auto', cmap = "gray_r")
    axes[i,3].imshow(k200_rec_s.reshape(28,28), aspect='auto', cmap = "gray_r")
axes[9,0].set_xlabel("K=5", fontsize=16)
axes[9,1].set_xlabel("K=20", fontsize=16)
axes[9,2].set_xlabel("K=50", fontsize=16)
axes[9,3].set_xlabel("K=200", fontsize=16)
#fig.suptitle("First sample reconstruction with K components", y=1.01, fontweight = 'bold', fontsize=15)
plt.tight_layout()
plt.show()
From the figure we can see that as the number of components K increases, the level of detail in the reconstruction increases. Also, within each column the images that appear noisiest are those with the highest RMSEs, because PCA has more difficulty representing those classes' variance in fewer dimensions.
From this small demonstration we can see that PCA is a powerful tool for compressing our data. Indeed, we could use PCA on the Fashion-MNIST data to speed up the training of a clothing recognition system (see the sketch below). However, a key takeaway is that we must be careful about how much we compress our data: a thorough analysis is needed before choosing the dimension to which we reduce our images.
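As a hypothetical closing example (a sketch, not part of the analysis above; the 84-component setting follows the 90%-variance result, and LogisticRegression is just one illustrative choice of classifier), the compression could be chained with a classifier like so:
# Hedged sketch: PCA compression feeding a simple classifier
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

clf = Pipeline([
    ('pca', PCA(n_components=84)),                  # keep roughly 90% of the variance
    ('logreg', LogisticRegression(max_iter=1000)),  # illustrative classifier choice
])
clf.fit(Xtrn_nm, Ytrn)
print(f"Test accuracy on compressed features: {clf.score(Xtst_nm, Ytst):.3f}")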