PCA example (iris dataset)

In this example, we perform a PCA dimensionality reduction of the classical iris dataset (Ronald A. Fisher, “The Use of Multiple Measurements in Taxonomic Problems”, Annals of Eugenics, 7, pp. 179-188, 1936).

First we load the spectrochempy API package:

import spectrochempy as scp

Download the dataset from a remote server:

try:
    dataset = scp.download_iris()
except (IOError, OSError):
    print("Could not load the `IRIS` dataset. Finishing here.")
    import sys

    sys.exit(0)

Create a PCA object. Here, the number of components used by the model is determined automatically with n_components="mle". Warning: "mle" cannot be used when n_observations < n_features.

pca = scp.PCA(n_components="mle")
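
The restriction on "mle" can be checked before building the model. The following is a minimal sketch and not part of spectrochempy: the helper name choose_n_components and the 0.95 fallback fraction are assumptions made for illustration, and the shape (150, 4) is the one reported for the iris scores later in this example.

```python
def choose_n_components(n_observations, n_features, fallback=0.95):
    """Return "mle" when it is applicable, otherwise a variance fraction.

    Hypothetical helper: the fallback value is an arbitrary illustrative choice.
    """
    if n_observations < n_features:
        # "mle" cannot be used in this case, per the warning above
        return fallback
    return "mle"


print(choose_n_components(150, 4))   # iris: 150 observations, 4 features
print(choose_n_components(10, 500))  # fewer observations than features
```

The returned value could then be passed as n_components when creating the PCA object.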

Fit the dataset with the PCA model:

pca.fit(dataset)

<spectrochempy.analysis.pca.PCA object at 0x7f15afba0d00>

The number of components found is 3:

pca.n_components

They explain 99.5 % of the variance:

pca.cumulative_explained_variance[-1].value

99.48169145498102 %

We can also specify the fraction of explained variance to retain and let the model compute how many components are needed (this requires a number between 0 and 1 for n_components). We find 4 components in this case:

pca = scp.PCA(n_components=0.999)
pca.fit(dataset)
pca.n_components
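
How a variance fraction translates into a number of components can be reproduced by hand from the per-component percentages that pca.printev() prints later in this example. The sketch below is plain Python and does not call spectrochempy; the function name components_for_fraction is hypothetical, and a small tolerance absorbs the rounding of the printed values.

```python
from itertools import accumulate

# %variance per PC, copied from the printev() output of this example
variance_percent = [92.462, 5.302, 1.719, 0.518]


def components_for_fraction(percentages, fraction, tol=1e-3):
    """Smallest number of leading PCs whose cumulative variance reaches `fraction`."""
    target = 100.0 * fraction
    for k, cumulative in enumerate(accumulate(percentages), start=1):
        if cumulative >= target - tol:
            return k
    return len(percentages)


print(components_for_fraction(variance_percent, 0.999))
print(components_for_fraction(variance_percent, 0.95))
```

With these numbers, a fraction of 0.999 needs all 4 components, which matches the fit above, whereas 0.95 is already reached with 2.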

The 4 components found are available in the components attribute of pca. These components are often called loadings in PCA.

loadings = pca.components
loadings

name	`IRIS` Dataset_PCA.components
author	runner@fv-az626-878
created	2023-06-06 01:27:21+00:00
history	2023-06-06 01:27:21+00:00> Created using method PCA.components
DATA
title	size
values	[[ 0.3616 -0.08227 0.8566 0.3588] [ 0.6565 0.7297 -0.1758 -0.07471] [ -0.581 0.5964 0.07252 0.5491] [ 0.3173 -0.3241 -0.4797 0.7511]]
shape	(y:4, x:4)
DIMENSION `x`
size	4
title	features
labels	[ sepal_length sepal width petal_length petal_width]
DIMENSION `y`
size	4
title	components
labels	[ #0 #1 #2 #3]

Note: equivalently, the loadings attribute of pca produces the same result.

pca.loadings

name	`IRIS` Dataset_PCA.get_components
author	runner@fv-az626-878
created	2023-06-06 01:27:21+00:00
history	2023-06-06 01:27:21+00:00> Created using method PCA.get_components
DATA
title
values	[[ 0.3616 -0.08227 0.8566 0.3588] [ 0.6565 0.7297 -0.1758 -0.07471] [ -0.581 0.5964 0.07252 0.5491] [ 0.3173 -0.3241 -0.4797 0.7511]]
shape	(y:4, x:4)
DIMENSION `x`
size	4
title	features
labels	[ sepal_length sepal width petal_length petal_width]
DIMENSION `y`
size	4
title	components
labels	[ #0 #1 #2 #3]
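
In PCA the loadings are expected to form an orthonormal set: each component has unit length and is orthogonal to the others. As a quick sanity check, the sketch below computes the Gram matrix of the rows printed above in plain Python; the values are copied from the output and are therefore rounded to four significant digits.

```python
# Loadings copied from the output above (rows = components, columns = features)
loadings = [
    [0.3616, -0.08227, 0.8566, 0.3588],
    [0.6565, 0.7297, -0.1758, -0.07471],
    [-0.581, 0.5964, 0.07252, 0.5491],
    [0.3173, -0.3241, -0.4797, 0.7511],
]


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Gram matrix of the rows: close to the identity if the rows are orthonormal
gram = [[dot(u, v) for v in loadings] for u in loadings]
for row in gram:
    print(" ".join(f"{value:7.3f}" for value in row))
```

The printed matrix equals the identity to within the rounding of the copied values.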

To reduce the data to a lower dimensionality using these components, we use the transform method. The result is often called scores in PCA.

scores = pca.transform()
scores

name	`IRIS` Dataset_PCA.transform
author	runner@fv-az626-878
created	2023-06-06 01:27:21+00:00
history	2023-06-06 01:27:21+00:00> Created using method PCA.transform
DATA
title
values	[[ -2.684 0.3266 -0.02151 0.001006] [ -2.715 -0.1696 -0.2035 0.0996] ... [ 1.902 0.1159 0.7229 0.04087] [ 1.39 -0.2829 0.3623 -0.1563]]
shape	(y:150, x:4)
DIMENSION `x`
size	4
title	components
labels	[ #0 #1 #2 #3]
DIMENSION `y`
size	150
title	samples
labels	[ Iris-setosa Iris-setosa ... Iris-virginica Iris-virginica]

Again, we can also use the scores attribute to get this result:

scores = pca.scores
scores

name	`IRIS` Dataset_PCA.transform
author	runner@fv-az626-878
created	2023-06-06 01:27:21+00:00
history	2023-06-06 01:27:21+00:00> Created using method PCA.transform
DATA
title
values	[[ -2.684 0.3266 -0.02151 0.001006] [ -2.715 -0.1696 -0.2035 0.0996] ... [ 1.902 0.1159 0.7229 0.04087] [ 1.39 -0.2829 0.3623 -0.1563]]
shape	(y:150, x:4)
DIMENSION `x`
size	4
title	components
labels	[ #0 #1 #2 #3]
DIMENSION `y`
size	150
title	samples
labels	[ Iris-setosa Iris-setosa ... Iris-virginica Iris-virginica]
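
Since PCA loadings are orthonormal, a row of scores can be mapped back to feature space by multiplying it by the loadings matrix. The sketch below does this by hand for the first sample, using only numbers printed above. It assumes, as is usual for PCA, that the data were mean-centered before projection, so the result is the deviation of that sample from the feature means rather than the raw measurements.

```python
# Loadings (rows = components) and first row of scores, copied from the outputs above
loadings = [
    [0.3616, -0.08227, 0.8566, 0.3588],
    [0.6565, 0.7297, -0.1758, -0.07471],
    [-0.581, 0.5964, 0.07252, 0.5491],
    [0.3173, -0.3241, -0.4797, 0.7511],
]
first_scores = [-2.684, 0.3266, -0.02151, 0.001006]
features = ["sepal_length", "sepal width", "petal_length", "petal_width"]

# centered[j] = sum over components k of scores[k] * loadings[k][j]
centered = [
    sum(score * component[j] for score, component in zip(first_scores, loadings))
    for j in range(len(features))
]

for name, value in zip(features, centered):
    print(f"{name}: {value:+.3f}")
```

The four values come out near -0.74, +0.45, -2.36 and -1.00, that is, a flower with shorter sepals, wider sepals and much smaller petals than the dataset average.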

The figures of merit (explained and cumulative variance) confirm that these 4 PCs explain 100% of the variance:

pca.printev()

PC      Eigenvalue              %variance               %cumulative
         of cov(X)                 per PC                  variance
#1       2.055e+00                 92.462                    92.462
#2       4.922e-01                  5.302                    97.763
#3       2.802e-01                  1.719                    99.482
#4       1.539e-01                  0.518                   100.000
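
The percentage columns can be cross-checked against the second column with a few lines of arithmetic. In the sketch below, which uses only the numbers printed above, the percentages are reproduced (to within the rounding of the table) when the squares of the printed values are normalized, not the values themselves. This is an observation about these printed numbers, offered as a consistency check; it is not a statement about how spectrochempy computes the table.

```python
from itertools import accumulate

# Second and third columns of the printev() table above
printed_values = [2.055e+00, 4.922e-01, 2.802e-01, 1.539e-01]
printed_percent = [92.462, 5.302, 1.719, 0.518]

# Normalize the squared values to percentages and accumulate them
squares = [value ** 2 for value in printed_values]
total = sum(squares)
percent = [100.0 * square / total for square in squares]
cumulative = list(accumulate(percent))

for k, (p, c, ref) in enumerate(zip(percent, cumulative, printed_percent), start=1):
    print(f"#{k}  {p:8.3f}  {c:8.3f}  (table: {ref})")
```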

These figures of merit can also be displayed graphically with a scree plot:

_ = pca.screeplot()

The score plots can be used for classification purposes. The first one, in 2D for the first 2 PCs, shows that the first PC distinguishes Iris-setosa (PC#1 score < -1) from the other species (PC#1 score > -1), while more PCs are required to distinguish versicolor from virginica.

_ = pca.scoreplot(scores, 1, 2, color_mapping="labels")
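
The PC#1 threshold read off the 2D score plot can be written as a one-line rule. The sketch below applies it to the four score rows that are visible in the truncated scores output above; the function name is_setosa is hypothetical, and the -1 cut-off is the one quoted in the text, not a fitted decision boundary.

```python
# (label, PC#1 score) pairs copied from the visible rows of the scores output
samples = [
    ("Iris-setosa", -2.684),
    ("Iris-setosa", -2.715),
    ("Iris-virginica", 1.902),
    ("Iris-virginica", 1.39),
]


def is_setosa(pc1_score, threshold=-1.0):
    """Rule quoted in the text: setosa when the PC#1 score is below -1."""
    return pc1_score < threshold


for label, pc1 in samples:
    print(f"{label:15s} PC#1={pc1:+.3f} -> setosa: {is_setosa(pc1)}")
```

The rule only separates setosa from the rest; it says nothing about versicolor versus virginica.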

The second one, in 3D for the first 3 PCs, indicates that a third PC does not help to better distinguish versicolor from virginica.

ax = pca.scoreplot(scores, 1, 2, 3, color_mapping="labels")
ax.view_init(10, 75)

This ends the example! The following line can be uncommented if no plot shows when running the .py script:

# scp.show()
