PCA example (iris dataset)

In this example, we perform a PCA dimensionality reduction of the classical iris dataset (Ronald A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems", Annals of Eugenics, 7, pp. 179-188, 1936).

First, we load the spectrochempy API package:

import spectrochempy as scp

Load a dataset from scikit-learn:
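The loading code does not appear in this rendering. A minimal sketch that builds an equivalent NDDataset from scikit-learn's iris data (the NDDataset/Coord wrapping below is an assumption made for illustration, not necessarily the call used in the original script):

from sklearn import datasets

iris = datasets.load_iris()
# wrap the 150 x 4 data array in an NDDataset (assumed construction)
dataset = scp.NDDataset(iris.data, title="size")
dataset.set_coordset(
    y=scp.Coord(labels=[iris.target_names[i] for i in iris.target], title="samples"),
    x=scp.Coord(labels=["sepal_length", "sepal_width", "petal_length", "petal_width"],
                title="features"),
)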

Create a PCA object Here, the number of components wich is used by the model is automatically determined using n_components="mle". Warning: mle cannot be used when n_observations < n_features.

pca = scp.PCA(n_components="mle")
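
As a guard against that warning (a sketch using only the dataset's shape; the fallback choice is ours, not from the original example):

n_observations, n_features = dataset.shape
# "mle" requires n_observations >= n_features; otherwise pick an explicit count
if n_observations < n_features:
    pca = scp.PCA(n_components=min(n_observations, n_features))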

Fit the dataset with the PCA model:

pca.fit(dataset)
<spectrochempy.analysis.decomposition.pca.PCA object at 0x7fcff1a7f2f0>

The number of components found is 3:

pca.n_components
3

It explains 99.5% of the variance:

pca.cumulative_explained_variance[-1].value
99.47878161267252 %


We can also specify the amount of explained variance and let the model determine how many components are needed (n_components must then be a number between 0 and 1). In this case, 4 components are found:

pca = scp.PCA(n_components=0.999)
pca.fit(dataset)
pca.n_components
4
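
For reference, n_components thus accepts several forms, mirroring scikit-learn's PCA (a sketch; the integer form fixes the count directly):

scp.PCA(n_components=2)      # exactly 2 components
scp.PCA(n_components=0.999)  # smallest count reaching 99.9% of the variance
scp.PCA(n_components="mle")  # automatic choice by maximum likelihood estimation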

The 4 components found are stored in the components attribute of pca. These components are often called loadings in PCA analysis.

pca.components

NDDataset: [float64] unitless (shape: (k:4, x:4))
name: `IRIS` Dataset_PCA.components
title: size
values:
[[ 0.3614  -0.08452  0.8567   0.3583 ]
 [ 0.6566   0.7302  -0.1734  -0.07548]
 [-0.582    0.5979   0.07624  0.5458 ]
 [ 0.3155  -0.3197  -0.4798   0.7537 ]]
Dimension `k` (size: 4, title: components), labels: [#0 #1 #2 #3]
Dimension `x` (size: 4, title: features), labels: [sepal_length sepal_width petal_length petal_width]


Note: it is equivalently possible to use the loadings attribute of pca, which produces the same result.

pca.loadings

NDDataset: [float64] unitless (shape: (k:4, x:4))
name: `IRIS` Dataset_PCA.get_components
values:
[[ 0.3614  -0.08452  0.8567   0.3583 ]
 [ 0.6566   0.7302  -0.1734  -0.07548]
 [-0.582    0.5979   0.07624  0.5458 ]
 [ 0.3155  -0.3197  -0.4798   0.7537 ]]
Dimension `k` (size: 4, title: components), labels: [#0 #1 #2 #3]
Dimension `x` (size: 4, title: features), labels: [sepal_length sepal_width petal_length petal_width]
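
To make this equivalence explicit, the two arrays can be compared directly (a sketch; numpy is assumed available alongside spectrochempy):

import numpy as np

# both accessors should expose the same 4 x 4 loading matrix
assert np.allclose(pca.components.data, pca.loadings.data)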


To reduce the data to a lower dimensionality using these four components, we use the transform method. The result is often called scores in PCA analysis.

scores = pca.transform()

NDDataset: [float64] unitless (shape: (y:150, k:4))
name: `IRIS` Dataset_PCA.transform
values:
[[-2.684    0.3194  -0.02791  0.002262]
 [-2.714   -0.177   -0.2105   0.09903 ]
 ...
 [ 1.901    0.1166   0.7233   0.0446  ]
 [ 1.39    -0.2827   0.3629  -0.155   ]]
Dimension `y` (size: 150, title: samples), labels: [setosa setosa ... virginica virginica]
Dimension `k` (size: 4, title: components), labels: [#0 #1 #2 #3]


Again, we can also use the scores attribute to get the same result:

pca.scores

NDDataset: [float64] unitless (shape: (y:150, k:4))
name: `IRIS` Dataset_PCA.transform
values:
[[-2.684    0.3194  -0.02791  0.002262]
 [-2.714   -0.177   -0.2105   0.09903 ]
 ...
 [ 1.901    0.1166   0.7233   0.0446  ]
 [ 1.39    -0.2827   0.3629  -0.155   ]]
Dimension `y` (size: 150, title: samples), labels: [setosa setosa ... virginica virginica]
Dimension `k` (size: 4, title: components), labels: [#0 #1 #2 #3]
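
Conversely, an approximation of the original 4-feature data can be rebuilt from the scores (a sketch, assuming the model exposes the usual inverse_transform counterpart of transform):

# reconstruct the data from the reduced representation
reconstructed = pca.inverse_transform(scores)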


The figures of merit (explained and cumulative explained variance) confirm that these 4 PCs explain 100% of the variance.
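
The corresponding output is not reproduced here, but these quantities can be inspected directly (a sketch reusing the cumulative_explained_variance attribute shown above; with all 4 components retained, its last value should reach 100%):

# cumulative explained variance after the last (4th) component
pca.cumulative_explained_variance[-1].value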

These figures of merit can also be displayed graphically.

The scree plot:

pca.screeplot()

[Scree plot: explained variance per component, with cumulative explained variance (%) vs. components]

The score plots can be used for classification purposes. The first one, in 2D for the first two PCs, shows that the first PC allows distinguishing Iris-setosa (score on PC#1 < -1) from the other species (score on PC#1 > -1), while more PCs are required to distinguish versicolor from virginica.

pca.scoreplot(scores, 1, 2, color_mapping="labels")
[Score plot: PC# 1 (92.462%) vs. PC# 2 (5.307%), colored by species labels]
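
A quick numerical check of that reading (a sketch operating on the scores array; numpy assumed):

import numpy as np

# samples with a PC#1 score below -1 should essentially be the 50 setosa
print((scores.data[:, 0] < -1).sum())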

The second one, in 3D for the first three PCs, indicates that a third PC won't allow better distinguishing versicolor from virginica.

ax = pca.scoreplot(scores, 1, 2, 3, color_mapping="labels")
ax.view_init(10, 75)
[3D score plot: PC# 1, PC# 2, PC# 3, colored by species labels]

This ends the example! The following line can be uncommented if no plot shows when running the .py script with python:

# scp.show()

Total running time of the script: (0 minutes 0.547 seconds)