PCA analysis example

In this example, we perform the PCA dimensionality reduction of a spectra dataset

Import the spectrochempy API package

import spectrochempy as scp

Load a dataset

dataset = scp.read_omnic("irdata/nh4y-activation.spg")[::5]
print(dataset)
_ = dataset.plot()
plot pca spec
NDDataset: [float64] a.u. (shape: (y:11, x:5549))

Create a PCA object and fit the dataset so that the explained variance is greater or equal to 99.9%

pca = scp.PCA(n_components=0.999)
pca.fit(dataset)
<spectrochempy.analysis.pca.PCA object at 0x7f15af643b80>

The number of fitted components is given by the n_components attribute (We obtain 23 components)

6

Transform the dataset to a lower dimensionality using all the fitted components

name nh4y-activation_PCA.transform
author runner@fv-az626-878
created 2023-06-06 01:27:26+00:00
history
2023-06-06 01:27:26+00:00> Created using method PCA.transform
DATA
title
values
[[ 71.56 16.87 ... -0.006776 -0.216]
[ 50.39 -8.205 ... 0.5566 0.03804]
...
[ -26.25 1.694 ... 0.1281 -0.6077]
[ -25.46 1.456 ... 4.175 0.7883]]
shape (y:11, x:6)
DIMENSION `x`
size 6
title components
labels
[ #0 #1 #2 #3 #4 #5]
DIMENSION `y`
size 11
title acquisition timestamp (GMT)
coordinates
[1.468e+09 1.468e+09 ... 1.468e+09 1.468e+09] s
labels
[[ 2016-07-06 19:03:14+00:00 2016-07-06 19:53:14+00:00 ... 2016-07-07 02:43:15+00:00 2016-07-07 03:33:17+00:00]
[ vz0466.spa, Wed Jul 06 21:00:38 2016 (GMT+02:00) vz0471.spa, Wed Jul 06 21:50:37 2016 (GMT+02:00) ...
vz0512.spa, Thu Jul 07 04:40:39 2016 (GMT+02:00) vz0517.spa, Thu Jul 07 05:30:41 2016 (GMT+02:00)]]


Finally, display the results graphically ScreePlot

  • Scree plot
  • plot pca spec

Score Plot

Score plot

Score Plot for 3 PC’s in 3D

_ = pca.scoreplot(scores, 1, 2, 3)
Score plot

Displays 4 loadings

_ = pca.loadings[:4].plot(legend=True)
plot pca spec

Here we do a masking of the saturated region between 882 and 1280 cm^-1

dataset[
    :, 882.0:1280.0
] = scp.MASKED  # remember: use float numbers for slicing (not integer)
_ = dataset.plot()
plot pca spec

Apply the PCA model

pca = scp.PCA(n_components=0.999)
pca.fit(dataset)
pca.n_components
3

As seen above, now only 4 components instead of 23 are necessary to 99.9% of explained variance.

  • Scree plot
  • plot pca spec

Displays the loadings

_ = pca.loadings.plot(legend=True)
plot pca spec

Let’s plot the scores

Score plot

Our dataset has already two columns of labels for the spectra but there are little too long for display on plots.

array([[  2016-07-06 19:03:14+00:00,   vz0466.spa, Wed Jul 06 21:00:38 2016 (GMT+02:00)],
       [  2016-07-06 19:53:14+00:00,   vz0471.spa, Wed Jul 06 21:50:37 2016 (GMT+02:00)],
       ...,
       [  2016-07-07 02:43:15+00:00,   vz0512.spa, Thu Jul 07 04:40:39 2016 (GMT+02:00)],
       [  2016-07-07 03:33:17+00:00,   vz0517.spa, Thu Jul 07 05:30:41 2016 (GMT+02:00)]], dtype=object)

So we define some short labels for each component, and add them as a third column:

labels = [lab[:6] for lab in dataset.y.labels[:, 1]]
scores.y.labels = labels  # Note this does not replace previous labels,
# but adds a column.

now display thse

_ = pca.scoreplot(scores, 1, 2, show_labels=True, labels_column=2)
Score plot

This ends the example ! The following line can be uncommented if no plot shows when running the .py script

# scp.show()

Total running time of the script: ( 0 minutes 1.846 seconds)

Gallery generated by Sphinx-Gallery