Baseline corrections

This tutorial shows how to make baseline corrections with spectrochempy. As prerequisite, the user is expected to have read the Import and Import IR tutorials.

[1]:
import spectrochempy as scp
  SpectroChemPy's API - v.0.6.5
© Copyright 2014-2023 - A.Travert & C.Fernandez @ LCS

Now let’s import and plot a typical IR dataset which was recorded during the removal of ammonia from a NH4-Y zeolite:

[2]:
X = scp.read_omnic("irdata/nh4y-activation.spg")
X[:, 1290.0:890.0] = scp.MASKED

After setting some plotting preferences and plot it

[3]:
prefs = X.preferences
prefs.figure.figsize = (7, 3)
prefs.colormap = "magma"
_ = X.plot()
../../_images/userguide_processing_baseline_5_0.png

Background subtraction

Often, particularly for surface species, the baseline is first corrected by subtracting a reference spectrum. In this example, it could be, for instance, the last spectrum (index -1). Hence:

[4]:
Xdiff = X - X[-1]
_ = Xdiff.plot()
../../_images/userguide_processing_baseline_7_0.png

Detrend

Other simple baseline corrections - often use in preprocessing prior chemometric analysis - constist in shifting the spectra or removing a linear trend. This is done using the detrend() method, which is a wrapper of the [ detrend() method] (https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.detrend.html) from the scipy.signal module to which we refer the interested reader.

Linear trend

Subtract the linear trend of each spectrum (type=’linear’, default)

[5]:
_ = X.detrend().plot()
../../_images/userguide_processing_baseline_10_0.png

Constant trend

Subtract the average absorbance to each spectrum

[6]:
_ = X.detrend(type="constant").plot()
../../_images/userguide_processing_baseline_12_0.png

Automatic linear baseline correction abc

When the baseline to remove is a simple linear correction, one can use abc . This performs an automatic baseline correction.

[7]:
_ = scp.abc(X).plot()
../../_images/userguide_processing_baseline_15_0.png

Advanced baseline correction

‘Advanced’ baseline correction basically consists for the user to choose:

  • spectral ranges which s/he considers as belonging to the baseline - the type of polynomial(s) used to model the baseline in and between these regions (keyword: interpolation ) - the method used to apply the correction to spectra: sequentially to each spectrum, or using a multivariate approach (keyword: method ).

Range selection

Each spectral range is defined by a list of two values indicating the limits of the spectral ranges, e.g. [4500., 3500.] to select the 4500-3500 cm\(^{-1}\) range. Note that the ordering has no importance and using [3500.0, 4500.] would lead to exactly the same result. It is also possible to formally pick a single wavenumber 3750. .

The first step is then to select the various regions that we expect to belong to the baseline

[8]:
ranges = [5900.0, 5400.0], 4550.0, [4500.0, 4000.0], [2100.0, 2000.0], [1550.0, 1555.0]

After selection of the baseline ranges, the baseline correction can be made using a sequence of 2 commands:

  1. Initialize an instance of BaselineCorrection

[9]:
blc = scp.BaselineCorrection(X)
  1. compute baseline other the ranges

[10]:
Xcorr = blc.compute(ranges)
Xcorr
[10]:
name nh4y-activation
author runner@fv-az626-878
created 2023-06-06 01:31:00+00:00
history
2023-06-06 01:30:57+00:00> Imported from spg file /home/runner/.spectrochempy/testdata/irdata/nh4y-activation.spg.
2023-06-06 01:30:57+00:00> Sorted by date
2023-06-06 01:31:00+00:00> Baseline correction. Method:
2023-06-06 01:31:00+00:00> Sequential.
2023-06-06 01:31:00+00:00> Interpolation: Polynomial, order=1.
DATA
title absorbance
values
[[ -0.1746 -0.1708 ... 1.433 1.432]
[ -0.1709 -0.1666 ... 1.382 1.382]
...
[ -0.1653 -0.1671 ... 1.53 1.53]
[ -0.1674 -0.1675 ... 1.546 1.544]] a.u.
shape (y:55, x:5549)
DIMENSION `x`
size 5549
title wavenumbers
coordinates
[ 6000 5999 ... 650.9 649.9] cm⁻¹
DIMENSION `y`
size 55
title acquisition timestamp (GMT)
coordinates
[1.468e+09 1.468e+09 ... 1.468e+09 1.468e+09] s
labels
[[ 2016-07-06 19:03:14+00:00 2016-07-06 19:13:14+00:00 ... 2016-07-07 04:03:17+00:00 2016-07-07 04:13:17+00:00]
[ vz0466.spa, Wed Jul 06 21:00:38 2016 (GMT+02:00) vz0467.spa, Wed Jul 06 21:10:38 2016 (GMT+02:00) ...
vz0520.spa, Thu Jul 07 06:00:41 2016 (GMT+02:00) vz0521.spa, Thu Jul 07 06:10:41 2016 (GMT+02:00)]]
  • plot the result (blc.corrected.plot() would lead to the same result)

[11]:
_ = Xcorr.plot()
../../_images/userguide_processing_baseline_23_0.png

Interpolation method

The previous correction was made using the default parameters for the interpolation ,i.e. an interpolation using cubic Hermite spline interpolation: interpolation='pchip' (pchip stands for Piecewise Cubic Hermite Interpolating Polynomial). This option triggers the use of scipy.interpolate.PchipInterpolator() to which we refer the interested readers. The other interpolation method is the classical polynomial interpolation (interpolation='polynomial' ) in which case the order can also be set (e.g. order=3 , the default value being 6). In this case, the base methods used for the interpolation are those of the polynomial module of spectrochempy, in particular the polyfit() method.

For instance:

First, we put the ranges in a list

[12]:
ranges = [[5900.0, 5400.0], [4000.0, 4500.0], [2100.0, 2000.0], [1550.0, 1555.0]]

Warning

if you use a tuple to define the sequences of ranges:

ranges = [5900.0, 5400.0], [4000., 4500.], [2100., 2000.0], [1550., 1555.]

or

ranges = ([5900.0, 5400.0], [4000., 4500.], [2100., 2000.0], [1550., 1555.])

then you can call compute by directly pass the ranges tuple, or you can unpack it as below.

blc.compute(ranges, ....)

if you use a list instead of tuples:

ranges = [[5900.0, 5400.0], [4000., 4500.], [2100., 2000.0], [1550., 1555.]]

then you MUST UNPACK the element when calling compute:

blc.compute(*ranges, ....)
[13]:
blc = scp.BaselineCorrection(X)
blc.compute(*ranges, interpolation="polynomial", order=6)
[13]:
name nh4y-activation
author runner@fv-az626-878
created 2023-06-06 01:31:00+00:00
history
2023-06-06 01:30:57+00:00> Imported from spg file /home/runner/.spectrochempy/testdata/irdata/nh4y-activation.spg.
2023-06-06 01:30:57+00:00> Sorted by date
2023-06-06 01:31:01+00:00> Baseline correction. Method:
2023-06-06 01:31:01+00:00> Sequential.
2023-06-06 01:31:01+00:00> Interpolation: Polynomial, order=6.
DATA
title absorbance
values
[[-0.02411 -0.02017 ... 0.001605 -0.00293]
[-0.03628 -0.03177 ... 0.001021 -0.003108]
...
[-0.01756 -0.01955 ... -0.0004106 -0.003214]
[-0.02053 -0.02076 ... 0.00117 -0.003751]] a.u.
shape (y:55, x:5549)
DIMENSION `x`
size 5549
title wavenumbers
coordinates
[ 6000 5999 ... 650.9 649.9] cm⁻¹
DIMENSION `y`
size 55
title acquisition timestamp (GMT)
coordinates
[1.468e+09 1.468e+09 ... 1.468e+09 1.468e+09] s
labels
[[ 2016-07-06 19:03:14+00:00 2016-07-06 19:13:14+00:00 ... 2016-07-07 04:03:17+00:00 2016-07-07 04:13:17+00:00]
[ vz0466.spa, Wed Jul 06 21:00:38 2016 (GMT+02:00) vz0467.spa, Wed Jul 06 21:10:38 2016 (GMT+02:00) ...
vz0520.spa, Thu Jul 07 06:00:41 2016 (GMT+02:00) vz0521.spa, Thu Jul 07 06:10:41 2016 (GMT+02:00)]]

The corrected attribute contains the corrected NDDataset.

[14]:
_ = blc.corrected.plot()
../../_images/userguide_processing_baseline_30_0.png

Multivariate method

The method option defines whether the selected baseline regions of the spectra should be taken ‘as is’ this is the default method='sequential' ), or modeled using a multivariate approach (method='multivariate' ).

The 'multivariate' option is useful when the signal‐to‐noise ratio is low and/or when the baseline changes in various regions of the spectrum are correlated. It consist in (i) modeling the baseline regions by a principal component analysis (PCA), (ii) interpolate the loadings of the first principal components over the whole spectral and (iii) modeling the spectra baselines from the product of the PCA scores and the interpolated loadings. (for detail: see Vilmin et al. Analytica Chimica Acta 891 (2015)).

If this option is selected, the user should also choose npc , the number of principal components used to model the baseline. In a sense, this parameter has the same role as the order parameter, except that it will affect how well the baseline fits the selected regions, but on both dimensions: wavelength and acquisition time. In particular a large value of npc will lead to overfit of baseline variation with time and will lead to the same result as the sequential method while a too small value would miss important principal component underlying the baseline change over time. Typical optimum values are npc=2 or npc=3 (see Exercises below).

[15]:
blc = scp.BaselineCorrection(X)
blc.compute(*ranges, interpolation="pchip", method="multivariate", npc=2)
_ = blc.corrected.plot()
../../_images/userguide_processing_baseline_32_0.png

Code snippet for ‘advanced’ baseline correction

The following code in which the user can change any of the parameters and look at the changes after re-running the cell:

[16]:
# user defined parameters
# -----------------------
ranges = (
    [5900.0, 5400.0],
    [4000.0, 4500.0],
    4550.0,
    [2100.0, 2000.0],
    [1550.0, 1555.0],
    [1250.0, 1300.0],
    [800.0, 850.0],
)
interpolation = "pchip"  # choose 'polynomial' or 'pchip'
order = 5  # only used for 'polynomial'
method = "sequential"  # choose 'sequential' or 'multivariate'
npc = 3  # only used for 'multivariate'

# code: compute baseline, plot original and corrected NDDatasets and ranges
# --------------------------------------------------------------------------------------
blc = scp.BaselineCorrection(X)
Xcorr = blc.compute(
    *ranges, interpolation=interpolation, order=order, method=method, npc=npc
)

axes = scp.multiplot(
    [X, Xcorr],
    labels=["Original", "Baseline corrected"],
    sharex=True,
    nrow=2,
    ncol=1,
    figsize=(7, 6),
    dpi=96,
)
blc.show_regions(axes["axe21"])
../../_images/userguide_processing_baseline_34_0.png

Widget for “advanced” baseline corrections

The BaselineCorrector widget can be used in either Jupyter notebook or Jupyter lab.

The buttons are the following: * upload: allows to upload new data. * process : baseline correct and plot original dataset + baseline and corrected datasets * save as: save the baseline corrected dataset.

The x slice and y slice text boxes can be used to slice the original dataset with the usual [start:stop:step] format. In both dimensions, coordinates or indexes can be used (for example, [3000.0::2] or [:100:5] are valid entries).

The Method and Interpolation dropdown fields are self-explanatory, see above for details.

Ranges must be entered as a tuple of digits or wave numbers, e.g. ([5900.0, 5400.0], 2000.0, [1550.0, 1555.0],) .

[17]:
X = scp.read_omnic("irdata/nh4y-activation.spg")
out = scp.BaselineCorrector(X)

After processing, one can get the original (sliced) dataset, corrected dataset and baselines through the following attributes:

[18]:
out.original, out.corrected, out.baseline
[18]:
(NDDataset: [float64] a.u. (shape: (y:55, x:5549)),
 NDDataset: [float64] a.u. (shape: (y:55, x:5549)),
 NDDataset: [float64] a.u. (shape: (y:55, x:5549)))

Exercises

basic: - write commands to subtract (i) the first spectrum from a dataset and (ii) the mean spectrum from a dataset - write a code to correct the baseline of the last 10 spectra of the above dataset in the 4000-3500 cm\(^{-1}\) range

intermediate: - what would be the parameters to use in ‘advanced’ baseline correction to mimic ‘detrend’ ? Write a code to check your answer.

advanced: - simulate noisy spectra with baseline drifts and compare the performances of multivariate vs sequential methods