Glossary
- ALS
 ALS stands for Alternating Least Squares minimization. The algorithm at the heart of MCR-ALS, which successively resolves \(C\) and \(S^T\) by least squares, after application of the relevant constraints. It checks how close \(\hat{X} = C \cdot S^T\) is to \(X\) and either stops or starts a new iteration.
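 A minimal NumPy sketch of the alternating scheme described above, assuming a data matrix X (n_observations x n_features) and an initial guess C0 for the concentration profiles; the function name, tolerance and iteration cap are illustrative, not the SpectroChemPy implementation:

```python
import numpy as np

def als(X, C0, max_iter=50, tol=1e-8):
    """Plain alternating least squares (no constraints applied in this sketch)."""
    C = C0.copy()
    prev = np.inf
    for _ in range(max_iter):
        St = np.linalg.lstsq(C, X, rcond=None)[0]        # resolve S^T for the current C
        C = np.linalg.lstsq(St.T, X.T, rcond=None)[0].T  # resolve C for the current S^T
        resid = np.linalg.norm(X - C @ St)               # how close X_hat = C . S^T is to X
        if abs(prev - resid) < tol:
            break                                        # converged: stop the loop
        prev = resid
    return C, St
```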
- API
 API stands for Application Programming Interface, a set of methods and protocols for using SpectroChemPy (especially in Jupyter Notebooks or JupyterLab) without knowing all the details of the implementation of these methods or protocols.
- array-like
 An object which can be transformed into a 1D dataset, such as a list or tuple of numbers, or into a 2D or nD dataset, such as a list of lists or a ndarray.
- AsLS
 AsLS stands for Asymmetric Least Squares smoothing. This method uses a smoother with asymmetric deviation weighting to obtain a baseline estimate. In doing so, it is able to quickly establish and correct a baseline while retaining the information of the signal peaks.
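 A minimal sketch of the widely used Eilers and Boelens iteratively reweighted scheme behind AsLS, assuming NumPy/SciPy; the function name and the default values of lam (smoothness) and p (asymmetry) are illustrative:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def asls_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Estimate a baseline for the 1D signal y by asymmetric least squares smoothing."""
    L = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L - 2))  # 2nd-difference operator
    w = np.ones(L)                                               # asymmetric weights
    z = y
    for _ in range(n_iter):
        W = sparse.spdiags(w, 0, L, L)
        z = spsolve(W + lam * D @ D.T, w * y)   # penalized, weighted least squares
        w = p * (y > z) + (1 - p) * (y < z)     # points above the baseline get low weight
    return z
```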
- Carroucell
 Multisample FTIR cell as described in Zholobenko et al. [2020].
- closure
 - closures
 Constraint where the sum of concentrations is fixed to a target value.
- EFA
 EFA stands for Evolving Factor Analysis. EFA examines the evolution of the singular values or rank of a dataset \(X\) by systematically carrying out a PCA of submatrices of \(X\). It is often used to estimate the regions of predominance of appearing/disappearing species in an evolving mixture. See [Maeder and Zuberbuehler, 1986] for the original case study and [Maeder and de Juan, 2009] for more recent references.
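 A minimal NumPy sketch of forward EFA under the definition above: the singular values of the growing submatrix made of the first i spectra are collected row by row (backward EFA works analogously on X[i:]); the function and parameter names are illustrative:

```python
import numpy as np

def forward_efa(X, n_keep=5):
    """Singular values of X[:i] for i = 1 .. n_observations (forward EFA)."""
    n = X.shape[0]
    sv = np.full((n, n_keep), np.nan)
    for i in range(1, n + 1):
        s = np.linalg.svd(X[:i], compute_uv=False)
        k = min(n_keep, s.size)
        sv[i - 1, :k] = s[:k]
    return sv   # plot each column vs. observation index to locate emerging species
```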
- ICA
 Independent Component Analysis. ICA is a method for separating a multivariate signal into additive subcomponents.
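 As an illustration only, scikit-learn's FastICA estimator performs such a separation on a data matrix X (n_observations x n_features); the number of components is an assumption:

```python
from sklearn.decomposition import FastICA

ica = FastICA(n_components=3, random_state=0)
S_est = ica.fit_transform(X)   # estimated independent sources, shape (n_observations, 3)
A_est = ica.mixing_            # estimated mixing matrix, shape (n_features, 3)
```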
- loading
 - loadings
 In the context of PCA, loadings are vectors \(\mathbf{p}_i\) of length n_features which, associated with the corresponding score vectors, are related to the so-called i-th principal component describing the variance of a dataset \(X\).
- MCR-ALS
 MCR-ALS stands for Multivariate Curve Resolution by Alternating Least Squares. MCR-ALS resolves a set of spectra \(X\) of an evolving mixture into the spectral profiles \(S\) of “pure” species and their concentration profiles \(C\), such that:
 \[X = C \cdot S^T + E\]
 subject to various soft constraints (such as non-negativity, unimodality, closure, …) or hard constraints (e.g. equality of concentration(s) or of some spectra to given profiles).
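 Building on the ALS sketch given under the ALS entry, a hedged illustration of how soft constraints could be applied to \(C\) between the two least-squares steps (non-negativity by clipping, closure by row normalisation); real MCR-ALS implementations treat constraints more carefully:

```python
import numpy as np

def apply_soft_constraints(C, total=1.0):
    """Illustrative non-negativity and closure constraints on concentration profiles."""
    C = np.clip(C, 0.0, None)                            # non-negativity
    row_sums = C.sum(axis=1, keepdims=True)
    C = np.where(row_sums > 0, total * C / row_sums, C)  # closure: each row sums to `total`
    return C
```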
- n_components
 Number of underlying components or latent variables for spectroscopic data.
- n_features
 Number of features. A feature for a spectroscopic observation (spectrum) is generally a measurement at a single frequency/energy or any derived quantity.
- n_observations
 Number of observations. When dealing with spectroscopic data, an observation is generally a single spectrum record.
- n_targets
 Number of targets. A target is a property to predict using cross-decomposition methods such as PLS. Typically a target is a composition variable such as a concentration.
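 For illustration, a small sketch of how these shape conventions fit together (the arrays and sizes below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 1000))   # n_observations=50 spectra, n_features=1000 wavenumbers
Y = rng.random((50, 3))      # n_targets=3 properties (e.g. concentrations) per spectrum
n_components = 4             # number of latent variables assumed for a decomposition
```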
- NMF
 NMF stands for Non-negative Matrix Factorization. NMF is a method for factorizing a non-negative matrix \(X\) into two non-negative matrices \(W\) and \(H\) such that \(X = W \cdot H\). NMF is often used for feature extraction and dimensionality reduction.
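 As an illustration, scikit-learn's NMF estimator factorizes a non-negative data matrix X as described above; the number of components, the init strategy and the iteration cap are assumptions:

```python
from sklearn.decomposition import NMF

model = NMF(n_components=3, init="nndsvd", max_iter=500)
W = model.fit_transform(X)    # (n_observations, 3), non-negative
H = model.components_         # (3, n_features), non-negative
X_hat = W @ H                 # low-rank non-negative approximation of X
```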
- PCA
 Principal Component Analysis. PCA is directly related to the SVD. Its master equation is:
 \[\mathbf{X} = \mathbf{T} \mathbf{P}^t + \mathbf{E}\]
 where \(\mathbf{T} \equiv \mathbf{U} \mathbf{\Sigma}\) is called the scores matrix and \(\mathbf{P}^t \equiv \mathbf{V}^t\) the loadings matrix. The columns of \(\mathbf{T}\) are called the score vectors and the rows of \(\mathbf{P}^t\) are called loading vectors. Together, the n-th score and loading vectors are related to a latent variable called the n-th principal component.
 Hence, \(\mathbf{T}\) and \(\mathbf{P}\) can then be viewed as collections of \(n\) and \(m\) vectors in k-dimensional spaces in which each observation/spectrum or feature/wavelength can be located.
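 A minimal NumPy sketch of the relation between the SVD and the PCA scores and loadings defined above, assuming a data matrix X that is mean-centered before decomposition:

```python
import numpy as np

Xc = X - X.mean(axis=0)                       # PCA is usually performed on centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s                                     # scores matrix  T = U . Sigma, (n_observations, k)
Pt = Vt                                       # loadings matrix P^t = V^t,   (k, n_features)
X_rank2 = T[:, :2] @ Pt[:2] + X.mean(axis=0)  # reconstruction keeping 2 principal components
```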
- PLS
 Partial Least Squares regression (or Projection on Latent Structures) is a statistical method to estimate \(n \times l\) dependent or predicted variables \(Y\) from \(n \times m\) explanatory or observed variables \(X\) by projecting both of them on new spaces spanned by \(k\) latent variables, according to the master equations:
 \[X = S_X L_X^T + E_X\]
 \[Y = S_Y L_Y^T + E_Y\]
 \[S_X, S_Y = \textrm{argmax}_{S_X, S_Y}(\textrm{cov}(S_X, S_Y))\]
 \(S_X\) and \(S_Y\) are \(n \times k\) matrices often called X- and Y-score matrices, and \(L_X^T\) and \(L_Y^T\) are, respectively, \(k \times m\) and \(k \times l\) X- and Y-loading matrices. Matrices \(E_X\) and \(E_Y\) are the error terms or residuals. As indicated by the third equation, the decompositions of \(X\) and \(Y\) are made so as to maximise the covariance of the score matrices.
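 As an illustration, scikit-learn's PLSRegression implements this kind of projection on latent structures; the number of components and the array names are assumptions:

```python
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=3)
pls.fit(X, Y)                 # X: (n_observations, n_features), Y: (n_observations, n_targets)
Sx, Sy = pls.transform(X, Y)  # X- and Y-score matrices, each (n_observations, 3)
Y_pred = pls.predict(X)       # predicted targets for the (here, training) spectra
```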
- rank
 Number of linearly independent rows or columns of a matrix.
- regularization
 Technique used to reduce over-fitting when fitting a function to given data, by adding, e.g., a smoothness constraint whose extent is tuned by a regularization parameter \(\lambda\).
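 A minimal worked example of the idea, assuming a ridge-type penalty on a least-squares fit (the function and parameter names are illustrative):

```python
import numpy as np

def regularized_lstsq(A, b, lam=1.0):
    """Solve min ||A x - b||^2 + lam * ||x||^2, i.e. (A^T A + lam I) x = A^T b."""
    n = A.shape[1]
    # larger lam -> stronger regularization (smaller-norm, less over-fitted solution)
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
```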
- score
 - scores
 In the context of PCA, scores are vectors \(\mathbf{t}_i\) of length n_observations which, associated with the corresponding loading vectors, are related to the so-called i-th principal component describing the variance of a dataset \(X\).
- SVD
 SVD stands for Singular Value Decomposition. SVD decomposes a matrix \(\mathbf{X}(n,m)\) (typically a set of \(n\) spectra) as:
 \[\mathbf{X} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^t + \mathbf{E}\]
 where \(\mathbf{U}(n,k)\) and \(\mathbf{V}^t(k,m)\) are matrices regrouping the so-called left and right singular vectors, with \(k \leq \min(n,m)\). The factorization is exact (null error \(E\)) when \(k = \min(n,m)\). Among other properties, the left and right singular vectors each form an orthonormal basis of a \(k\)-dimensional space. Hence, for \(\mathbf{U}\):
 \[\mathbf{u}_i^t\mathbf{u}_j = \delta_{ij}\]
 \(\Sigma\) is a diagonal \(k \times k\) matrix whose diagonal elements \(\sigma_i\) are called the singular values of the matrix \(X\). The number \(r\) of non-negligible (formally non-null) singular values is called the rank of \(X\) and determines the number of linearly independent rows or columns of \(X\).
 The singular values \(\sigma_i\) are generally sorted in descending order, so that the first component \(\sigma_1 \mathbf{u}_1\mathbf{v}_1^t\) models most of the dataset \(\mathbf{X}\), the second component models most of the remaining part of \(\mathbf{X}\), etc. Overall, the dataset can thus be reconstructed as the sum of the first \(r\) components:
 \[\mathbf{X} = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i\mathbf{v}_i^t\]
 Finally, the sum of the singular values is equal to the total variance of the spectra and each singular value represents the amount of variance captured by the corresponding component:
 \[\% \textrm{variance explained} = \frac{\sigma_i}{\sum_{i=1}^r \sigma_i} \times 100\]
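 A minimal NumPy sketch of these definitions (the variance-explained expression follows the formula above):

```python
import numpy as np

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X: (n_observations, n_features)
explained = 100 * s / s.sum()                     # % variance captured by each component
r = np.linalg.matrix_rank(X)                      # number of non-negligible singular values
X_r = (U[:, :r] * s[:r]) @ Vt[:r]                 # reconstruction from the first r components
```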
- unimodality
 Constraint where the profile has a single maximum or minimum.