Glossary
- ALS
 ALS stands for Alternating Least Squares minimization. The algorithm at the heart of MCR-ALS, which successively resolves \(C\) and \(S^T\) by least squares, after application of the relevant constraints. It checks how close \(\hat{X} = C \cdot S^T\) is to \(X\) and either stops or starts a new iteration.
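 A minimal NumPy sketch of the alternating scheme described above, assuming a data matrix X (n_observations x n_features) and an initial guess C0 for the concentration profiles; the function name, tolerance and iteration cap are illustrative, not the SpectroChemPy implementation:

```python
import numpy as np

def als(X, C0, max_iter=50, tol=1e-8):
    """Plain alternating least squares (no constraints applied in this sketch)."""
    C = C0.copy()
    prev = np.inf
    for _ in range(max_iter):
        St = np.linalg.lstsq(C, X, rcond=None)[0]        # resolve S^T for the current C
        C = np.linalg.lstsq(St.T, X.T, rcond=None)[0].T  # resolve C for the current S^T
        resid = np.linalg.norm(X - C @ St)               # how close X_hat = C . S^T is to X
        if abs(prev - resid) < tol:
            break                                        # converged: stop the loop
        prev = resid
    return C, St
```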
- API
 API stands for Application Programming Interface, a set of methods and protocols for using SpectroChemPy (especially in Jupyter Notebooks or JupyterLab) without knowing all the details of the implementation of these methods or protocols.
- array-like
 An object which can be transformed into a 1D dataset, such as a list or tuple of numbers, or into a 2D or nD dataset, such as a list of lists or a ndarray.
- AsLS
 AsLS stands for Asymmetric Least Squares smoothing. This method uses a smoother with asymmetric deviation weighting to obtain a baseline estimate. In doing so, it is able to quickly establish and correct a baseline while retaining the information of the signal peaks.
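 A minimal sketch of the widely used Eilers and Boelens iteratively reweighted scheme behind AsLS, assuming NumPy/SciPy; the function name and the default values of lam (smoothness) and p (asymmetry) are illustrative:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def asls_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Estimate a baseline for the 1D signal y by asymmetric least squares smoothing."""
    L = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L - 2))  # 2nd-difference operator
    w = np.ones(L)                                               # asymmetric weights
    z = y
    for _ in range(n_iter):
        W = sparse.spdiags(w, 0, L, L)
        z = spsolve(W + lam * D @ D.T, w * y)   # penalized, weighted least squares
        w = p * (y > z) + (1 - p) * (y < z)     # points above the baseline get low weight
    return z
```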
- Carroucell
 Multisample FTIR cell as described in Zholobenko et al. [2020].
- closure
 - closures
 Constraint where the sum of concentrations is fixed to a target value.
- EFA
 EFA stands for Evolving Factor Analysis. EFA examines the evolution of the singular values or rank of a dataset \(X\) by systematically carrying out a PCA of submatrices of \(X\). It is often used to estimate the regions of predominance of appearing/disappearing species in an evolving mixture. See [Maeder and Zuberbuehler, 1986] for the original case study and [Maeder and de Juan, 2009] for more recent references.
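 A minimal NumPy sketch of forward EFA under the definition above: the singular values of the growing submatrix made of the first i spectra are collected row by row (backward EFA works analogously on X[i:]); the function and parameter names are illustrative:

```python
import numpy as np

def forward_efa(X, n_keep=5):
    """Singular values of X[:i] for i = 1 .. n_observations (forward EFA)."""
    n = X.shape[0]
    sv = np.full((n, n_keep), np.nan)
    for i in range(1, n + 1):
        s = np.linalg.svd(X[:i], compute_uv=False)
        k = min(n_keep, s.size)
        sv[i - 1, :k] = s[:k]
    return sv   # plot each column vs. observation index to locate emerging species
```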
- ICA
 Independent Component Analysis. ICA is a method for separating a multivariate signal into additive subcomponents.
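 As an illustration only, scikit-learn's FastICA estimator performs such a separation on a data matrix X (n_observations x n_features); the number of components is an assumption:

```python
from sklearn.decomposition import FastICA

ica = FastICA(n_components=3, random_state=0)
S_est = ica.fit_transform(X)   # estimated independent sources, shape (n_observations, 3)
A_est = ica.mixing_            # estimated mixing matrix, shape (n_features, 3)
```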
- loading
 - loadings
 In the context of PCA, loadings are vectors \(\mathbf{p}_i\) of length n_features which, associated with the corresponding score vectors, are related to the so-called i-th principal component describing the variance of a dataset \(X\).
- MCR-ALS
 MCR-ALS stands for Multivariate Curve Resolution by Alternating Least Squares. MCR-ALS resolves a set of spectra \(X\) of an evolving mixture into the spectral profiles \(S\) of “pure” species and their concentration profiles \(C\), such that:
 \[X = C \cdot S^T + E\]
 subject to various soft constraints (such as non-negativity, unimodality, closure, …) or hard constraints (e.g. equality of concentration(s) or of some spectra to given profiles).
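 Building on the ALS sketch given under the ALS entry, a hedged illustration of how soft constraints could be applied to \(C\) between the two least-squares steps (non-negativity by clipping, closure by row normalisation); real MCR-ALS implementations treat constraints more carefully:

```python
import numpy as np

def apply_soft_constraints(C, total=1.0):
    """Illustrative non-negativity and closure constraints on concentration profiles."""
    C = np.clip(C, 0.0, None)                            # non-negativity
    row_sums = C.sum(axis=1, keepdims=True)
    C = np.where(row_sums > 0, total * C / row_sums, C)  # closure: each row sums to `total`
    return C
```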
- n_components
 Number of underlying components or latent variables for spectroscopic data.
- n_features
 Number of features. A feature for a spectroscopic observation (spectrum) is generally a measurement at a single frequency/energy or any derived quantity.
- n_observations
 Number of observations. When dealing with spectroscopic data, an observation is generally a single spectrum record.
- n_targets
 Number of targets. A target is a property to predict using cross-decomposition methods such as PLS. Typically a target is a composition variable such as a concentration.
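 For illustration, a small sketch of how these shape conventions fit together (the arrays and sizes below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 1000))   # n_observations=50 spectra, n_features=1000 wavenumbers
Y = rng.random((50, 3))      # n_targets=3 properties (e.g. concentrations) per spectrum
n_components = 4             # number of latent variables assumed for a decomposition
```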
- NMF
 NMF stands for Non-negative Matrix Factorization. NMF is a method for factorizing a non-negative matrix \(X\) into two non-negative matrices \(W\) and \(H\) such that \(X = W \cdot H\). NMF is often used for feature extraction and dimensionality reduction.
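 As an illustration, scikit-learn's NMF estimator factorizes a non-negative data matrix X as described above; the number of components, the init strategy and the iteration cap are assumptions:

```python
from sklearn.decomposition import NMF

model = NMF(n_components=3, init="nndsvd", max_iter=500)
W = model.fit_transform(X)    # (n_observations, 3), non-negative
H = model.components_         # (3, n_features), non-negative
X_hat = W @ H                 # low-rank non-negative approximation of X
```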
- PCA
 Principal Component Analysis. PCA is directly related to the SVD. Its master equation is:
 \[\mathbf{X} = \mathbf{T} \mathbf{P}^t + \mathbf{E}\]
 where \(\mathbf{T} \equiv \mathbf{U} \mathbf{\Sigma}\) is called the scores matrix and \(\mathbf{P}^t \equiv \mathbf{V}^t\) the loadings matrix. The columns of \(\mathbf{T}\) are called the score vectors and the rows of \(\mathbf{P}^t\) are called loading vectors. Together, the n-th score and loading vectors are related to a latent variable called the n-th principal component.
 Hence, \(\mathbf{T}\) and \(\mathbf{P}\) can then be viewed as collections of \(n\) and \(m\) vectors in k-dimensional spaces in which each observation/spectrum or feature/wavelength can be located.
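 A minimal NumPy sketch of the relation between the SVD and the PCA scores and loadings defined above, assuming a data matrix X that is mean-centered before decomposition:

```python
import numpy as np

Xc = X - X.mean(axis=0)                       # PCA is usually performed on centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s                                     # scores matrix  T = U . Sigma, (n_observations, k)
Pt = Vt                                       # loadings matrix P^t = V^t,   (k, n_features)
X_rank2 = T[:, :2] @ Pt[:2] + X.mean(axis=0)  # reconstruction keeping 2 principal components
```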
- PLS
 Partial Least Squares regression (or Projection on Latent Structures) is a statistical method to estimate \(n \times l\) dependent or predicted variables \(Y\) from \(n \times m\) explanatory or observed variables \(X\) by projecting both of them on new spaces spanned by \(k\) latent variables, according to the master equations:
 \[X = S_X L_X^T + E_X\]
 \[Y = S_Y L_Y^T + E_Y\]
 \[S_X, S_Y = \textrm{argmax}_{S_X, S_Y}(\textrm{cov}(S_X, S_Y))\]
 \(S_X\) and \(S_Y\) are \(n \times k\) matrices often called X- and Y-score matrices, and \(L_X^T\) and \(L_Y^T\) are, respectively, \(k \times m\) and \(k \times l\) X- and Y-loading matrices. Matrices \(E_X\) and \(E_Y\) are the error terms or residuals. As indicated by the third equation, the decompositions of \(X\) and \(Y\) are made so as to maximise the covariance of the score matrices.
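 As an illustration, scikit-learn's PLSRegression implements this kind of projection on latent structures; the number of components and the array names are assumptions:

```python
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=3)
pls.fit(X, Y)                 # X: (n_observations, n_features), Y: (n_observations, n_targets)
Sx, Sy = pls.transform(X, Y)  # X- and Y-score matrices, each (n_observations, 3)
Y_pred = pls.predict(X)       # predicted targets for the (here, training) spectra
```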
- rank
 Number of linearly independent rows or columns of a matrix.
- regularization
 Technique used to reduce over-fitting when fitting a function to given data, by adding, e.g., a smoothness constraint whose extent is tuned by a regularization parameter \(\lambda\).
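 A minimal worked example of the idea, assuming a ridge-type penalty on a least-squares fit (the function and parameter names are illustrative):

```python
import numpy as np

def regularized_lstsq(A, b, lam=1.0):
    """Solve min ||A x - b||^2 + lam * ||x||^2, i.e. (A^T A + lam I) x = A^T b."""
    n = A.shape[1]
    # larger lam -> stronger regularization (smaller-norm, less over-fitted solution)
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
```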
- score
 - scores
 In the context of PCA, scores are vectors \(\mathbf{t}_i\) of length n_observations which, associated with the corresponding loading vectors, are related to the so-called i-th principal component describing the variance of a dataset \(X\).
- SVD
 SVD stands for Singular Value Decomposition. SVD decomposes a matrix \(\mathbf{X}(n,m)\) (typically a set of \(n\) spectra) as:
 \[\mathbf{X} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^t + \mathbf{E}\]
 where \(\mathbf{U}(n,k)\) and \(\mathbf{V}^t(k,m)\) are matrices regrouping the so-called left and right singular vectors, with \(k \leq \min(n,m)\). The factorization is exact (null error \(E\)) when \(k = \min(n,m)\). Among other properties, the left and right singular vectors each form an orthonormal basis of a \(k\)-dimensional space. Hence, for \(\mathbf{U}\):
 \[\mathbf{u}_i^t\mathbf{u}_j = \delta_{ij}\]
 \(\Sigma\) is a diagonal \(k \times k\) matrix whose diagonal elements \(\sigma_i\) are called the singular values of the matrix \(X\). The number \(r\) of non-negligible (formally non-null) singular values is called the rank of \(X\) and determines the number of linearly independent rows or columns of \(X\).
 The singular values \(\sigma_i\) are generally sorted in descending order, so that the first component \(\sigma_1 \mathbf{u}_1\mathbf{v}_1^t\) models most of the dataset \(\mathbf{X}\), the second component models most of the remaining part of \(\mathbf{X}\), etc. Overall, the dataset can thus be reconstructed as the sum of the first \(r\) components:
 \[\mathbf{X} = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i\mathbf{v}_i^t\]
 Finally, the sum of the singular values is equal to the total variance of the spectra and each singular value represents the amount of variance captured by the corresponding component:
 \[\% \textrm{variance explained} = \frac{\sigma_i}{\sum_{i=1}^r \sigma_i} \times 100\]
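 A minimal NumPy sketch of these definitions (the variance-explained expression follows the formula above):

```python
import numpy as np

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X: (n_observations, n_features)
explained = 100 * s / s.sum()                     # % variance captured by each component
r = np.linalg.matrix_rank(X)                      # number of non-negligible singular values
X_r = (U[:, :r] * s[:r]) @ Vt[:r]                 # reconstruction from the first r components
```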
- unimodality
 Constraint where the profile has a single maximum or minimum.