Difference between revisions of "Talk:Chemometrics"

From validwiki
 
(25 intermediate revisions by 2 users not shown)
Line 1: Line 1:
==Theory of chemometric methods==
+
== Great Start, Suggest making the Flow Chart more central ==
===Chemometric methods for 1D NMR data===
 
The primary aims of chemometrics are to separate useful information from noise and to find the key patterns in experimental data. The central idea is to reduce the dimensionality of data that consist of a large number of measured variables while retaining as much of the useful information as possible.
 
  
Most chemometric methods are based on the idea of latent variables (LVs). The initial (measured) variables can be combined into a smaller number of LVs, which describe the underlying structure of the data. The oldest and most common LV projection method is principal component analysis (PCA). PCA transforms the original data into a new set of a few orthogonal LVs (principal components, PCs), which together describe most of the variation in the data. The first PC accounts for the largest share of the variation, while each successive PC is uncorrelated with the previous PCs and expresses as much of the remaining variation as possible. PCA is a helpful data visualization technique: since each object gets a score value on each PC, objects can be presented in score plots. Score plots can reveal patterns, trends and outliers in the data.
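As a minimal sketch of the PCA description above, the scores and loadings can be obtained from a singular value decomposition of the mean-centred data. The function name <code>pca</code>, the synthetic data and all parameter values are assumptions made for illustration; they do not come from the original page.

```python
import numpy as np

def pca(X, n_components):
    """PCA of X (objects x variables) via SVD of the mean-centred data."""
    Xc = X - X.mean(axis=0)                       # centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # object coordinates on the PCs
    loadings = Vt[:n_components].T                    # variable weights of each PC
    explained = s ** 2 / np.sum(s ** 2)               # fraction of variance per PC
    return scores, loadings, explained[:n_components]

# Hypothetical data: 20 "spectra" of 50 points driven by 2 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(20, 2))
profiles = rng.normal(size=(2, 50))
X = latent @ profiles + 0.01 * rng.normal(size=(20, 50))

scores, loadings, explained = pca(X, 2)
```

Plotting <code>scores[:, 0]</code> against <code>scores[:, 1]</code> gives the score plot mentioned in the text.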
+
As mentioned in e-mail, this is a great start!  I plan to do some gentle editing, but I was wondering, shall we make the Flow Chart more central / place it earlier?  To me, the steps in the flow chart provide a natural outline of how the entire page could be organized. I'd move it a little earlier, then move some existing material around so that each box in the flow chart corresponds to some level of header. For example, I would move the current section 1 to be part of two new sections: supervised modeling (or maybe just modeling) and unsupervised modeling (perhaps renamed to Exploratory Data Analysis). Thoughts? I'm prepared to make the changes if there is no objection.--[[User:BryanHanson|BryanHanson]] ([[User talk:BryanHanson|talk]]) 16:27, 28 August 2017 (EDT)
 
 
The multivariate curve resolution-alternating least squares (MCR-ALS) method is one of the advanced chemometric methods used for the exploration of NMR data. MCR-ALS mathematically decomposes a global mixed instrumental response matrix '''D''', which contains the raw mixture signals, into two data matrices '''C''' and '''S<sup>T</sup>''' ('''D''' = '''CS<sup>T</sup>''' + '''E''', where '''E''' holds the residuals), representing the contributions of the pure components to the mixed signal and their pure response profiles, respectively. Singular value decomposition (SVD), the first step in MCR-ALS optimization, describes the data as orthogonal vectors and is used to estimate the rank (the number of significant components) of the data matrix. The ALS algorithm then iteratively searches for the matrices '''C''' and '''S<sup>T</sup>''' that best fit the initial data. Additional knowledge, when available, can be used to reduce the ambiguity of MCR-ALS results. This information is introduced through constraints, the most common of which are nonnegativity, unimodality and closure, together with suitable initial estimates.
 
 
 
MCR-ALS can extract analyte profiles in the presence of interferents, even when the signals overlap strongly. This means that, unlike partial least squares regression (PLSR, see below), it allows quantification of samples that contain unexpected components not represented in the calibration mixtures.
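The MCR-ALS steps described above (rank estimation by SVD, then alternating updates of '''C''' and '''S<sup>T</sup>''' under constraints) can be sketched as follows. This is a simplified illustration, not a reference implementation: nonnegativity is imposed by simple clipping, the 1 % singular-value threshold is an arbitrary choice, and the synthetic data, the function name <code>mcr_als</code> and the choice of the first and last mixtures as initial estimates are all assumptions.

```python
import numpy as np

def mcr_als(D, St_init, n_iter=100):
    """Alternating least squares for D ~ C @ St, nonnegativity imposed by clipping."""
    St = St_init.copy()
    for _ in range(n_iter):
        C = np.clip(D @ np.linalg.pinv(St), 0.0, None)   # update contributions
        St = np.clip(np.linalg.pinv(C) @ D, 0.0, None)   # update pure profiles
    return C, St

# Hypothetical synthetic data: 15 mixtures of two overlapping Gaussian "peaks"
rng = np.random.default_rng(1)
x = np.arange(100)
S_true = np.vstack([np.exp(-(x - 40) ** 2 / (2 * 8.0 ** 2)),
                    np.exp(-(x - 60) ** 2 / (2 * 8.0 ** 2))])
c1 = np.linspace(1.0, 0.0, 15)
C_true = np.column_stack([c1, 1.0 - c1])          # contributions sum to 1 (closure)
D = C_true @ S_true + 1e-3 * rng.normal(size=(15, 100))

# Step 1: SVD to estimate the number of significant components
s = np.linalg.svd(D, compute_uv=False)
n_comp = int(np.sum(s > 0.01 * s[0]))             # arbitrary 1 % threshold

# Step 2: ALS starting from initial estimates of the pure profiles
# (assumption: the first and last mixtures are nearly pure in this example)
C, St = mcr_als(D, D[[0, -1]])
lack_of_fit = np.linalg.norm(D - C @ St) / np.linalg.norm(D)
```

In real data the initial estimates and the constraints have to be chosen from what is known about the system; the clipping used here is only the simplest way to enforce nonnegativity.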
 
 
 
Multivariate regression methods can also be used for the quantitative determination of multiple analytes with spectral overlap. PLSR is considered the standard method in this category. PLSR is a linear-regression-based method for relating a set of collinear and noisy predictor variables, '''X''' (for example, spectral profiles), to one or more response variables, '''Y''' (analyte concentrations). PLSR reduces data dimensionality by calculating a set of LVs, each of which is checked for predictive power. In PLSR, both the '''X''' and '''Y''' matrices are decomposed as in PCA, giving score and loading vectors; these differ from the PCA vectors, however, because they describe the relation between '''X''' and '''Y'''. The number of significant PLS factors in a calibration model can be determined using cross-validation. The most time-consuming part of the PLSR approach is the preparation of a representative calibration set with known analyte concentrations.
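The calibration and cross-validation steps described above can be illustrated with a small PLS1 model (one response variable) fitted by the NIPALS algorithm. This is a sketch under stated assumptions: the function names <code>pls1_fit</code>, <code>pls1_predict</code> and <code>rmsecv</code>, the synthetic two-component spectra and the leave-one-out scheme are chosen for illustration only.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """NIPALS PLS1: returns regression vector and the centring constants."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)          # weight vector
        t = Xr @ w                         # scores
        tt = t @ t
        p = Xr.T @ t / tt                  # X loadings
        qa = yr @ t / tt                   # y loading
        Xr = Xr - np.outer(t, p)           # deflate X
        yr = yr - qa * t                   # deflate y
        W.append(w); P.append(p); q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)    # regression vector in X space
    return b, x_mean, y_mean

def pls1_predict(model, X):
    b, x_mean, y_mean = model
    return (X - x_mean) @ b + y_mean

def rmsecv(X, y, n_lv):
    """Leave-one-out cross-validated root-mean-square error."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = pls1_fit(X[keep], y[keep], n_lv)
        errors.append(pls1_predict(model, X[i:i + 1])[0] - y[i])
    return float(np.sqrt(np.mean(np.square(errors))))

# Hypothetical calibration and test sets: two overlapping Gaussian "peaks"
rng = np.random.default_rng(2)
x = np.arange(100)
S = np.vstack([np.exp(-(x - 40) ** 2 / 128.0), np.exp(-(x - 60) ** 2 / 128.0)])
C_cal, C_test = rng.uniform(size=(30, 2)), rng.uniform(size=(10, 2))
X_cal = C_cal @ S + 1e-3 * rng.normal(size=(30, 100))
X_test = C_test @ S + 1e-3 * rng.normal(size=(10, 100))
y_cal, y_test = C_cal[:, 0], C_test[:, 0]   # concentration of analyte 1

cv = [rmsecv(X_cal, y_cal, k) for k in (1, 2, 3)]
best_lv = int(np.argmin(cv)) + 1             # number of LVs with lowest RMSECV
model = pls1_fit(X_cal, y_cal, best_lv)
rmsep = float(np.sqrt(np.mean((pls1_predict(model, X_test) - y_test) ** 2)))
```

Picking the number of LVs at the minimum of the cross-validation error is only one possible rule; more conservative criteria are often preferred in practice.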
 
 
 
Independent component analysis (ICA), another signal processing technique, was proposed for solving the blind-source separation problem. The main idea of ICA is to transform the data mathematically into a linear combination of statistically independent components (ICs). ICA minimizes the mutual dependence of the unmixed sources reconstructed from the detected total signal. The component profiles are thus required to satisfy a statistical independence criterion, rather than to maximize the explained variance of the data under meaningful constraints, as in MCR and PCA.
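To make the idea of unmixing statistically independent sources concrete, the sketch below uses a FastICA-type fixed-point iteration with a tanh nonlinearity. FastICA is only one of several ICA algorithms and is not named in the text above; it is used here purely as an example. The function names, the two synthetic sources and the mixing matrix are assumptions.

```python
import numpy as np

def sym_decorrelate(W):
    """Return (W W^T)^(-1/2) W so that the rows of W become orthonormal."""
    d, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(1.0 / np.sqrt(d)) @ E.T @ W

def fast_ica(X, n_iter=200, tol=1e-8, seed=0):
    """X holds one mixed signal per row; returns the estimated sources (rows)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(Xc))
    Z = (E / np.sqrt(d)).T @ Xc                  # whitened signals
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.normal(size=(n, n)))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        # fixed-point update: E[z g(w^T z)] - E[g'(w^T z)] w for every row w
        W_new = (G @ Z.T) / Z.shape[1] - np.diag((1.0 - G ** 2).mean(axis=1)) @ W
        W_new = sym_decorrelate(W_new)
        change = np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1.0))
        W = W_new
        if change < tol:
            break
    return W @ Z

# Hypothetical sources: a sine and a square wave, mixed by an assumed matrix A
rng = np.random.default_rng(4)
t = np.linspace(0, 8, 2000)
S_true = np.vstack([np.sin(2 * t), np.sign(np.sin(3 * t))])
S_true = S_true + 0.05 * rng.normal(size=S_true.shape)
A = np.array([[1.0, 0.5], [0.5, 2.0]])
X = A @ S_true

S_est = fast_ica(X)
# Sources are recovered only up to order, sign and scale, so compare correlations
corr = np.abs(np.corrcoef(np.vstack([S_true, S_est]))[:2, 2:])
```

Each row of <code>corr</code> shows how strongly one true source correlates with each estimated component.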
 
 
 
===Chemometrics for 2D NMR data===
 
Several chemometric methods are available for modeling 2D NMR signals (such as DOSY, HSQC, HMBC, TOCSY, etc.). Parallel factor analysis (PARAFAC) is often considered the most advanced method for the investigation of high-dimensional data. PARAFAC is a generalization of PCA to higher-order arrays. It uses an ALS algorithm to find the mixture components iteratively; each component consists of three informative directions, namely one score vector and two loading vectors, and the part of the data not described by the components is collected in an error array. As in MCR-ALS, constraining the PARAFAC solution by means of orthogonality and non-negativity criteria can improve its interpretability and stability.
 
 
 
Loadings resolved using PARAFAC are not affected by rotational ambiguity and are therefore directly interpretable. The results of a PARAFAC model are assessed using the model fit and the core consistency diagnostic (CORCONDIA), where the latter should be equal to 100% in the ideal case.
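The ALS fitting of a PARAFAC model and the model fit mentioned above can be sketched for a three-way array as follows. This is an unconstrained, minimal illustration with random initialization; the function names, the synthetic rank-2 data and the number of iterations are assumptions, and the core consistency diagnostic is not computed here.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product; rows run over (u, v) with v fastest."""
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])

def parafac_als(X, rank, n_iter=500, seed=0):
    """Fit X[i, j, k] ~ sum_r A[i, r] * B[j, r] * C[k, r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = (rng.normal(size=(n, rank)) for n in (I, J, K))
    X1 = X.reshape(I, J * K)                       # mode-1 unfolding
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)    # mode-2 unfolding
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)    # mode-3 unfolding
    for _ in range(n_iter):
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Hypothetical rank-2 three-way array (10 x 20 x 15) with a little noise
rng = np.random.default_rng(3)
A_true, B_true, C_true = (rng.normal(size=(n, 2)) for n in (10, 20, 15))
X = np.einsum('ir,jr,kr->ijk', A_true, B_true, C_true)
X = X + 0.01 * rng.normal(size=X.shape)

A, B, C = parafac_als(X, rank=2)
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
fit_percent = 100.0 * (1.0 - np.sum((X - X_hat) ** 2) / np.sum(X ** 2))
```

The recovered loadings match the true ones only up to permutation and scaling of the components, which is the remaining (trivial) indeterminacy of the model.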
 
 
 
The improved PARAFAC algorithm, PARAFAC2, also decomposes higher-order data arrays into loading matrices, but does not impose such strong restrictions on the data structure. PARAFAC2 does not assume that the shape (or even the length) of the elution profile of an analyte is the same in each sample; as a consequence of this added flexibility, it is more sensitive to noise.
 
 
 
Tucker3 is another generalization of PCA to higher-order data. The experimental three-way data array is decomposed into three loading matrices and a core tensor. Like PCA, Tucker3 has rotational freedom: any model can be rotated without changing how well it describes the experimental data. Unlike in PARAFAC, the core tensor, '''G''', is not superdiagonal, which allows the interactions between different modes to be analyzed.
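The Tucker3 structure and its rotational freedom can be illustrated with a truncated higher-order SVD (HOSVD), which is one simple way to obtain a Tucker3 model; ALS-based fitting is also common and is not shown. The helper names, the array sizes and the chosen ranks are assumptions for this sketch.

```python
import numpy as np

def unfold(X, mode):
    """Matricize X so that the chosen mode indexes the rows."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_dot(X, M, mode):
    """Multiply tensor X by matrix M along the given mode."""
    return np.moveaxis(np.tensordot(M, X, axes=(1, mode)), 0, mode)

def tucker3_hosvd(X, ranks):
    """Loading matrices from SVDs of the unfoldings, core by projection."""
    loadings = [np.linalg.svd(unfold(X, m), full_matrices=False)[0][:, :r]
                for m, r in enumerate(ranks)]
    G = X
    for m, U in enumerate(loadings):
        G = mode_dot(G, U.T, m)
    return G, loadings

def reconstruct(G, loadings):
    X = G
    for m, U in enumerate(loadings):
        X = mode_dot(X, U, m)
    return X

# Hypothetical 8 x 10 x 6 array with exact Tucker structure of ranks (2, 3, 2)
rng = np.random.default_rng(5)
G_true = rng.normal(size=(2, 3, 2))
factors = [rng.normal(size=(n, r)) for n, r in zip((8, 10, 6), (2, 3, 2))]
X = reconstruct(G_true, factors)

G, (A, B, C) = tucker3_hosvd(X, (2, 3, 2))
rel_error = np.linalg.norm(X - reconstruct(G, [A, B, C])) / np.linalg.norm(X)

# Rotational freedom: rotate the first-mode loadings and counter-rotate the core
Q = np.linalg.qr(rng.normal(size=(2, 2)))[0]
G_rot = mode_dot(G, Q.T, 0)
same_fit = np.allclose(reconstruct(G, [A, B, C]), reconstruct(G_rot, [A @ Q, B, C]))

# The core is generally dense, i.e. not superdiagonal
n_nonzero_core = int(np.sum(np.abs(G) > 1e-6))
```

Because the rotated and unrotated models reproduce the data equally well, the individual loadings of a Tucker3 model cannot be interpreted as uniquely as those of PARAFAC.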
 
 
 
==Flow chart for performing chemometric analysis of NMR data==
 
Fig. 1 shows a workflow for the development of a multivariate model based on NMR data. It comprises the key steps: sample preparation, spectra acquisition, preprocessing, exploratory analysis and multivariate modelling (including validation and optimization).
 
 
 
==Resources==
 
===Basic literature to get familiar with the theory of chemometrics===
 
===Reviews on the combination of NMR with chemometrics===
 
===Useful links===
 

Latest revision as of 15:29, 28 August 2017

Great Start, Suggest making the Flow Chart more central

As mentioned in e-mail, this is a great start! I plan to do some gentle editing, but I was wondering, shall we make the Flow Chart more central / place it earlier? To me, the steps in the flow chart provide a natural outline of how the entire page could be organized. I'd move it a little earlier, then move some existing material around so that each box in the flow chart corresponds to some level of header. For example, I would move the current section 1 to be part of two new sections: supervised modeling (or maybe just modeling) and unsupervised modeling (perhaps renamed to Exploratory Data Analysis). Thoughts? I'm prepared to make the changes if there is no objection.--BryanHanson (talk) 16:27, 28 August 2017 (EDT)