|
|
(22 intermediate revisions by 2 users not shown) |
Line 1: |
Line 1: |
− | ==Theory of chemometric methods== | + | == Great Start, Suggest making the Flow Chart more central == |
− | ===Chemometric methods for 1D NMR data=== | |
− | The primary aims of chemometrics are to separate useful information from noise and to find the crucial patterns in the experimental data. The central idea is to reduce the dimensionality of the data consisting of a large number of measured variables while retaining as much useful information present in it as possible.
| |
| | | |
− | Most chemometric methods are based on the idea of latent variables (LVs). The initial (measured) variables can be combined into a smaller number of LVs that describe the underlying structure of the data. The oldest and most common LV projection method is principal component analysis (PCA). PCA transforms the original data into a new set of a few orthogonal LVs (principal components, PCs) that describe most of the variation in the data. The first PC accounts for the maximal variation in the data, while each successive PC is uncorrelated with the previous PCs and captures as much of the remaining variation as possible. PCA is also a helpful data visualization technique: since each object gets a score value on each PC, objects can be presented in score plots, which can reveal patterns, trends and outliers in the data.
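The PCA description above can be illustrated with a minimal sketch that extracts only the first PC. It uses plain power iteration rather than the SVD or NIPALS routines found in chemometrics packages; the function name, the toy data and the iteration count are assumptions made for this illustration.

```python
import math

def first_principal_component(data, n_iter=200):
    """Return (loading, scores) of the first PC of a list-of-rows data table."""
    n_rows, n_cols = len(data), len(data[0])
    # Mean-center every variable (column) before the decomposition.
    means = [sum(row[j] for row in data) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in data]
    # Power iteration on X^T X, evaluated as X^T (X p).
    # Assumption: the start vector is not orthogonal to the first PC.
    loading = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(n_iter):
        scores = [sum(x * p for x, p in zip(row, loading)) for row in centered]
        new = [sum(centered[i][j] * scores[i] for i in range(n_rows))
               for j in range(n_cols)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0.0:
            break
        loading = [v / norm for v in new]
    # Scores are the projections of the centered objects onto the loading.
    scores = [sum(x * p for x, p in zip(row, loading)) for row in centered]
    return loading, scores

# Hypothetical toy table: 4 objects, 2 strongly correlated variables.
toy = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
loading, scores = first_principal_component(toy)
```

Plotting the `scores` of the first two PCs against each other gives the score plot mentioned above; further PCs would be obtained by deflating the centered data and repeating the procedure.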
| + | As mentioned in e-mail, this is a great start! I plan to do some gentle editing, but I was wondering, shall we make the Flow Chart more central / place it earlier? To me, the steps in the flow chart provide a natural outline of how the entire page could be organized. I'd move it a little earlier, then move some existing material around so that each box in the flow chart corresponds to some level of header. For example, I would move the current section 1 to be part of two new sections: supervised modeling (or maybe just modeling) and unsupervised modeling (perhaps renamed to Exploratory Data Analysis). Thoughts? I'm prepared to make the changes if there is no objection.--[[User:BryanHanson|BryanHanson]] ([[User talk:BryanHanson|talk]]) 16:27, 28 August 2017 (EDT) |
− | | |
− | The multivariate curve resolution-alternating least squares (MCR-ALS) method is one of the advanced chemometric methods used for the exploration of NMR data. MCR-ALS mathematically decomposes a global mixed instrumental response matrix '''D''', containing the raw mixture signals, into two data matrices '''C''' and '''S<sup>T</sup>''', representing the contributions of the pure components to the mixed signal and their pure response profiles, respectively. Singular value decomposition (SVD), the first step in MCR-ALS optimization, describes the data as orthogonal vectors and is used to estimate the rank (number of significant components) of the data matrix. After that, the ALS algorithm iteratively searches for the matrices '''C''' and '''S<sup>T</sup>''' that best fit the initial data. Additional knowledge, when available, can be used to reduce the ambiguity of MCR-ALS results. This information is introduced through constraints, the most commonly used of which are non-negativity, unimodality and closure, together with suitable initial estimates.
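The alternating least squares step can be sketched as follows. This is a deliberately reduced illustration, not a full MCR-ALS implementation: it is hard-wired to two components, skips the SVD rank estimate, applies non-negativity by simple clipping, and runs a fixed number of iterations. All function names and the synthetic data are assumptions.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def inv2(m):
    # Inverse of a 2 x 2 matrix (enough for a two-component model).
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def clip(m):
    # Crude non-negativity constraint: negative entries are set to zero.
    return [[max(v, 0.0) for v in row] for row in m]

def mcr_als_two_components(d, s_init, n_iter=50):
    """d: samples x variables; s_init: 2 x variables initial estimate of S^T."""
    st = [row[:] for row in s_init]
    for _ in range(n_iter):
        s = transpose(st)
        # Least-squares update of C for fixed S^T: C = D S (S^T S)^-1
        c = clip(matmul(matmul(d, s), inv2(matmul(st, s))))
        ct = transpose(c)
        # Least-squares update of S^T for fixed C: S^T = (C^T C)^-1 C^T D
        st = clip(matmul(matmul(inv2(matmul(ct, c)), ct), d))
    return c, st

# Synthetic, noise-free mixture data D = C S^T (hypothetical values).
c_true = [[1.0, 0.5], [0.5, 1.0], [0.8, 0.3], [0.2, 0.9]]
s_true = [[1.0, 0.6, 0.2, 0.0], [0.0, 0.3, 0.8, 1.0]]
d = matmul(c_true, s_true)
# Rough initial estimate of the pure profiles.
s_init = [[1.0, 0.66, 0.36, 0.2], [0.2, 0.42, 0.84, 1.0]]
c_est, st_est = mcr_als_two_components(d, s_init)
d_fit = matmul(c_est, st_est)
```

In this toy case the product `d_fit` reproduces `d`, yet `st_est` need not equal `s_true`: non-negativity alone does not remove the ambiguity, which is why further constraints and good initial estimates matter in practice.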
| |
− | | |
− | MCR-ALS is an algorithm that can extract analyte profiles in the presence of interferences even in the case of a high degree of overlap. This means that, contrary to partial least squares regression (PLSR, see below), quantification of samples containing unexpected components, not represented in the calibration mixtures, is possible.
| |
− | | |
− | For quantitative determination of multiple analytes with spectral overlap, multivariate regression methods can also be used. PLSR is considered a standard method of this category. PLSR is a linear regression method for relating a set of collinear and noisy predictor variables, '''X''' (for example, spectral profiles), to one or more response variables, '''Y''' (analyte concentrations). PLSR reduces data dimensionality by calculating a set of LVs, each of which is checked for predictive power. In PLSR, both the '''X''' and '''Y''' matrices are decomposed as in PCA, providing score and loading vectors; these differ from those provided by PCA, however, because they describe the relation between '''X''' and '''Y'''. The number of significant PLS factors in a calibration model can be determined using cross-validation. The most time-consuming part of the PLSR approach is the preparation of a representative calibration set with known analyte concentrations. | |
− | | |
− | Independent component analysis (ICA), another signal processing technique, was proposed for solving the blind source separation problem. The main idea of ICA is to transform the data mathematically into a linear combination of statistically independent components (ICs). ICA minimizes the mutual dependence of the unmixed sources reconstructed from the detected total signal. The component profiles are thus required to satisfy a statistical independence criterion, rather than to maximize the explained variance (as in PCA) or to fit the data under physically meaningful constraints (as in MCR).
| |
− | | |
− | ===Chemometrics for 2D NMR data===
| |
− | Several chemometric methods are available for modeling 2D NMR signals (such as DOSY, HSQC, HMBC, TOCSY, etc.). Parallel factor analysis (PARAFAC) is often considered the most advanced method for the investigation of high-dimensional data. PARAFAC is a generalization of PCA to higher-order arrays that uses an ALS algorithm to find the mixture components iteratively. For three-way data, each component consists of three informative directions, namely one score vector and two loading vectors; the unexplained part of the data is collected in an error array. As in MCR-ALS, constraining the PARAFAC solutions by means of orthogonality and non-negativity criteria can improve the interpretability and stability of the solution.
| |
− | | |
− | Loadings resolved using PARAFAC methods are not hampered by rotational ambiguity and are, therefore, directly interpretable. The results of a PARAFAC model are assessed using the model fit and the core consistency diagnostic (CORCONDIA), where the latter should be equal to 100% in the ideal case.
| |
− | | |
− | The improved PARAFAC algorithm, PARAFAC2, also decomposes higher-order data arrays into loading matrices, but does not impose such strong restrictions on the data structure. PARAFAC2 does not assume that the shape (or even length) of the elution profile of an analyte is the same in each sample; therefore, it is more sensitive to noise.
| |
− | | |
− | Tucker3 is another generalization of PCA for higher order data. Elements of the experimental 3D matrix are decomposed into three loading matrices and a core tensor. Like PCA, Tucker3 has rotational freedom and any model can be rotated without changing the precision of the description of experimental data. Unlike in PARAFAC analysis, the core tensor, '''G''', is not superdiagonal and allows analysis of interaction between different modes.
| |
− | | |
− | ==Flow chart for performing chemometric analysis of NMR data==
| |
− | Figure 1 shows a workflow for the development of a multivariate model based on NMR data. It contains some important steps such as sample preparation, spectra acquisition, preprocessing, exploratory analysis, as well as multivariate modelling (including validation and optimization). [[File:Fig1_YM_FlowChart.png|thumb|Figure 1. Flow chart to perform multivariate method development for NMR data]]
| |
− | | |
− | ''Sample preparation.'' For high-resolution NMR of liquids (beverages, clinical samples), the samples are often prepared by adding a suitable solvent (frequently containing buffer) and an internal standard. For samples with pH-dependent compounds, additional pH adjustment is required. For chemometric analysis, standard procedures should be followed when preparing a series of samples to ensure repeatability and comparability.
| |
− | | |
− | ''Spectra acquisition.'' Spectra acquisition is one of the most challenging steps in the optimization of NMR analysis, regardless of the approach (targeted or non-targeted) used for spectral evaluation. For multivariate analysis, exactly the same pulse program and acquisition parameters must be used for ''all'' samples. It should also be checked that the suppression scheme used does not affect (or affects equally) the whole series of signals located close to the suppressed region.
| |
− | | |
− | ''Preprocessing.'' Adequate baseline and phase correction is fundamental for multivariate spectral modelling. The phase and the baseline of the spectra are usually corrected manually or with one of several automatic correction functions from the spectrometer software (Figure 2). [[File:Fig2_YM_BCHMF.png|thumb|Figure 2. Baseline correction by the moving minimum algorithm for <sup>1</sup>H NMR signal of hydroxymethylfurfural in cola sample]]
| |
− | | |
− | A more serious preprocessing problem is the chemical shift variation that may occur from sample to sample or even from peak to peak. The overall sample-to-sample variations are due to small variations in spectrometer frequency, and the straightforward solution is to translate the entire spectrum so that an internal reference peak (TSP or TMS) is aligned. The peak-to-peak chemical shift variations due to differences in, for example, pH or concentration are much more difficult to handle. The most widely used method for addressing this chemical shift variability across spectra is bucketing (also called binning). The procedure consists of segmenting a spectrum into small regions (buckets) and taking the area under the spectrum for each segment. The bucket width typically varies between 0.01 and 0.05 ppm for <sup>1</sup>H NMR.
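The bucketing procedure can be sketched in a few lines. The function below assumes an equally spaced ppm axis and approximates each bucket area with a simple rectangle rule; its name, the bucket width and the toy spectra are assumptions chosen for illustration.

```python
def bucket_spectrum(ppm, intensity, width):
    """Sum the area under the spectrum within fixed-width buckets."""
    step = abs(ppm[1] - ppm[0])   # assumes an equally spaced ppm axis
    start = min(ppm)
    areas = {}
    for p, y in zip(ppm, intensity):
        # Small offset guards against floating-point error at bucket edges.
        index = int((p - start) / width + 1e-9)
        areas[index] = areas.get(index, 0.0) + y * step
    return [areas.get(i, 0.0) for i in range(max(areas) + 1)]

# Hypothetical toy spectra: 20 points from 0.00 to 0.19 ppm.
ppm = [i / 100 for i in range(20)]
spectrum_a = [0.0] * 20
spectrum_b = [0.0] * 20
spectrum_a[7] = 1.0   # peak at 0.07 ppm
spectrum_b[8] = 1.0   # same peak, shifted by 0.01 ppm
buckets_a = bucket_spectrum(ppm, spectrum_a, width=0.05)
buckets_b = bucket_spectrum(ppm, spectrum_b, width=0.05)
```

Both toy spectra give the same bucket table because the small shift stays inside one 0.05 ppm bucket; a peak that crosses a bucket boundary would not be corrected this way.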
| |
− | | |
− | The major drawback of this procedure is the loss of a considerable amount of the information contained in the original spectra. There are several good alternatives to binning, which involve some form of peak alignment without data reduction. For example, the interval correlation optimized shifting algorithm (icoshift) [F. Savorani, G. Tomasi, S.B. Engelsen, Icoshift: A versatile tool for the rapid alignment of 1D NMR spectra. J. Magn. Reson. 2010; 202: 190-202] splits the spectral data set into intervals and shifts each interval left or right to maximize its correlation with a target spectrum (Figure 3). [[File:Fig3_YM_icoshift.png|thumb|Figure 3. Alignment by the icoshift algorithm for <sup>13</sup>C NMR signals of carboxyl group region in a set of milk samples]]
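The core idea of correlation-based shifting can be sketched for a single interval as follows. This is not the icoshift implementation: it searches integer shifts by brute force, scores each shift with a plain dot product against the target, and pads the vacated points with zeros. The function name and the toy data are assumptions.

```python
def best_shift(target, spectrum, max_shift):
    """Find the integer shift (in points) that best overlays spectrum on target."""
    def shifted(values, shift):
        n = len(values)
        # Positive shift moves the signal to the right; gaps are zero-filled.
        return [values[i - shift] if 0 <= i - shift < n else 0.0
                for i in range(n)]

    best, best_score = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        # Unnormalized cross-correlation between target and shifted interval.
        score = sum(t * s for t, s in zip(target, shifted(spectrum, shift)))
        if score > best_score:
            best, best_score = shift, score
    return best, shifted(spectrum, best)

# Hypothetical interval: the same peak, displaced by two points.
target = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0]
spectrum = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0]
shift, aligned = best_shift(target, spectrum, max_shift=3)
```

Applying the same search independently to each interval of a spectrum gives the interval-wise alignment described above, with no loss of resolution.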
| |
− | | |
− | ==Resources==
| |
− | ===Basic literature to get familiar with the theory of chemometrics===
| |
− | ===Reviews on the combination of NMR with chemometrics===
| |
− | ===Useful links===
| |