Difference between revisions of "Talk:Chemometrics"

From validwiki
==Theory of chemometric methods==
 
===Chemometric methods for 1D NMR data===
 
The primary aims of chemometrics are to separate useful information from noise and to find the crucial patterns in experimental data. The central idea is to reduce the dimensionality of data consisting of a large number of measured variables while retaining as much of the useful information as possible.
 
  
Most chemometric methods are based on the idea of latent variables (LVs). The initial (measured) variables can be combined and described by a smaller number of LVs, which capture the underlying structure of the data. The oldest and most common LV projection method is principal component analysis (PCA). PCA transforms the original data into a new set of a few orthogonal LVs (principal components, PCs) that describe most of the variation in the data. The first PC accounts for the maximal variation in the data, while each successive PC is uncorrelated with the previous PCs and expresses as much of the remaining variation as possible. PCA is also a helpful data visualization technique: since each object gets a score value on each PC, objects can be presented in score plots, which can reveal patterns, trends and outliers in the data.
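
The extraction of the first PC can be illustrated with a minimal sketch. It assumes a small made-up data table (rows are objects, columns are variables) and uses plain power iteration on the mean-centered data rather than the SVD or NIPALS routines found in chemometric software; the function names and the data are illustrative only.

```python
import math

def mean_center(X):
    """Subtract the column mean from every column of X (a list of rows)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def first_pc(X, n_iter=200):
    """Return (loading vector, scores) of the first principal component.

    Power iteration: repeatedly apply Xc^T Xc to a start vector and
    normalize. The start vector must not be orthogonal to the first PC.
    """
    Xc = mean_center(X)
    n, p = len(Xc), len(Xc[0])
    w = [1.0] * p  # arbitrary non-zero start vector (assumption)
    for _ in range(n_iter):
        t = [sum(x * wi for x, wi in zip(row, w)) for row in Xc]   # t = Xc w
        w = [sum(t[i] * Xc[i][j] for i in range(n)) for j in range(p)]  # w = Xc^T t
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    scores = [sum(x * wi for x, wi in zip(row, w)) for row in Xc]
    return w, scores

# Made-up data: two strongly correlated variables measured on five objects.
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.9]]
loading, scores = first_pc(data)
```

The scores returned here are the coordinates that would be drawn along the first axis of a score plot.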
 
 
The multivariate curve resolution-alternating least squares (MCR-ALS) method is one of the advanced chemometric methods used for the exploration of NMR data. MCR-ALS mathematically decomposes a global mixed instrumental response matrix '''D''', containing the raw mixture signals, into two data matrices '''C''' and '''S<sup>T</sup>''', which represent the contributions of the pure components to the mixed signal and their pure response profiles, respectively. Singular value decomposition (SVD), the first step in MCR-ALS optimization, describes the data in terms of orthogonal vectors and is used to estimate the rank (number of significant components) of the data matrix. After that, the ALS algorithm iteratively searches for the matrices '''C''' and '''S<sup>T</sup>''' that best fit the initial data. Additional knowledge, when available, can be used to reduce the ambiguity of MCR-ALS results. This information is introduced through constraints, the most commonly used of which are non-negativity, unimodality and closure, together with suitable initial estimates.
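
Written out, the decomposition described above is usually expressed as a bilinear model. In the sketch below, '''C''' and '''S<sup>T</sup>''' are the matrices introduced above, while the residual matrix '''E''' and the pseudoinverse notation (superscript +) are assumptions added for illustration:

<math>\mathbf{D} = \mathbf{C}\,\mathbf{S}^{\mathrm{T}} + \mathbf{E}</math>

One ALS cycle then alternates two least-squares updates, each followed by application of the chosen constraints:

<math>\mathbf{S}^{\mathrm{T}} = \mathbf{C}^{+}\,\mathbf{D}, \qquad \mathbf{C} = \mathbf{D}\,(\mathbf{S}^{\mathrm{T}})^{+}</math>

The cycle is repeated until the residual '''E''' no longer decreases appreciably.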
 
 
MCR-ALS is an algorithm that can extract analyte profiles in the presence of interferences even in the case of a high degree of overlap. This means that, contrary to partial least squares regression (PLSR, see below), quantification of samples containing unexpected components, not represented in the calibration mixtures, is possible.
 
 
For quantitative determination of multiple analytes with spectral overlap, multivariate regression methods can also be used. PLSR is considered a standard method in this category. PLSR is a linear regression method for relating a set of collinear and noisy predictor variables, '''X''' (for example, spectral profiles), to one or more response variables, '''Y''' (analyte concentrations). PLSR reduces data dimensionality by calculating a set of LVs, each of which is checked for predictive power. In PLSR, both the '''X''' and '''Y''' matrices are decomposed as in PCA, providing score and loading vectors; these, however, differ from those provided by PCA because they describe the relation between '''X''' and '''Y'''. The number of significant PLS factors in a calibration model can be determined using cross-validation. The most time-consuming part of the PLSR approach is the preparation of a representative calibration set with known analyte concentrations.
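
As a rough illustration of how an LV links '''X''' to '''Y''', the following sketch fits a PLS model with a single response and a single latent variable (often called PLS1). It is a simplified textbook-style calculation on made-up data, not the implementation of any particular software package; all names are illustrative.

```python
import math

def pls1_one_lv(X, y):
    """Fit a one-latent-variable PLS1 model and return a predict function."""
    n, p = len(X), len(X[0])
    x_mean = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    Xc = [[v - m for v, m in zip(row, x_mean)] for row in X]
    yc = [v - y_mean for v in y]

    # Weight vector: direction in X with maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]

    # Scores of the calibration samples on the latent variable.
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]

    # Regression of y on the scores.
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    def predict(x_new):
        t_new = sum((v - m) * wi for v, m, wi in zip(x_new, x_mean, w))
        return y_mean + q * t_new

    return predict

# Made-up calibration set: two collinear "spectral" variables, one analyte.
X_cal = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
y_cal = [0.5, 1.0, 1.5, 2.0]
predict = pls1_one_lv(X_cal, y_cal)
predicted = predict([5.0, 10.0])
```

Ordinary least squares would fail on these perfectly collinear variables, whereas the single LV summarizes both of them.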
 
 
Independent component analysis (ICA), another signal processing technique, was proposed for solving the blind source separation problem. The main idea of ICA is to transform the data mathematically into a linear combination of statistically independent components (ICs). ICA minimizes the mutual dependence of the unmixed sources reconstructed from the detected total signal. The component profiles are thus required to satisfy a statistical independence criterion, rather than to maximize the explained variance in the data under meaningful constraints, as in MCR and PCA.
 
 
===Chemometrics for 2D NMR data===
 
Several chemometric methods are available for modeling 2D NMR signals (such as DOSY, HSQC, HMBC, TOCSY, etc.). Parallel factor analysis (PARAFAC) is often considered the most advanced method for the investigation of such higher-order data. PARAFAC is a generalization of PCA to higher-order arrays that uses an ALS algorithm to iteratively find the mixture components. For a three-way array, each component consists of three informative directions, namely one score vector and two loading vectors; the part of the data not described by the components is collected in a residual (error) array. As in MCR-ALS, constraining the PARAFAC solutions by means of orthogonality and non-negativity criteria can improve the interpretability and stability of the solution.
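
For a three-way data array, the PARAFAC model can be written element-wise as follows. The symbols are assumptions chosen for illustration: ''x<sub>ijk</sub>'' is an element of the data array, ''a<sub>if</sub>'' is the score of object ''i'' on component ''f'', ''b<sub>jf</sub>'' and ''c<sub>kf</sub>'' are the corresponding loadings in the two remaining modes, ''F'' is the number of components and ''e<sub>ijk</sub>'' is the residual:

<math>x_{ijk} = \sum_{f=1}^{F} a_{if}\, b_{jf}\, c_{kf} + e_{ijk}</math>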
 
 
Loadings resolved using PARAFAC are not affected by rotational ambiguity and are, therefore, directly interpretable. The results of a PARAFAC model are assessed using the model fit and the core consistency diagnostic (CORCONDIA), where the latter should be equal to 100% in the ideal case.
 
 
PARAFAC2, a more flexible variant of the PARAFAC algorithm, also decomposes higher-order data arrays into loading matrices, but does not impose such strong restrictions on the data structure. PARAFAC2 does not assume that the shape (or even length) of the profile of an analyte in one mode (for example, an elution profile) is the same in each sample; this added flexibility can, however, make it more sensitive to noise.
 
 
Tucker3 is another generalization of PCA to higher-order data. The experimental three-way array is decomposed into three loading matrices and a core tensor. Like PCA, Tucker3 has rotational freedom: any model can be rotated without changing the precision with which it describes the experimental data. Unlike in PARAFAC analysis, the core tensor, '''G''', is not superdiagonal and thus allows the interactions between different modes to be analyzed.
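
Element-wise, the Tucker3 model can be sketched as follows. Apart from the core tensor '''G''' (elements ''g<sub>pqr</sub>''), the symbols are assumptions chosen for illustration: ''a<sub>ip</sub>'', ''b<sub>jq</sub>'' and ''c<sub>kr</sub>'' are elements of the three loading matrices, ''P'', ''Q'' and ''R'' are the numbers of components in each mode, and ''e<sub>ijk</sub>'' is the residual:

<math>x_{ijk} = \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R} g_{pqr}\, a_{ip}\, b_{jq}\, c_{kr} + e_{ijk}</math>

If ''P'' = ''Q'' = ''R'' and only the superdiagonal elements ''g<sub>fff</sub>'' are non-zero, the triple sum collapses to a single sum over components, which is the PARAFAC structure.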
 
 
==Flow chart for performing chemometric analysis of NMR data==
 
Figure 1 shows a workflow for the development of a multivariate model based on NMR data. It contains some important steps such as sample preparation, spectra acquisition, preprocessing, exploratory analysis, as well as multivariate modelling (including validation and optimization). [[File:Fig1_YM_FlowChart.png|thumb|Figure 1. Flow chart to perform multivariate method development for NMR data]]
 
 
''Sample preparation.'' For high-resolution NMR of liquids (beverages, clinical samples), the samples are often prepared by adding a suitable solvent (frequently containing buffer) and an internal standard. For samples with pH-dependent compounds, additional pH adjustment is required. For chemometric analysis, standard procedures should be followed when preparing a series of samples to ensure repeatability and comparability.
 
 
''Spectra acquisition.'' Spectra acquisition is one of the most challenging steps in the optimization of NMR analysis, regardless of the type of approach (i.e., targeted or non-targeted) to be used for spectral evaluation. For multivariate analysis, it must be ensured that exactly the same pulse program and acquisition parameters are used for ''all'' samples. It should also be checked that the suppression scheme used does not affect the signals located close to the suppressed region, or affects them equally throughout the whole series.
 
 
''Preprocessing.'' Adequate baseline and phase correction is fundamental for multivariate spectral modelling. The phase and the baseline of the spectra are usually corrected manually or with one of several automatic correction functions from the spectrometer software (Figure 2). [[File:Fig2_YM_BCHMF.png|thumb|Figure 2. Baseline correction by the moving minimum algorithm for <sup>1</sup>H NMR signal of hydroxymethylfurfural in cola sample]]
 
 
A more serious preprocessing problem is the chemical shift variation that may occur from sample to sample or even from peak to peak. The overall sample-to-sample variations are due to small variations in spectrometer frequency, and the straightforward solution is to shift the entire spectrum so that an internal reference peak (TSP or TMS) is aligned. The peak-to-peak chemical shift variations due to differences in, for example, pH or concentration are much more difficult to handle. The most widely used method for addressing this chemical shift variability across spectra is bucketing (also called binning). The procedure consists of segmenting a spectrum into small regions (buckets) and taking the area under the spectrum for each segment. The bucket width typically varies between 0.01 and 0.05 ppm for <sup>1</sup>H NMR.
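
A minimal sketch of bucketing with equal-width buckets is given below. It assumes that the spectrum is available as a list of intensities on an evenly spaced ppm axis, so that the bucket width in ppm equals the number of points per bucket multiplied by the digital resolution; the summed intensity is used as a stand-in for the area. The function name and the data are made up for illustration.

```python
def bucket_spectrum(intensity, points_per_bucket):
    """Sum the intensities within consecutive, equally sized buckets."""
    return [
        sum(intensity[i:i + points_per_bucket])
        for i in range(0, len(intensity), points_per_bucket)
    ]

# Two made-up spectra with the same peak displaced by one data point.
# With 0.01 ppm per point, four points per bucket give 0.04 ppm buckets.
spectrum_a = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
spectrum_b = [1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

buckets_a = bucket_spectrum(spectrum_a, 4)
buckets_b = bucket_spectrum(spectrum_b, 4)
```

Both spectra yield the same bucket table because the small shift stays within one bucket; a peak that straddles a bucket boundary would instead be split between two buckets.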
 
 
The major drawback of bucketing is the loss of a considerable amount of the information contained in the original spectra. There are several good alternatives, which involve some form of peak alignment without data reduction. For example, the interval correlation optimized shifting algorithm (icoshift) <ref>Savorani, F., Tomasi, G., Engelsen, S. B. Icoshift: A versatile tool for the rapid alignment of 1D NMR spectra. ''J. Magn. Reson.'' '''2010''', 202: 190-202. https://doi.org/10.1016/j.jmr.2009.11.012</ref> splits a set of spectra into intervals and shifts each interval of each spectrum left or right to obtain the maximum correlation with a target spectrum (Figure 3). [[File:Fig3_YM_icoshift.png|thumb|Figure 3. Alignment by the icoshift algorithm for <sup>13</sup>C NMR signals in the carboxyl group region for a set of milk samples]]
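
The idea of interval-wise shifting can be sketched as follows. This is a deliberately simplified illustration, not the published icoshift code: it tries every integer shift within a limit by brute force and scores it with a plain dot product against the target, and the interval boundaries, function names and data are assumptions.

```python
def shift_segment(values, s):
    """Shift a segment by s points (right if s > 0), padding with zeros."""
    n = len(values)
    return [values[i - s] if 0 <= i - s < n else 0.0 for i in range(n)]

def coshift_intervals(target, spectrum, intervals, max_shift):
    """Align each interval of a spectrum to the target independently."""
    aligned, shifts = list(spectrum), []
    for start, stop in intervals:
        reference = target[start:stop]
        segment = spectrum[start:stop]
        best, best_score = 0, float("-inf")
        for s in range(-max_shift, max_shift + 1):
            candidate = shift_segment(segment, s)
            score = sum(a * b for a, b in zip(reference, candidate))
            if score > best_score:
                best, best_score = s, score
        aligned[start:stop] = shift_segment(segment, best)
        shifts.append(best)
    return aligned, shifts

# Made-up target and sample spectrum; the two peaks are displaced
# in opposite directions, so a single global shift cannot align both.
target = [0.0, 0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 0.0,
          0.0, 0.0, 0.0, 2.0, 6.0, 2.0, 0.0, 0.0]
sample = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 0.0, 0.0,
          0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 6.0, 2.0]

aligned, shifts = coshift_intervals(target, sample, [(0, 8), (8, 16)], max_shift=3)
```

Because each interval receives its own shift, peak-to-peak variations can be corrected while the full spectral resolution is kept.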
 
 
Data reduction facilitates and accelerates chemometric analysis. Elimination of regions with zero intensities as well as regions containing solvent and internal reference signals is recommended. After data reduction, further treatment by variable selection methods is recommended. Variable selection is the process of selecting a subset of only the relevant features for use in model construction. For variable selection of NMR data, clustering of variables around latent variables (CLV) is recommended <ref>Vigneau, E., Qannari, E. M. Clustering of variables around latent components. ''Commun. Stat. Simul. C.'' '''2003''', 32: 1131–1150. http://dx.doi.org/10.1081/SAC-120023882</ref>. Significant variables are selected using a calibration data set where the classes are known in advance.
 
 
Preprocessing also involves mean-centering and scaling. The mean-centered matrix is obtained by subtracting the mean intensity of each variable from the corresponding variable in each spectrum. Scaling methods are data pretreatment approaches that divide each variable by a scaling factor, which is different for each variable. They adjust for the differences in fold change between the different metabolites by converting the data into differences in concentration relative to the scaling factor.
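
A minimal sketch of mean-centering and scaling is shown below. The two scaling factors offered here, the standard deviation (autoscaling) and its square root (Pareto scaling), are common choices named as assumptions for illustration; they are not prescribed by the text above. The function name and the data are made up.

```python
import math

def center_and_scale(X, method="auto"):
    """Mean-center each column of X and divide it by a scaling factor.

    method="auto"   -> divide by the standard deviation (autoscaling)
    method="pareto" -> divide by the square root of the standard deviation
    method="center" -> mean-centering only
    Constant variables (zero standard deviation) must be removed beforehand.
    """
    n = len(X)
    columns = list(zip(*X))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    if method == "auto":
        factors = stds
    elif method == "pareto":
        factors = [math.sqrt(s) for s in stds]
    else:
        factors = [1.0] * len(means)
    return [[(v - m) / f for v, m, f in zip(row, means, factors)] for row in X]

# Made-up table: the second variable is ten times more intense than the first.
table = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
autoscaled = center_and_scale(table, method="auto")
centered = center_and_scale(table, method="center")
```

After autoscaling, both variables carry the same weight in the model, whereas after mean-centering alone the more intense variable would dominate.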
 
 
''Unsupervised modelling (exploratory analysis).'' The detection of outliers and their removal from the calibration set has to be considered prior to the building of multivariate models. This can be done using, for example, the Mahalanobis distance or a non-targeted approach (PCA). The multivariate model has to be recalculated without the detected outliers. Outliers also have to be excluded from the validation test set.
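
The use of the Mahalanobis distance can be sketched for the simple case of two variables per sample, for instance the scores on the first two PCs. The data and the function name are made up for illustration, and no cut-off value is implied.

```python
import math

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2D point from the mean of all points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    dev = [(p[0] - mean_x, p[1] - mean_y) for p in points]

    # Elements of the sample covariance matrix.
    sxx = sum(dx * dx for dx, _ in dev) / (n - 1)
    syy = sum(dy * dy for _, dy in dev) / (n - 1)
    sxy = sum(dx * dy for dx, dy in dev) / (n - 1)
    det = sxx * syy - sxy * sxy

    # d^2 = dev^T S^-1 dev, with the 2x2 inverse written out explicitly.
    return [
        math.sqrt((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)
        for dx, dy in dev
    ]

# Made-up score values; the last sample lies far from the others.
scores = [(0.1, 0.2), (-0.2, 0.1), (0.0, -0.1), (0.2, -0.2), (-0.1, 0.0), (3.0, 2.5)]
distances = mahalanobis_2d(scores)
suspect = max(range(len(scores)), key=lambda i: distances[i])
```

Note that an extreme sample also inflates the covariance estimate used to compute the distances, so in small data sets the distance of an outlier may be less conspicuous than its position in a score plot suggests.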
 
 
''Supervised modelling.'' All available supervised chemometric methods can be used for the analysis of NMR spectra. Generally, they are aimed at two main objectives: classification of objects, or prediction of analytical parameters whose resonances overlap with signals of other compounds or are not observable in the spectrum (see section Theory of chemometric methods).
 
 
It is essential that validation is performed adequately and produces reliable results. The validation of a multivariate model can be performed by cross-validation or test set validation.
 
 
One round of cross-validation involves partitioning a data set into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
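
The partitioning and averaging described above can be sketched as follows. The example uses interleaved folds and, purely as a placeholder for a real multivariate model, predicts every validation sample by the mean of the training responses; the function names and the data are assumptions.

```python
import math

def k_fold_indices(n_samples, k):
    """Yield (training, validation) index lists for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

def cross_validate_mean_model(y, k):
    """Average RMSE over k rounds for a placeholder 'mean of training' model."""
    fold_rmse = []
    for training, validation in k_fold_indices(len(y), k):
        prediction = sum(y[i] for i in training) / len(training)
        mse = sum((y[i] - prediction) ** 2 for i in validation) / len(validation)
        fold_rmse.append(math.sqrt(mse))
    return sum(fold_rmse) / k

# Made-up reference concentrations for six samples.
concentrations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rmsecv = cross_validate_mean_model(concentrations, k=3)
```

In practice, the placeholder would be replaced by fitting the actual model (for example, PLSR with a given number of factors) on the training samples of each round.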
 
 
Because cross-validation tends to give optimistic estimates of model performance, it should be combined with other techniques such as test set validation. Test set validation should be used if there are enough samples in the data table, for instance more than 50. A test set should contain 20-40% of the full data table. Both the calibration and the test set should cover the whole sample population.
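
One simple way to obtain a test set that spans the same range as the calibration set is sketched below. It assumes that a single response value is known for every sample, sorts the samples by that value and assigns every third one to the test set (about 33%). This is only one of several possible splitting strategies, and the function name and data are made up.

```python
def split_calibration_test(y, every=3):
    """Assign every 'every'-th sample, ranked by response, to the test set."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    test = sorted(order[1::every])
    calibration = sorted(set(order) - set(test))
    return calibration, test

# Made-up responses for 60 samples in arbitrary order.
responses = [float((i * 7) % 60) for i in range(60)]
calibration, test = split_calibration_test(responses)
test_fraction = len(test) / len(responses)
```

Because the test samples are taken at regular intervals along the ranked responses, low, intermediate and high values are represented in both sets.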
 
 
The samples used to construct a multivariate model and for its validation have to be authentic and the desired parameter for classification has to be verified (e.g., by ''a priori'' knowledge obtained during sampling or by application of an adequate reference method).
 
 
For classification purposes, each predefined group has to contain as many samples as possible (not fewer than 20 are recommended). For multivariate calibration, the number of samples in a calibration set should not be less than 50. Collinearities between variables caused by correlated concentrations in the calibration samples have to be avoided. Therefore, the composition of the calibration mixtures should be chosen according to an experimental design.
 
 
Parameters that have to be validated for the specific purpose are summarized in the [http://www.eurolab.org/documents/NMR%20Val%20Guideline%20II%20V6.pdf Eurolab Guidelines, Part II].
 
 
==References==
 
<references />
 
==Resources==
 
===Basic literature to get familiar with the theory of chemometrics===
 
====Principal component analysis (PCA)====
 
[1] [https://www.researchgate.net/profile/Kim_Esbensen/publication/222347483_Principal_Component_Analysis/links/00b4952c66e796fc2d000000.pdf Wold, S., Esbensen, K., Geladi, P. Principal component analysis. ''Chemometr. Intell. Lab.'' '''1987''', 2: 37-52.] <br>
 
 
====Multivariate regression models====
 
[1] [http://www.ece.mcmaster.ca/faculty/reilly/ece712/tutorial%20on%20PLS%20and%20PCA.pdf Geladi, P., Kowalski, B. R. Partial least-squares regression: a tutorial. ''Anal. Chim. Acta.'' '''1986''', 185: 1-17.]<br>
 
[2] [http://www.iasbs.ac.ir/chemistry/chemometrics/history/4th/5.pdf Wold, S., Sjoestroem, M., Eriksson, L. PLS-regression: a basic tool in chemometrics. ''Chemometr. Intell. Lab.'' '''2001''', 58: 109-130.]<br>
 
====Multivariate curve resolution====
 
[1] [http://dx.doi.org/10.1080/10408340600970005 de Juan, A., Tauler, R. Multivariate curve resolution (MCR) from 2000: Progress in concepts and applications. ''Crit. Rev. Anal. Chem.'' '''2006''', 36: 163-176.]<br>
 
[2] [https://www.researchgate.net/publication/223888647_Chemometrics_applied_to_unravel_multicomponent_processes_and_mixtures_Revisiting_latest_trends_in_multivarate_resolution de Juan, A., Tauler, R. Chemometrics applied to unravel multicomponent processes and mixtures. Revisiting latest trends in multivariate resolution. ''Anal. Chim. Acta.'' '''2003''',  500: 195-210.]<br>
 
[3] [https://doi.org/10.1016/j.aca.2012.12.028 Ruckebusch, C., Blanchet, L. Multivariate curve resolution: A review of advanced and tailored applications and challenges. ''Anal. Chim. Acta.'' '''2013''', 765: 28-36.]<br>
 
====Independent component analysis (ICA)====
 
[1] [https://doi.org/10.1016/j.trac.2013.03.013 Rutledge, D. N., Jouan-Rimbaud Bouveresse, D. Independent Components Analysis with the JADE algorithm. ''TrAC-Trend. Anal. Chem.'' '''2013''', 50: 22-32.]<br>
 
[2] [https://www.cs.helsinki.fi/u/ahyvarin/papers/bookfinal_ICA.pdf Hyvaerinen, A., Karhunen, J., Oja, E. Independent Component Analysis, Wiley: New York, 2001, 475 p.]<br>
 
====PARAFAC and TUCKER====
 
[1] [https://www.researchgate.net/publication/223735298_PARAFAC_tutorial_and_applications_Chemom_Intell_Lab_Syst Bro, R. PARAFAC tutorial and applications. ''Chemometr. Intell. Lab.'' '''1997''', 38: 149-171.]<br>
 
[2] Tucker, L. R. The extension of factor analysis to three-dimensional matrices. In ''Contributions to mathematical psychology''; Fredericksen, N., Gulliksen, H., Eds.; Holt, Rinehart & Winston: New York, 1964.<br>
 
====Process analytical technology (PAT)====
 
[1] [https://link.springer.com/article/10.1007%2Fs12010-012-9950-y Challa, S., Potumarthi, R. Chemometrics-based process analytical technology (PAT) tools: applications and adaptation in pharmaceutical and biopharmaceutical industries. ''Appl. Biochem. Biotechnol.'' '''2013''', 169: 66–76.]<br>
 
====Guidelines for NMR validation====
 
[1] [http://www.eurolab.org/documents/NMR%20Val%20Guideline%20II%20V6.pdf Schönberger, T., Monakhova, Y. B., Lachenmeier, D. W., Walch, S., Kuballa, T., et al. ''Guide to NMR Method Development and Validation – Part II: Multivariate data analysis.'' Eurolab Technical Report No. 01/2015.]<br>
 
 
===Reviews on the combination of NMR with chemometrics===
 
[1] [https://doi.org/10.1016/j.sajb.2012.04.001 Heyman, H. M., Meyer, J. J. M. NMR-based metabolomics as a quality control tool for herbal products. ''S. Afr. J. Bot.'' '''2012''', 82: 21-32.]<br>
 
[2] [https://doi.org/10.1016/j.talanta.2014.02.003 Kumar, N., Bansal, A., Sarma, G. S., Rawal, R. K. Chemometrics tools used in analytical chemistry: an overview. ''Talanta'' '''2014''',  123: 186-199.]<br>
 
[3] [http://www.sciencedirect.com/science/article/pii/S007965651100032X?via%3Dihub McKenzie, J. S., Donarski, J. A., Wilson, J. C., Charlton, A. J. Analysis of complex mixtures using high-resolution nuclear magnetic resonance spectroscopy and chemometrics. ''Prog. Nucl. Magn. Reson. Spectrosc.'' '''2011''', 59: 336-359.]<br>
 
[4] [http://www.sciencedirect.com/science/article/pii/S0079656514000600 Brennan, L. NMR-based metabolomics: from sample preparation to applications in nutrition research. ''Prog. Nucl. Magn. Reson. Spectrosc.'' '''2014''', 83: 42-49.]<br>
 
[5] [http://www.sciencedirect.com/science/article/pii/S0079656510000932?via%3Dihub Simpson, A. J., McNally, D. G., Simpson, M. J. NMR spectroscopy in environmental research: from molecular interactions to global processes. ''Prog. Nucl. Magn. Reson. Spectrosc.'' '''2011''', 58: 97-175.]<br>
 
[6] [http://www.sciencedirect.com/science/article/pii/S0731708510007302 Malet-Martino, M., Holzgrabe, U. NMR techniques in biomedical and pharmaceutical analysis. ''J. Pharm. Biomed. Anal.'' '''2011''', 55: 1-15.]<br>
 
[7] [http://pubs.rsc.org/en/content/articlelanding/2002/an/b208254n/unauth#!divAbstract Holmes, E., Antti, H. Chemometric contributions to the evolution of metabonomics: mathematical solutions to characterising and interpreting complex biological NMR spectra. ''Analyst'' '''2002''', 127: 1549-57.]<br>
 
[8] [http://www.sciencedirect.com/science/article/pii/S1043452610590041?via%3Dihub Consonni, R., Cagliani, L. R. Nuclear magnetic resonance and chemometrics to assess geographical origin and quality of traditional food products. ''Adv. Food. Nutr. Res.'' '''2010''', 59: 87-165.]<br>
 
 
===Useful links===
 
====General software====
 
[https://www.mathworks.com/products/matlab.html MATLAB]<br>
 
[https://www.r-project.org/ R computing system]<br>
 
[http://www.camo.com/rt/Products/Unscrambler/unscrambler.html Unscrambler]<br>
 
[http://www.chimiometrie.fr/saisir_webpage.html SAISIR]<br>
 
[http://umetrics.com/products/simca SIMCA-P]<br>
 
[https://www.bruker.com/products/mr/nmr/nmr-software/nmr-software/amix/overview.html AMIX]<br>
 
[https://infometrix.com/pirouette/ Pirouette]<br>
 
[http://ufla.br/chemoface/ Chemoface]<br>
 
 
====Specialized chemometric methods====
 
[http://www.cid.csic.es/homes/rtaqam/ MCR]<br>
 
[https://www.ucl.ac.uk/ion/departments/sobell/Research/RLemon/MILCA/MILCA MILCA and SNICA]<br>
 
[http://read.pudn.com/downloads6/sourcecode/math/22123/image_mva_0/simplisma.m_.htm SIMPLISMA]<br>
 
[http://perso.telecom-paristech.fr/~cardoso/guidesepsou.html JADE]<br>
 
[http://www.cs.umass.edu/~elm/ICA/ RADICAL]<br>
 
[http://www.cis.hut.fi/projects/ica/fastica/ FastICA]<br>
 
 
====Multi-block Toolbox for Matlab====
 
http://www.models.life.ku.dk/~courses/MBtoolbox/mbtmain.htm<br>
 
 
====Public data sets for multivariate data analysis====
 
http://www.models.life.ku.dk/datasets<br>
 
https://www.ucl.ac.uk/ion/departments/sobell/Research/RLemon/MILCA/MILCA<br>
 

Revision as of 15:12, 28 August 2017