In this article, which extends the piece in our newsletter, we explore what Principal Components Analysis (PCA) is, where it is useful, address some key points to be aware of in usage, and describe how to undertake PCA in p:IGI+.
While not to be considered training in the use of PCA in p:IGI+, this is a cut-down version of some of the content provided as part of our online (and, when allowed, in-person) p:IGI+ training courses.
What is PCA?
PCA is a dimension reduction technique – it allows you to determine ‘principal axes’ in your data space that best explain the variation across many properties in as small a number of new properties (which we will call Principal Components, or PCs) as possible. PCA computes the new axes as linear combinations of the variables (properties) you define as inputs. Mathematically it is based on an eigen-decomposition of the covariance matrix of your data, although in practice more robust numerical methods are often used (for those interested, p:IGI+ uses a singular value decomposition for numerical stability).
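The equivalence between the two routes mentioned above can be checked numerically. The following is a minimal NumPy sketch on synthetic data, not p:IGI+ code; the function names `pca_eig` and `pca_svd` are illustrative assumptions.

```python
import numpy as np

def pca_eig(X):
    """PCA via eigen-decomposition of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)                  # centre each property
    cov = np.cov(Xc, rowvar=False)           # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending variance
    return eigvals[order], eigvecs[:, order]

def pca_svd(X):
    """PCA via singular value decomposition of the centred data."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = s**2 / (X.shape[0] - 1)      # singular values -> variances
    return variances, Vt.T                   # columns are the principal axes

rng = np.random.default_rng(0)
# Synthetic data: 200 samples of 3 strongly correlated properties
latent = rng.normal(size=(200, 1))
X = np.hstack([latent + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])

var_eig, axes_eig = pca_eig(X)
var_svd, axes_svd = pca_svd(X)

# Same variances; the leading axis agrees up to an arbitrary sign flip
print(np.allclose(var_eig, var_svd))
print(abs(axes_eig[:, 0] @ axes_svd[:, 0]))
```

The SVD route never forms the covariance matrix explicitly, which is one reason it tends to be numerically better behaved.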
The essence of the idea behind PCA is captured in the sketch to the left. PCA works especially well for correlated data. In the sketch the original properties 1 and 2 are quite strongly correlated, so the data (shown in blue) falls in a region of the space that has a roughly elliptical shape. When you subtract the mean from the data (often called centring, which you should always do when using PCA) you can summarise the data in terms of the covariance between the properties – this is sketched as the green ellipse. The PCs are then the principal axes of the ellipse and are chosen so that the first principal component explains the most variance (variability) in the data set. In the 2-dimensional example above the second (and last) principal component is orthogonal (‘at right angles’) to the first and explains the remaining variance. In this example it might be that you only need to retain the first principal component to explain most of the variability in the data – that is the ‘signal’, and you can ignore the second which might be considered to be largely ‘noise’.
One thing to be aware of is that the ‘signal’ of interest might be present in smaller variance explaining components – there is no guarantee that the first two, or even few, PCs contain the signal of interest.
When to use PCA
PCA can be used in any setting where you want to reduce the number of properties (columns) that need to be considered in visualisation or modelling. It is a data-driven method, so the results are not typically helpful when using physically based models. However, for visualising the data and summarising a large number of properties it is robust and reproducible (the same inputs and pre-processing will always produce the same results), so it should be your default dimension reduction method.
The most widespread use of PCA in oil and gas geochemistry is to summarise molecular composition data from gas chromatography, for example as described in Wang et al (2018), which reviews several papers using PCA (often with some form of clustering) to explore oil-oil and oil-source rock correlations. Commonly authors employ many well-known ratios of compounds as inputs to PCA, although raw peak measurements can also be used with appropriate pre-processing. The PCs can then be plotted to allow 2D visualisation. This is often used to ascribe ‘oil families’, either to clusters of points or by using labelled examples. The calculated PCs can also be used as inputs to further modelling such as clustering if desired.
PCA is not only relevant to GC or GCMS data; it can be applied to any sort of data. For example, it could be used to characterise the inorganic composition of rock from an X-ray fluorescence analysis. With care it can be used to combine a range of variables into composite summaries, but take care with different natural scales of variation.
PCA in p:IGI+
The first thing we need to do when creating a PCA model is to define the ‘training set’. This is the subset of data you use to learn the model parameters. It has two aspects: the set of properties you want to use (represented by a page in p:IGI+) and the set of samples you want to use (represented by a sample set in p:IGI+). Using a page allows you to see your data before running the PCA model.
A key consideration is the proportion of missing values in your data set. It is a good idea to remove properties (columns) with more than 20% missing values, and likewise samples (rows) with more than 20% missing values, as too many missing values will degrade your model.
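Outside p:IGI+, the 20% rule of thumb could be sketched as follows. This is an illustrative NumPy example, assuming missing values are stored as NaN and that properties are filtered before samples; `filter_missing` is a hypothetical helper, not a p:IGI+ function.

```python
import numpy as np

def filter_missing(X, max_missing=0.2):
    """Drop properties, then samples, whose fraction of NaNs exceeds max_missing."""
    X = np.asarray(X, dtype=float)
    keep_cols = np.isnan(X).mean(axis=0) <= max_missing   # per-property check
    X = X[:, keep_cols]
    keep_rows = np.isnan(X).mean(axis=1) <= max_missing   # per-sample check
    return X[keep_rows], keep_rows, keep_cols

nan = np.nan
data = np.array([
    [1.0, 2.0, nan, 4.0],
    [2.0, nan, nan, 5.0],
    [3.0, 4.0, nan, 6.0],
    [4.0, 5.0, 1.0, 7.0],
    [5.0, 6.0, nan, 8.0],
])

filtered, kept_rows, kept_cols = filter_missing(data)
print(filtered.shape)   # third property and second sample are removed
```

Filtering columns first means a sample is not discarded just because of a property that was going to be dropped anyway.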
Once your page and sample set are created, just select Create new PCA… from the right click menu on the page. This brings up the create PCA dialogue, with your page and sample set pre-selected. You need to choose what to do with the missing values – the most robust choice, if you have sufficient samples, is to remove any samples with missing data. You can then build the training set – this will tell you how many properties and samples you have.
Ideally, if you have m properties, you would like at least m² samples to have a reasonable estimate of the sample covariance matrix. If you have fewer samples than properties (n samples, with n < m), your covariance matrix will be ‘rank deficient’ and, after centring, at most n − 1 components will have non-zero variance – this is telling you about the number of samples and not any structure in your data!
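The rank-deficiency effect is easy to reproduce. In this small NumPy sketch (synthetic, uncorrelated data with more properties than samples), the centred data of n samples supports at most n − 1 components with non-zero variance, whatever the true structure.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_properties = 5, 10            # far fewer samples than properties
X = rng.normal(size=(n_samples, n_properties))

Xc = X - X.mean(axis=0)                    # centring removes one degree of freedom
s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
variances = s**2 / (n_samples - 1)

# Count components whose variance is not numerically zero
n_nonzero = int(np.sum(variances > 1e-10 * variances[0]))
print(n_nonzero)
```

Here the count is limited by the sample size alone, even though the ten properties were generated independently.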
In p:IGI+ you give your PCA model a name because the PCA model can be stored in the project, and indeed applied in other projects, by saving it as a template. Before you train your model, you need to select how to pre-process the observations.
If you are using raw GC data, for example heights or areas from a whole oil GC analysis, you should ideally normalise the data (so each sample is measured on a consistent scale). This is really the only time to apply normalisation.
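As an illustration only, one common form of normalisation scales each sample's peaks so they sum to a constant. That this is the scheme intended here is an assumption, and `normalise_rows` is a hypothetical helper rather than the p:IGI+ implementation.

```python
import numpy as np

def normalise_rows(X, total=100.0):
    """Scale each sample (row) so its values sum to `total` (assumed scheme)."""
    X = np.asarray(X, dtype=float)
    return total * X / X.sum(axis=1, keepdims=True)

# Two hypothetical samples with the same composition but a 10x different response
peaks = np.array([
    [10.0, 20.0, 70.0],
    [ 1.0,  2.0,  7.0],
])

normalised = normalise_rows(peaks)
print(normalised)   # both rows now describe the same relative composition
```

After normalisation, differences between samples reflect relative composition rather than overall signal strength.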
When you are using data measured on different scales, for example reported in different units, it is typically a good idea to standardise the data, which scales each property to have zero mean and unit variance.
This means you are dealing with the correlation matrix, rather than the covariance matrix, allowing you to compare data measured across different scales with equal weight.
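The link between standardisation and the correlation matrix can be verified directly. The sketch below uses two synthetic properties on very different scales; the data and the helper `explained` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two correlated hypothetical properties measured on very different scales
a = rng.normal(loc=50.0, scale=10.0, size=300)
b = 0.002 * a + rng.normal(scale=0.01, size=300)
X = np.column_stack([a, b])

# Standardise: zero mean and unit variance for each property
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_of_Z = np.cov(Z, rowvar=False)          # covariance of standardised data
corr_of_X = np.corrcoef(X, rowvar=False)    # correlation of the raw data
print(np.allclose(cov_of_Z, corr_of_X))

def explained(data):
    """Proportion of variance explained by each PC, largest first."""
    vals = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]
    return vals / vals.sum()

raw_share = explained(X)   # dominated by the large-scale property
std_share = explained(Z)   # both properties contribute with equal weight
print(raw_share, std_share)
```

Without standardisation the first PC is essentially just the property with the largest numerical variance, which is a statement about units rather than about the data.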
Once you have decided on the pre-processing you can train the model. Once trained, you can inspect the eigenvalues, which are shown in the form of the percentage of the total data variance explained on the ‘scree plot’ as shown above. There are a few things to look out for here.
If only a few PCs explain almost all the variance, make sure your data is measured on the same scales (or that you have standardised). If all components explain a similar proportion of variance, your properties are probably uncorrelated. You often look for a break in slope of the proportion of variance explained to determine the number of PCs to calculate.
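The numbers behind a scree plot can be computed in a few lines. This is a NumPy sketch on synthetic data driven by two underlying factors, not the p:IGI+ implementation; `explained_variance_percent` is a hypothetical name.

```python
import numpy as np

def explained_variance_percent(X, standardise=False):
    """Percentage of total variance explained by each PC, largest first."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    if standardise:
        Xc = Xc / Xc.std(axis=0, ddof=1)
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    variances = s**2 / (X.shape[0] - 1)
    return 100.0 * variances / variances.sum()

rng = np.random.default_rng(3)
# Synthetic example: 6 properties driven by 2 underlying factors plus noise
factors = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 6))
X = factors @ mixing + 0.1 * rng.normal(size=(100, 6))

percent = explained_variance_percent(X)
cumulative = np.cumsum(percent)
print(np.round(percent, 1))
print(np.round(cumulative, 1))
```

In this constructed case the values drop sharply after the second component, which is the kind of break in slope to look for; real data rarely shows one so cleanly.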
p:IGI+ will also provide you with a plot of the component loadings (shown above) – that is the weight of each property on each PC in the model. It can be tempting to over-interpret these and imagine the PCs explain physical processes, but this is rarely the case – they just capture linear projections of the data that explain the most variability in the data set, and sometimes these equate to processes.
One useful application of the loadings plot is to see if there are properties that could be excluded as they are giving essentially the same information. These will plot together on the loadings plot, although it is important to explore whether this is true for all the leading PCs.
Finally, once you are happy with the setup of the model, you can actually create the model. This is stored with the project, can be exported as a template, and will also create new project properties which are the PCs – you can choose how many to create. These PCs, the projection of your data onto the directions of maximum variance, will be defined as equations which will auto-compute for all samples with all inputs having data. They can be used like any other property in the system – plotted on graphs and maps, and used in palettes.
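Conceptually, each stored PC is a linear equation in the input properties: subtract the training mean, then take a weighted sum using the loadings. The NumPy sketch below illustrates that idea on synthetic data; `fit_pca` and `project` are hypothetical names and this is not how p:IGI+ stores its models.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Return the training mean and an m x k loadings matrix."""
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def project(X_new, mean, loadings):
    """Apply the stored equations: PC scores = (x - mean) . loadings."""
    return (np.asarray(X_new, dtype=float) - mean) @ loadings

rng = np.random.default_rng(4)
X_train = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))   # correlated properties

mean, loadings = fit_pca(X_train, n_components=2)
scores = project(X_train, mean, loadings)

# A new sample with a missing input cannot be scored (NaN propagates)
new_samples = X_train[:2].copy()
new_samples[1, 0] = np.nan
new_scores = project(new_samples, mean, loadings)
print(new_scores)
```

Because the mean and loadings are fixed once the model is trained, the same equations can be applied to samples that were never part of the training set.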
So yes, you probably do need PCA at times, but remember PCA is not magic and needs to be considered like any other model. Here are some things to remember when using PCA:
- PCA is a linear method, so if your data lies on a non-linear (curved, not planar) subspace, PCA might not provide a very good summary of your data. However, being a linear method, it is relatively robust.
- PCA uses the covariance (or correlation) matrix to calculate the directions of largest variability – this makes it rather sensitive to outliers (and a good way to find these in high dimensional data).
- When using PCA with lots of inputs, you will need lots of training data to provide a good estimate of the covariance matrix.
- The projection of your input data to your principal components can be re-used in other projects by saving the model as a template.
We hope you enjoy and benefit from using PCA, integrated into version 220.127.116.11 of p:IGI+.
Wang, Y-P, Zou, Y-R, Shi, J-T and Shi, J, 2018. Review of the chemometrics application in oil-oil and oil-source rock correlations, Journal of Natural Gas Geoscience, 3, 217-232. https://doi.org/10.1016/j.jnggs.2018.08.003