Principal Component Analysis

Codecademy Team
An introduction to principal component analysis (PCA) and example implementation in Python

What is Principal Component Analysis?

Principal Component Analysis (PCA) is a dimension reduction method that is frequently used in exploratory data analysis and machine learning. This means that PCA can be leveraged to reduce the number of variables (dimensions) in a dataset without losing too much information.

Why use PCA?

PCA is useful because large datasets with many variables can pose challenges. For example, many machine learning algorithms are computationally intensive, and become slow when datasets are too large. From a data exploration perspective, it is also difficult to visualize a lot of variables at the same time.

Let’s consider the following dataset, which includes four different measurements for 342 penguins. Suppose that we want to see whether we can categorize the penguins into groups (possibly species!) based on these measurements. The first few rows of the dataset look like this:

[Table: first few rows of the penguins dataset]
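If you would like to follow along, one way to obtain this data (an assumption on our part, since the article's loading code isn't shown) is the copy of the Palmer penguins dataset that ships with seaborn, keeping only the four measurement columns:

import seaborn as sns

# Load the Palmer penguins data bundled with seaborn (one possible source of the dataset)
penguins = sns.load_dataset("penguins")

# Keep the four numeric measurements and drop penguins with missing values
measurements = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
penguins = penguins[measurements].dropna()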

One way to visually look for clusters is to check for individual variables with multi-modal distributions (histograms with multiple “humps”) or plot variables against each other and see whether groups emerge. The seaborn pairplot function allows us to create all single-variable histograms and pairwise plots in a single line of code:

sns.pairplot(penguins)

[Figure: pairplot of the four penguin measurements]

Examining these plots, we see some evidence of clusters, but it’s hard to be sure how many there are: in some plots there seem to be no groups, while in others there seem to be two or three. And imagine a 10-by-10 grid of plots for a larger variable set; would we have any hope of drawing conclusions? It would be nice if we could combine variables so that there are fewer plots and each one describes the relationship between all four variables instead of just two. PCA allows us to do just that!

How is PCA implemented?

PCA involves the creation of a new set of variables (called principal components) from linear combinations of existing ones. The trick is that PCA leverages techniques from linear algebra to define the principal components in a way that describes as much variation as possible from the original data with fewer new variables. This happens sequentially: the first principal component describes as much variation as possible; the second describes as much of the left-over variation as possible (with the constraint that it must be uncorrelated with the first principal component); and so on. At maximum, there can be as many principal components as there are original variables, but we can choose to use fewer of them if we see a large percentage of variation captured in some smaller set.
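To make the linear algebra concrete, here is a minimal conceptual sketch (our illustration, not the article’s code): the principal components are the eigenvectors of the data’s covariance matrix, ordered by how much variance (the eigenvalue) each one explains. scikit-learn’s PCA arrives at the same result via a singular value decomposition.

import numpy as np

# Conceptual PCA on a standardized data matrix X (rows = observations, columns = variables)
def pca_by_eigendecomposition(X):
    cov = np.cov(X, rowvar=False)                  # covariance between the variables
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]          # sort from most to least variance explained
    return eigenvalues[order], eigenvectors[:, order]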

Because principal components are meant to capture variation among existing variables, it is generally best practice to standardize all variables before implementing PCA so that everything is on the same scale. For example, you may have noticed that the body_mass_g column in the penguins dataset contains much larger numbers than the other columns. A difference of 50 grams is very small with respect to body mass — but a 50 mm difference in flipper length is huge! In order to avoid over-weighting the variation in body_mass_g simply because of the large magnitude of values in that column, standardization can be used to put everything on the same scale.
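The pre-processing code isn’t shown here, but one common approach (sketched below, and assumed for the rest of this example) is scikit-learn’s StandardScaler, with the scaled values stored back into penguins so the PCA code that follows works with standardized data:

from sklearn.preprocessing import StandardScaler

# Rescale every column to mean 0 and standard deviation 1 so that no variable
# dominates the variance simply because of its units
scaler = StandardScaler()
penguins[penguins.columns] = scaler.fit_transform(penguins)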

After data pre-processing is complete, we can implement PCA. To start, it often makes sense to calculate all of the principal components and then decide which ones to keep. In this case, we can calculate a maximum of four principal components because there are four variables in the penguins dataset. The code below uses the PCA class from sklearn.decomposition to fit the model and inspect the principal components:

import pandas as pd
from sklearn.decomposition import PCA

# Fit PCA with the maximum possible number of components (four, one per variable)
penguins_pca = PCA(n_components=4)
components = penguins_pca.fit(penguins).components_

# Arrange the loadings so rows are the original variables and columns are the components
components = pd.DataFrame(components).transpose()
components.columns = ['Comp1', 'Comp2', 'Comp3', 'Comp4']
components.index = penguins.columns
print(components)

[Table: loadings of the four principal components for each measurement]

This output tells us that the first principal component, for example, can be calculated as 0.455*bill_length_mm - 0.400*bill_depth_mm + 0.576*flipper_length_mm + 0.548*body_mass_g, where each variable is the standardized version of the original measurement. Looking at the principal components themselves can tell us which variables are most responsible for variation in the data. For example, flipper length is weighted most highly in the first component, while bill depth is weighted most highly in the second.
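As a quick check (assuming penguins holds the standardized measurements, as above), multiplying each standardized column by its loading and summing gives each penguin’s score on the first component, matching the first column produced by fit_transform:

# Weighted sum of the standardized measurements = the score on the first component
comp1_scores = (penguins * components['Comp1']).sum(axis=1)
print(comp1_scores.head())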

How do data scientists decide which components to use?

In order to decide which principal components to use in an analysis, we can look at the proportion of variance explained by each one:

var_ratio = penguins_pca.explained_variance_ratio_

# Arrange the proportions of variance in a labeled table
var_ratio = pd.DataFrame(var_ratio).transpose()
var_ratio.columns = ['Comp1', 'Comp2', 'Comp3', 'Comp4']
var_ratio.index = ['Proportion of Variance']
print(var_ratio)

[Table: proportion of variance explained by each of the four components]

This tells us that the first principal component accounts for about 69% of the variation in the data; the second accounts for about 19%; and so on. The first two components combined account for about 69 + 19 = 88% of the overall variation, while the first three account for about 69 + 19 + 9 = 97%. Based on this information, a data scientist would need to make a judgment about how many principal components to keep. At the very least, it seems reasonable to drop component 4 and retain only three components, which still describe 97% of the variation in the data.
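One quick way to see these running totals is to take a cumulative sum of the explained-variance ratios:

import numpy as np

# Cumulative proportion of variance explained by the first 1, 2, 3, and 4 components
print(np.cumsum(penguins_pca.explained_variance_ratio_))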

In practice, data scientists are often working with much larger datasets. In that case, it can be useful to visually inspect the amount of variation explained by each component using a scree plot. For example, suppose we have generated 10 principal components from a 10-variable dataset. We can inspect a plot of the proportion of variance explained for the ten sequential components:

[Figure: scree plot of the proportion of variance explained by ten principal components]

A common technique is to look for an “elbow” in the plot where the proportion of variance starts leveling off. In this example, the elbow starts at the second component, so it would be reasonable to keep only two components (or possibly three to be conservative). After that, the amount of information gained from each additional component is probably not worth the extra complexity.
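A scree plot like the one above can be drawn with a few lines of matplotlib; here pca stands in for a PCA object fitted on the hypothetical 10-variable dataset:

import matplotlib.pyplot as plt

# Plot the proportion of variance explained against the component number
var_explained = pca.explained_variance_ratio_
plt.plot(range(1, len(var_explained) + 1), var_explained, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Proportion of variance explained')
plt.show()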

How can principal components be used to transform data?

After choosing a number of components to keep, we can transform the original data using these principal components and save our new variables. The following code saves the first three components in a data frame called penguins_pcomp:

# Project the standardized data onto the principal components and keep the first three
penguins_pcomp = penguins_pca.fit_transform(penguins)
penguins_pcomp = pd.DataFrame(penguins_pcomp)
penguins_pcomp = penguins_pcomp.iloc[:, 0:3]
penguins_pcomp.columns = ['Comp1', 'Comp2', 'Comp3']
print(penguins_pcomp.head())

[Table: first few rows of the transformed data, with columns Comp1, Comp2, and Comp3]

When we inspect this new dataset, we see that there are now three new variables. Unfortunately, they are less interpretable than the original variables (we probably do not have intuition about what a value of -1.8 for component 1 really means), but they retain about 97% of the variation in the original data while using only three quarters as many variables.

How is the transformed data used?

Finally, we can use these new variables for modeling and data exploration. For example, we can now use unsupervised machine learning techniques to find clusters of penguins based on the principal components. We can also use the principal components as the independent variable set for any supervised machine learning model, although this is done sparingly in practice because of interpretability issues.
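For instance, a clustering algorithm such as k-means can be run directly on the reduced data; the sketch below (our example, not the article’s) asks for three clusters, matching the three penguin species we suspect are present:

from sklearn.cluster import KMeans

# Cluster the penguins using only the three retained principal components
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
cluster_labels = kmeans.fit_predict(penguins_pcomp)
print(cluster_labels[:10])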

Principal components are also useful for further data exploration and visualization. For example, we can plot the first three principal components and see what new patterns emerge:

sns.pairplot(penguins_pcomp)

[Figure: pairplot of the first three principal components]

Based on this new set of plots, we might conclude that there are two clear groups of penguins in this dataset; however, the third principal component (which weights bill length most highly) is helpful in discerning a possible third group of penguins. Our transformed dataset has illuminated new avenues for exploration!

References

Data downloaded from:

Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.

Data originally published in:

Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081

Data citations:

Adélie penguins:

Palmer Station Antarctica LTER and K. Gorman, 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Adélie penguins (Pygoscelis adeliae) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5. Environmental Data Initiative. https://doi.org/10.6073/pasta/98b16d7d563f265cb52372c8ca99e60f (Accessed 2020-06-08).

Gentoo penguins:

Palmer Station Antarctica LTER and K. Gorman, 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Gentoo penguin (Pygoscelis papua) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5. Environmental Data Initiative. https://doi.org/10.6073/pasta/7fca67fb28d56ee2ffa3d9370ebda689 (Accessed 2020-06-08).

Chinstrap penguins:

Palmer Station Antarctica LTER and K. Gorman, 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Chinstrap penguin (Pygoscelis antarcticus) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 6. Environmental Data Initiative. https://doi.org/10.6073/pasta/c14dfcfada8ea13a17536e73eb6fbe9e (Accessed 2020-06-08).