Unsupervised machine learning techniques, such as clustering algorithms, are useful tools for exploratory analysis. These techniques “learn” patterns from untagged data, or data that do not have classifications already attached to them, and they help us see relationships between many variables at once.

Let’s see how cluster analysis works with penguin data that scientists collected in Antarctica! Take a look at the data in the learning environment, which comes from the palmer penguins dataset.

  • We have four measurements (weight, beak length, beak depth, and flipper length) from Gentoo, Adelie, and Chinstrap penguins living on three islands (penguin artwork by @allison_horst). It’s pretty difficult to see relationships between the measurements looking at the raw data.

  • Ideally, we want to see how all the measurements relate to each other, but plotting them all would be a lot to look at. Instead, we can perform a Principal Component Analysis or PCA, which compresses the variables into principal components that can be plotted against each other. After PCA, we can use k-means clustering to look for trends in the data. We see that the penguins fall into three distinct clusters in the PCA plot!

  • It’s cool that our analysis shows three clusters because we know that our data include three penguin species living on three islands. We can use the Rand statistic to see how well species and islands match the k-means clusters. This is similar to the R-squared statistic we use to check the fit of a trend line. With a Rand score of 0.97 out of 1.0, species have a stronger relationship to clustering!

Next up, we’ll check out inferential analysis!

Take this course for free

Mini Info Outline Icon
By signing up for Codecademy, you agree to Codecademy's Terms of Service & Privacy Policy.

Or sign up using:

Already have an account?