Key Concepts

Review core concepts you need to learn to master this subject

Unsupervised Learning Basics

Patterns and structure can be found in unlabeled data using unsupervised learning, an important branch of machine learning. Clustering is the most popular unsupervised learning technique; it groups data points into clusters based on their similarity. Because most real-world datasets are unlabeled, unsupervised learning is widely applicable.

Possible applications of clustering include:

  • Search engines: grouping news topics and search results
  • Market segmentation: grouping customers based on geography, demographics, and behaviors

K-Means Algorithm: Intro

K-Means is the most popular clustering algorithm. It uses an iterative technique to group unlabeled data into K clusters based on cluster centers (centroids). Each data point is assigned to a cluster such that the average distance between points and their respective centroids is minimized.

  1. Randomly place K centroids for the initial clusters.
  2. Assign each data point to its nearest centroid.
  3. Update centroid locations based on the locations of the data points.

Repeat Steps 2 and 3 until points don’t move between clusters and centroids stabilize.

K-Means Algorithm: 1st Step

The first step of the K-Means clustering algorithm places K random centroids, which become the centers of the K initial clusters. This step can be implemented in Python using the NumPy np.random.uniform() function; the x- and y-coordinates are chosen uniformly at random within the x and y ranges of the data points.
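A minimal sketch of this step, using small made-up 2D data (the sample values here are illustrative, not from the lesson):

```python
import numpy as np

# Hypothetical 2D data: x- and y-coordinates of the samples
x = np.array([1.0, 1.5, 3.0, 5.0, 3.5, 4.5, 3.5])
y = np.array([1.0, 2.0, 4.0, 7.0, 5.0, 5.0, 4.5])

k = 3  # number of clusters

# Place k random centroids within the range of the data
centroids_x = np.random.uniform(x.min(), x.max(), size=k)
centroids_y = np.random.uniform(y.min(), y.max(), size=k)
centroids = np.array(list(zip(centroids_x, centroids_y)))

print(centroids.shape)  # (3, 2): one (x, y) pair per centroid
```

Because the coordinates are drawn from the data's own ranges, the initial centroids always land inside the bounding box of the dataset.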

K-Means Algorithm: 2nd Step

After randomly choosing centroid locations for K-Means, each data sample is allocated to its closest centroid to start creating more precise clusters.

The distance between each data sample and every centroid is calculated, the minimum distance is selected, and each data sample is assigned a label that indicates its closest cluster.

The distance formula is implemented as a distance() function and applied to each data point.

np.argmin() is used to find the index of the minimum distance, which identifies the closest cluster.
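The assignment step can be sketched as follows; distance() here is a hypothetical Euclidean helper standing in for the lesson's distance function, and the points and centroids are made-up values:

```python
import numpy as np

def distance(a, b):
    """Euclidean distance between two 2D points (illustrative helper)."""
    return np.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

points = np.array([[1.0, 1.0], [5.0, 7.0], [3.0, 4.0]])
centroids = np.array([[1.0, 2.0], [4.0, 6.0]])

labels = np.zeros(len(points), dtype=int)
for i, point in enumerate(points):
    # Distance from this point to every centroid
    distances = np.array([distance(point, c) for c in centroids])
    # Index of the nearest centroid becomes the point's cluster label
    labels[i] = np.argmin(distances)

print(labels)  # [0 1 1]
```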

K-Means Algorithm: 3rd Step

The third step of K-Means updates centroid locations. After the data are assigned to their closest centroids in Step 2, each cluster center is moved to the average of its assigned data points.

The NumPy .mean() function is used to find the average x and y-coordinates of all data points for each cluster and store these as the new centroid locations.
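A sketch of the update step, continuing with made-up points and the labels produced by Step 2:

```python
import numpy as np

points = np.array([[1.0, 1.0], [2.0, 2.0], [5.0, 7.0], [4.0, 6.0]])
labels = np.array([0, 0, 1, 1])  # cluster assignments from Step 2
k = 2

# New centroid = mean of all points assigned to that cluster
centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

print(centroids)  # [[1.5 1.5]
                  #  [4.5 6.5]]
```

Passing axis=0 to .mean() averages the x- and y-coordinates separately, producing one (x, y) center per cluster.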

K-Means: Reaching Convergence

In K-Means, after placing K random centroids, the data samples are repeatedly assigned to the nearest centroid and then centroid locations are updated. This continues until each of the centroids’ coordinates converge, or stop changing.

This sequence of events can be implemented in Python using a while loop. The loop continues until every element of the updated centroids array equals the corresponding element of the previous centroids_old array, i.e. their difference is 0. At that point the centroids have converged and the clusters are complete!
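The full loop can be sketched like this. For determinism in this illustration, the initial centroids are taken from the first k data points rather than placed randomly (a simplification of Step 1); the data are made up:

```python
import numpy as np

def kmeans(points, k):
    """Minimal from-scratch K-Means sketch (not a production implementation)."""
    # Simplified Step 1: use the first k points as initial centroids
    centroids = points[:k].astype(float).copy()
    centroids_old = np.zeros_like(centroids)
    labels = np.zeros(len(points), dtype=int)

    # Repeat Steps 2 and 3 until the centroids stop moving
    while not np.allclose(centroids, centroids_old):
        centroids_old = centroids.copy()
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids, labels = kmeans(points, k=2)
print(labels)  # [0 0 1 1]
```

The np.allclose() check plays the role of the "difference is 0" test: once an assignment/update pass leaves the centroids unchanged, the loop exits.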

K-Means Using Scikit-Learn

Scikit-Learn, or sklearn, is a machine learning library for Python that has a K-Means algorithm implementation that can be used instead of creating one from scratch.

To use it:

  • Import the KMeans class from the sklearn.cluster module and build a model with n_clusters

  • Fit the model to the data samples using .fit()

  • Predict the cluster that each data sample belongs to using .predict() and store these as labels
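The three steps above might look like this in practice; the feature matrix is hypothetical, and n_clusters=2 and random_state=42 are arbitrary choices for the illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature matrix: one row per data sample
samples = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])

# Build a model that finds 2 clusters
model = KMeans(n_clusters=2, n_init=10, random_state=42)

# Fit the model to the data samples
model.fit(samples)

# Predict the cluster each data sample belongs to
labels = model.predict(samples)
print(labels)
```

The fitted model can also label new, unseen samples with the same .predict() call.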

Cross Tabulation Overview

Cross-tabulation groups pieces of data together in order to examine their relationships from a different angle. Correlations within data can sometimes be seen more clearly than when looking only at total responses.

This technique is often performed in Python after running K-Means; the pandas function pd.crosstab() allows for comparison between the resulting cluster labels and user-defined labels for each data sample. In order to validate the results of a K-Means model with this technique, there must be user-defined labels for all data samples.
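A sketch with made-up cluster labels and species names (the values are illustrative, not real Iris results):

```python
import pandas as pd

# Hypothetical results: K-Means cluster labels vs. known species labels
df = pd.DataFrame({
    'labels':  [0, 0, 1, 1, 2, 2],
    'species': ['setosa', 'setosa', 'versicolor',
                'virginica', 'virginica', 'virginica'],
})

# Rows: cluster labels; columns: user-defined labels; cells: counts
ct = pd.crosstab(df['labels'], df['species'])
print(ct)
```

If the clustering matches the real categories well, each row of the table is dominated by a single column.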

K-Means: Inertia

Inertia measures how well a dataset was clustered by K-Means. It is calculated by measuring the distance between each data point and its centroid, squaring this distance, and summing these squared distances across the entire dataset.

A good model is one with low inertia AND a low number of clusters (K). However, this is a tradeoff because as K increases, inertia decreases.

To find the optimal K for a dataset, use the elbow method: plot inertia against K and find the point where the decrease in inertia begins to slow. That bend in the curve is the "elbow," and the corresponding K is a good choice.
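A sketch of computing the inertia curve, using synthetic data with three well-separated groups (the data generation and the range of K values are assumptions for the illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2D data with three visible groups
rng = np.random.default_rng(0)
samples = np.vstack([rng.normal(loc, 0.3, size=(20, 2))
                     for loc in ([0, 0], [5, 5], [0, 5])])

inertias = []
ks = range(1, 7)
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(samples)
    inertias.append(model.inertia_)  # sum of squared distances to nearest centroid

# Inertia always decreases as K grows; the "elbow" is where the drop slows sharply
print(inertias)
```

Plotting ks against inertias (e.g. with Matplotlib) makes the elbow easy to spot; for data like this, it falls at K=3.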

Scikit-Learn Datasets

The scikit-learn library contains built-in datasets in its datasets module that are often used in machine learning problems like classification or regression.

Examples:

  • Iris dataset (classification)
  • Boston house-prices dataset (regression)

The format of these datasets is important to their use with algorithms. For example, each piece of data in the Iris dataset is a sample (an individual flower), and each element within a sample is a feature (e.g., petal width).
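Loading a built-in dataset and inspecting its shape can be done like this:

```python
from sklearn import datasets

# Load the built-in Iris dataset
iris = datasets.load_iris()

print(iris.data.shape)     # (150, 4): 150 samples, 4 features each
print(iris.feature_names)  # names of the 4 features, e.g. 'petal width (cm)'
print(iris.target[:5])     # integer class labels for the first 5 samples
```

The .data attribute is the feature matrix that gets passed to algorithms such as KMeans, and .target holds the known labels, which are useful for validating clusters.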

