
Unsupervised Learning Interview Question

K-Means: Inertia

Inertia measures how well a dataset was clustered by K-Means. It is calculated by measuring the distance between each data point and its centroid, squaring this distance, and summing these squares across the entire dataset.

A good model is one with low inertia AND a low number of clusters (K). However, this is a tradeoff because as K increases, inertia decreases.

To find the optimal K for a dataset, use the Elbow method: find the point where the decrease in inertia begins to slow. In the graph below, K=3 is the “elbow”.

[Figure: “Optimal Number of Clusters” — a line graph of inertia (y-axis, 100 to 700) versus the number of clusters K (x-axis, 1 to 7). Inertia drops steeply from about 650 at K=1 to about 225 at K=2 and 150 at K=3, then declines only gradually (about 130, 115, 105, and 100 at K=4 through 7), producing an elbow at K=3.]
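An elbow curve like the one above can be produced with a short loop; the synthetic data from make_blobs here is illustrative, not from the original example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical sample data with three natural clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit K-Means for a range of K values and record each model's inertia
inertias = []
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    inertias.append(model.inertia_)

# Inertia always decreases as K grows; look for where the drop slows
for k, inertia in zip(range(1, 8), inertias):
    print(f"K={k}: inertia={inertia:.1f}")
```

Plotting `range(1, 8)` against `inertias` reproduces the elbow-shaped curve shown in the figure.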

Unsupervised Learning Basics

Patterns and structure can be found in unlabeled data using unsupervised learning, an important branch of machine learning. Clustering is the most popular unsupervised learning technique; it groups data points into clusters based on their similarity. Because most datasets in the world are unlabeled, unsupervised learning algorithms are widely applicable.

Possible applications of clustering include:

  • Search engines: grouping news topics and search results
  • Market segmentation: grouping customers based on geography, demographics, and behaviors

K-Means Algorithm: Intro

K-Means is the most popular clustering algorithm. It uses an iterative technique to group unlabeled data into K clusters based on cluster centers (centroids). The points in each cluster are assigned so that the total squared distance to their respective centroid is minimized.

  1. Randomly place K centroids for the initial clusters.
  2. Assign each data point to its nearest centroid.
  3. Update centroid locations based on the locations of the data points.

Repeat Steps 2 and 3 until points don’t move between clusters and centroids stabilize.
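The three steps above can be sketched from scratch in a few lines of NumPy; this is a minimal illustration, and the sample points and function name are made up for the example.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Minimal K-Means sketch following the three steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose K data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (a centroid with no assigned points is left where it is)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D points forming two loose groups
pts = np.array([[1., 6.], [2., 5.], [1., 7.], [5., 2.], [6., 3.], [5., 1.]])
labels, centroids = kmeans(pts, k=2)
```

On these points the loop converges in a few iterations, separating the upper-left and lower-right groups.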

[Animation: the K-Means algorithm on a 2-D scatter plot with x and y axes from 0 to 8. Two centroids, drawn as a green and a red “X”, are placed randomly; the points are colored by their nearest centroid, each centroid then moves to the mean of its assigned points, and the assignment and update steps repeat until the centroids settle into the two natural groups, ending near (4.5, 6.5) and (2, 1.5).]

K-Means Using Scikit-Learn

Scikit-Learn, or sklearn, is a machine learning library for Python that has a K-Means algorithm implementation that can be used instead of creating one from scratch.

To use it:

  • Import the KMeans class from the sklearn.cluster module and build a model with n_clusters set to the desired number of clusters

  • Fit the model to the data samples using .fit()

  • Predict the cluster that each data sample belongs to using .predict() and store these as labels

from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(data_samples)
labels = model.predict(data_samples)
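The snippet above can be run end to end on synthetic data; the fitted model also exposes the learned centroids and the inertia described earlier. The make_blobs data here is illustrative, not part of the original example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical unlabeled data samples
data_samples, _ = make_blobs(n_samples=150, centers=3, random_state=42)

model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(data_samples)
labels = model.predict(data_samples)

print(model.cluster_centers_)  # one centroid per cluster, shape (3, 2)
print(model.inertia_)          # sum of squared distances to the nearest centroid
```

Because the samples used for fitting are also the ones being predicted, the same result can be obtained in one step with .fit_predict().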

Inspecting Variable Types

One of the most important first steps when working with a dataset is to inspect the variable types and identify relevant variables. An efficient method to use when inspecting variables is the .head() method, which returns the first five rows of a DataFrame by default.

print(df.head())

One-Hot Encoding with Python

When working with nominal categorical variables in Python, it can be useful to use One-Hot Encoding, a technique that creates a binary variable for each of the nominal categories. This encodes the variable without creating an order among the categories. To one-hot encode a variable in a pandas DataFrame, we can use the .get_dummies() function.

df = pd.get_dummies(data=df, columns=['column1', 'column2'])
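A small worked example shows the binary columns that get_dummies creates; the DataFrame and column names here are made up.

```python
import pandas as pd

# Hypothetical DataFrame with a nominal categorical column
df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': [1, 2, 3]})

# One-hot encode 'color': the column is replaced by one binary
# indicator column per category ('color_blue', 'color_red')
df = pd.get_dummies(data=df, columns=['color'])
print(df)
```

Each row now has a 1 (or True) in exactly one of the indicator columns, with no implied ordering among the categories.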
