Learn

The goal of clustering is to separate data so that data similar to one another are in the same group, while data different from one another are in different groups. So two questions arise:

  • How many groups do we choose?
  • How do we define similarity?

K-Means is the most popular and well-known clustering algorithm, and it tries to address these two questions.

  • The “K” refers to the number of clusters (groups) we expect to find in a dataset.
  • The “Means” refers to the average distance of data to each cluster center, also known as the centroid, which we are trying to minimize.

It is an iterative approach:

  1. Place k random centroids for the initial clusters.
  2. Assign data samples to the nearest centroid.
  3. Update centroids based on the above-assigned data samples.

Repeat Steps 2 and 3 until convergence (when points don’t move between clusters and centroids stabilize).

Once we are happy with our clusters, we can take a new unlabeled datapoint and quickly assign it to the appropriate cluster.

Instructions

In this lesson, we will first implement K-Means the hard way (to help you understand the algorithm) and then the easy way using the sklearn library!

Sign up to start coding

Mini Info Outline Icon
By signing up for Codecademy, you agree to Codecademy's Terms of Service & Privacy Policy.

Or sign up using:

Already have an account?