The goal of clustering is to separate data so that data similar to one another are in the same group, while data different from one another are in different groups. So two questions arise:
- How many groups do we choose?
- How do we define similarity?
K-Means is the best-known and most widely used clustering algorithm, and it tries to address these two questions.
- The “K” refers to the number of clusters (groups) we expect to find in a dataset.
- The “Means” refers to averaging the data points assigned to each cluster; that average is the cluster center, also known as the centroid. The algorithm tries to minimize the distance from each point to its assigned centroid.
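Written as an objective (a standard formulation, implied but not spelled out above), K-Means minimizes the within-cluster sum of squared distances, where $\mu_j$ is the centroid of cluster $C_j$:

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```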
It is an iterative approach:
1. Choose k random centroids for the initial clusters.
2. Assign each data sample to the nearest centroid.
3. Update each centroid based on its assigned data samples.
Repeat Steps 2 and 3 until convergence (when points don’t move between clusters and centroids stabilize).
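The iterative steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lesson's final implementation; the function name and defaults are our own choices:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned samples.
        # (A cluster that loses all its points would need special handling;
        # this sketch assumes that does not happen.)
        centroids_new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: stop once the centroids no longer move.
        if np.allclose(centroids_new, centroids):
            break
        centroids = centroids_new
    return centroids, labels
```

Note that the result depends on the random initialization; production implementations typically run the algorithm several times with different seeds and keep the best clustering.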
Once we are happy with our clusters, we can take a new unlabeled datapoint and quickly assign it to the appropriate cluster.
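Assigning a new point is just a nearest-centroid lookup. In this sketch the centroid values are made up purely for illustration:

```python
import numpy as np

# Hypothetical centroids from an already-fitted model (illustrative values).
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# The new point joins the cluster whose centroid is nearest.
new_point = np.array([9.0, 8.5])
cluster = np.linalg.norm(centroids - new_point, axis=1).argmin()
print(cluster)  # nearest centroid is (10, 10), i.e. cluster index 1
```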
In this lesson, we will first implement K-Means the hard way (to help you understand the algorithm) and then the easy way using the