To recap, the Step 1 of the K-Means algorithm is “Place
k random centroids for the initial clusters”.
The K-Means++ algorithm replaces Step 1 of the K-Means algorithm and adds the following:
- 1.1 The first cluster centroid is randomly picked from the data points.
- 1.2 For each remaining data point, the distance from the point to its nearest cluster centroid is calculated.
- 1.3 The next cluster centroid is picked according to a probability proportional to the distance of each point to its nearest cluster centroid. This makes it likely for the next cluster centroid to be far away from the already initialized centroids.
Repeat 1.2 - 1.3 until
k centroids are chosen.
In the web browser you can see the initialization of centroids by regular K-Means (randomly) and by K-Means++.
Notice that the centroids created by K-Means++ are more spaced out.
Make sure to scroll down to see the second graph!