Suppose we have four data samples that form a rectangle whose width is greater than its height:
If you wanted to find two clusters (k = 2) in the data, which points would you cluster together? You might guess that the vertically aligned points belong together, since the height of the rectangle is smaller than its width. We end up with a left cluster (purple points) and a right cluster (yellow points).
Let’s say we use the regular K-Means algorithm to cluster the points, where the cluster centroids are initialized randomly. We get unlucky and those randomly initialized cluster centroids happen to be the midpoints of the top and bottom line segments of the rectangle formed by the four data points.
The algorithm would converge immediately, without moving the cluster centroids. Consequently, the two top data points are clustered together (yellow points) and the two bottom data points are clustered together (purple points).
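This is a suboptimal clustering: because the width of the rectangle is greater than its height, the two points in each cluster are separated by the longer side, so they sit farther from their centroid than necessary. The optimal clusters would be the two left points as one cluster and the two right points as one cluster, as we thought earlier.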
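One way to make "suboptimal" concrete is to compare the within-cluster sum of squared distances, which is the quantity K-Means tries to minimize. The sketch below assumes a hypothetical rectangle of width 3 and height 2; the coordinates and the helper name `sse` are illustrative choices, not part of the lesson's code.

```python
def sse(clusters):
    """Sum of squared distances from each point to its own cluster's mean."""
    total = 0.0
    for pts in clusters:
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in pts)
    return total


# Hypothetical rectangle: width 3, height 2 (assumed values for illustration).
bottom_left, top_left = (0, 0), (0, 2)
bottom_right, top_right = (3, 0), (3, 2)

# Clustering reached from the unlucky initialization: top pair vs. bottom pair.
top_bottom = [[top_left, top_right], [bottom_left, bottom_right]]
# Clustering we guessed by eye: left pair vs. right pair.
left_right = [[bottom_left, top_left], [bottom_right, top_right]]

sse_top_bottom = sse(top_bottom)
sse_left_right = sse(left_right)

print("top/bottom:", sse_top_bottom)  # 9.0
print("left/right:", sse_left_right)  # 4.0
```

For a rectangle of width w and height h, these two costs work out to w² and h² respectively, so the left/right split has the lower cost whenever the rectangle is wider than it is tall.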
Suppose we have four data samples with these values:
- (1, 1)
- (1, 3)
- (4, 1)
- (4, 3)
And suppose we perform K-Means on this data with k = 2, where the two randomly chosen initial centroids are located at the following positions:
- (2.5, 1)
- (2.5, 3)
What do you think the resulting clusters would look like?
Run script.py to find out the answer.
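The contents of script.py are not shown here. As a stand-in, the following is a minimal pure-Python sketch of the standard K-Means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points), started from the centroids listed above. The function name `kmeans` and its parameters are assumptions made for this illustration.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain K-Means from fixed starting centroids; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = []
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            labels.append(dists.index(min(dists)))

        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k, c in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centroids.append(c)  # keep an empty cluster's centroid in place

        # Converged once no centroid moves.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


points = [(1, 1), (1, 3), (4, 1), (4, 3)]
unlucky_init = [(2.5, 1), (2.5, 3)]

labels, final_centroids = kmeans(points, unlucky_init)
print(labels)           # [0, 1, 0, 1]
print(final_centroids)  # [(2.5, 1), (2.5, 3)]
```

The centroids never move: the two points with y = 1 share one label and the two points with y = 3 share the other. Starting instead from a hypothetical initialization such as `[(1, 2), (4, 2)]` gives labels `[0, 0, 1, 1]`, the left/right split.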