At this point, we have grouped the Iris plants into 3 clusters. But suppose we didn’t know there are three species of Iris in the dataset, what is the best number of clusters? And how do we determine that?
Before we answer that, we need to define what makes a good cluster.
Good clustering results in tight clusters, meaning that the samples in each cluster are bunched together. How spread out the clusters are is measured by inertia: the sum of the squared distances from each sample to the centroid of its cluster. The lower the inertia, the better the model has done.
You can check the inertia of a model by:
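A minimal sketch, assuming a scikit-learn KMeans model fit on the Iris samples (the variable names here are illustrative): the fitted model exposes an `inertia_` attribute.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

samples = load_iris().data

# Fit K-means with 3 clusters, then read off the inertia.
model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(samples)
print(model.inertia_)  # sum of squared distances to each sample's nearest centroid
```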
For the Iris dataset, if we graph all the ks (number of clusters) with their inertias:
Notice how the graph keeps decreasing.
Ultimately, choosing k is a trade-off: the goal is to have low inertia with the fewest clusters.
One way to interpret this graph is to use the elbow method: choose the "elbow" of the inertia plot, the point where inertia begins to decrease more slowly.
In the graph above, 3 is the optimal number of clusters.
First, create two lists:
num_clusters that has values from 1, 2, 3, … 8
inertias that is empty
Then, iterate through num_clusters and run K-means for each number of clusters. Add each model's inertia to the inertias list.
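The steps above can be sketched as follows, assuming scikit-learn's KMeans and the Iris samples from sklearn.datasets:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

samples = load_iris().data

num_clusters = list(range(1, 9))  # k values 1 through 8
inertias = []                     # starts empty

# Fit one K-means model per k and record its inertia.
for k in num_clusters:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(samples)
    inertias.append(model.inertia_)
```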
Finally, plot num_clusters against inertias:

import matplotlib.pyplot as plt

plt.plot(num_clusters, inertias, '-o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.show()