At this point, we have grouped the Iris plants into 3 clusters. But suppose we didn't know that the dataset contains three species of Iris: what would be the best number of clusters, and how would we determine it?
Before we answer that, we need to define what makes a good clustering.
Good clustering results in tight clusters, meaning that the samples in each cluster are bunched together. How spread out the clusters are is measured by inertia: the sum of the squared distances from each sample to the centroid of its cluster. The lower the inertia, the better our model has done.
You can check the inertia of a fitted model with:
print(model.inertia_)
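For example, a minimal sketch of fitting a model and checking its inertia (the variable names `samples` and `model` are assumptions, loading the Iris data via scikit-learn's `load_iris`):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the Iris measurements: 150 samples, 4 features each
samples = load_iris().data

# Fit a K-means model with 3 clusters
model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(samples)

# Inertia: sum of squared distances from each sample to its cluster centroid
print(model.inertia_)
```

A lower value means the samples sit closer to their assigned centroids.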
For the Iris dataset, if we plot each value of k (the number of clusters) against its inertia, notice how the graph keeps decreasing as k increases.
Ultimately, this is a trade-off: the goal is to have low inertia with the fewest clusters possible.
One way to interpret this graph is the elbow method: choose the "elbow" of the inertia plot, the point where inertia begins to decrease more slowly.
In the graph above, 3 is the optimal number of clusters.
Instructions
First, create two lists:

- num_clusters, which holds the values 1, 2, 3, ... 8
- inertias, which starts empty

Then, iterate through num_clusters and fit a K-means model for each number of clusters, appending each model's inertia to the inertias list.

Finally, plot inertias vs. num_clusters:
plt.plot(num_clusters, inertias, '-o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.show()
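Putting the steps above together, one possible sketch of the full elbow plot (again assuming the Iris samples are loaded into a `samples` array via scikit-learn):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

samples = load_iris().data

# k values to try: 1 through 8
num_clusters = list(range(1, 9))
inertias = []

# Fit a K-means model for each k and record its inertia
for k in num_clusters:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(samples)
    inertias.append(model.inertia_)

# Plot inertia against the number of clusters
plt.plot(num_clusters, inertias, '-o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.show()
```

The resulting curve should drop steeply up to k = 3 and flatten afterwards, which is the "elbow" the method looks for.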