At this point, we have clustered the Iris data into 3 different groups (implemented using Python and using scikit-learn). But do the clusters correspond to the actual species? Let’s find out!
First, remember that the Iris dataset comes with target values:
target = iris.target
It looks like:
[ 0 0 0 0 0 ... 2 2 2]
According to the metadata:
- All the 0’s are Iris-setosa
- All the 1’s are Iris-versicolor
- All the 2’s are Iris-virginica
Let’s change these values into the corresponding species using the following code:
species = np.chararray(target.shape, itemsize=150)
for i in range(len(samples)):
    if target[i] == 0:
        species[i] = 'setosa'
    elif target[i] == 1:
        species[i] = 'versicolor'
    elif target[i] == 2:
        species[i] = 'virginica'
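The same mapping can also be written more compactly with a lookup table. This is only a sketch of an alternative, not the solution the exercise asks for: `SPECIES_NAMES` is a hypothetical helper name, and the short `target` list stands in for `iris.target`.

```python
# Hypothetical lookup table based on the metadata above.
SPECIES_NAMES = {0: "setosa", 1: "versicolor", 2: "virginica"}

# Stand-in for iris.target (illustrative values only).
target = [0, 0, 1, 2, 2]

# Translate each numeric target into its species name.
species = [SPECIES_NAMES[t] for t in target]
print(species)
```

A plain list like this can be passed to a DataFrame column in the same way as an array.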
Then we are going to use the Pandas library to perform a cross-tabulation.
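A cross-tabulation counts how many samples fall into each combination of two categorical variables, which can reveal relationships that are hard to see in the raw rows. Here, it counts how many flowers of each species ended up in each cluster.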
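To make the idea concrete, here is a minimal sketch of what a cross-tabulation computes, written in plain Python rather than pandas. The `labels` and `species` lists are made-up toy values, not the Iris results. The point is only that each cell of the table is a count of one (label, species) pair.

```python
from collections import Counter

# Toy data (hypothetical, for illustration only): one cluster label
# and one species name per sample.
labels = [0, 0, 1, 1, 1, 2]
species = ["setosa", "setosa", "virginica", "versicolor", "virginica", "versicolor"]

# Count how often each (label, species) pair occurs.
pair_counts = Counter(zip(labels, species))

# Lay the counts out as rows (labels) by columns (species).
row_keys = sorted(set(labels))
col_keys = sorted(set(species))
table = {row: {col: pair_counts[(row, col)] for col in col_keys} for row in row_keys}

for row in row_keys:
    print(row, [table[row][col] for col in col_keys])
```

Pairs that never occur show up as 0, just as they do in the pandas output.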
The result should look something like:
labels  setosa  versicolor  virginica
0           50           0          0
1            0           2         36
2            0          48         14
The first column holds the cluster labels. The second to fourth columns show how many samples of each Iris species were assigned to each cluster. The label numbers themselves are arbitrary and may differ between runs.
By looking at this, you can conclude that:
- Iris-setosa was clustered with 100% accuracy.
- Iris-versicolor was clustered with 96% accuracy.
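- Iris-virginica didn’t do so well: only 36 of its 50 samples (72%) share a cluster, and the other 14 were grouped with Iris-versicolor.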
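These percentages can be checked with a few lines of arithmetic. The sketch below hard-codes the counts from the table above and, as a simplifying assumption, treats the cluster that holds most of a species as that species' cluster.

```python
# Counts copied from the cross-tabulation above: cluster label -> species -> samples.
ct = {
    0: {"setosa": 50, "versicolor": 0, "virginica": 0},
    1: {"setosa": 0, "versicolor": 2, "virginica": 36},
    2: {"setosa": 0, "versicolor": 48, "virginica": 14},
}

accuracy = {}
for name in ["setosa", "versicolor", "virginica"]:
    counts = [row[name] for row in ct.values()]
    # Share of the species that landed in its largest cluster.
    accuracy[name] = 100 * max(counts) / sum(counts)

for name, pct in accuracy.items():
    print(f"{name}: {pct:.0f}%")
```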
Follow the instructions below to learn how to do a cross-tabulation.
Instructions
pandas is already imported for you:
import pandas as pd
Add the code from the narrative to get the species array and finish the elif statements:
species = np.chararray(target.shape, itemsize=150)
for i in range(len(samples)):
    if target[i] == 0:
        species[i] = 'setosa'
    # finish elif
    # finish elif
Then, below the for loop, create a DataFrame that pairs each cluster label with its species:
df = pd.DataFrame({'labels': labels, 'species': species})
print(df)
Next, use the pd.crosstab() function to perform the cross-tabulation:
ct = pd.crosstab(df['labels'], df['species'])
print(ct)
Expand the right panel (output terminal).
How accurate are the clusters?