Before we implement the K-means algorithm, let’s find a dataset. The sklearn
package embeds some datasets and sample images. One of them is the Iris dataset.
The Iris dataset consists of measurements of sepals and petals of 3 different plant species:
- Iris setosa
- Iris versicolor
- Iris virginica
The sepal is the part that encases and protects the flower when it is in the bud stage. A petal is a leaflike part that is often colorful.
From the sklearn library, import the datasets module:
from sklearn import datasets
To load the Iris dataset:
iris = datasets.load_iris()
The Iris dataset looks like:
[[ 5.1 3.5 1.4 0.2 ]
 [ 4.9 3. 1.4 0.2 ]
 [ 4.7 3.2 1.3 0.2 ]
 [ 4.6 3.1 1.5 0.2 ]
 . . .
 [ 5.9 3. 5.1 1.8 ]]
We call each piece of data a sample. For example, each flower is one sample.
Each characteristic we are interested in is a feature. For example, petal length is a feature of this dataset.
The features of the dataset are:
- Column 0: Sepal length
- Column 1: Sepal width
- Column 2: Petal length
- Column 3: Petal width
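To make the sample/feature layout concrete, here is a minimal sketch that checks the shape of the data and pulls out one feature by its column index. The variable name petal_length is only illustrative.

```python
from sklearn import datasets

iris = datasets.load_iris()

# iris.data is a NumPy array: one row per sample, one column per feature
print(iris.data.shape)       # (150, 4)

# feature_names lists the column labels in the same order as the columns
print(iris.feature_names)

# Select a single feature by column index, e.g. petal length (column 2)
petal_length = iris.data[:, 2]
print(petal_length[:4])      # petal lengths of the first four samples
```

Slicing with [:, 2] keeps every row and only the third column, so the result has one value per flower.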
The 3 species of Iris plants are what we are going to cluster later in this lesson.
Instructions
Import the datasets module and load the Iris data.
Every dataset from sklearn comes with more than just the sample data, and all of them are stored in a similar fashion.
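One way to see what a loaded dataset contains is to list its keys. The sketch below assumes the object returned by load_iris() behaves like a dictionary whose entries can also be read as attributes, which is how the rest of this lesson uses it.

```python
from sklearn import datasets

iris = datasets.load_iris()

# List the pieces of information bundled with the dataset
available_keys = sorted(iris.keys())
print(available_keys)

# Each entry can be read as an attribute or by key; both give the same object
same_object = iris.data is iris["data"]
print(same_object)
```

The keys used in this lesson are data, target, and DESCR.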
First, let’s take a look at the most important thing, the sample data:
print(iris.data)
Each row is a plant!
Since the datasets in sklearn.datasets are used for practice, they come with the answers (target values) in the target key.
Take a look at the target values:
print(iris.target)
The iris.target values give the ground truth for the Iris dataset. Ground truth, in this case, is the number (0, 1, or 2) identifying the species of each flower, which is what we are trying to learn.
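As a quick check on what those numbers mean, the sketch below pairs each target value with its species name and counts the samples per species. It relies on the target_names entry that ships with the dataset; the variable name counts is only illustrative.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# target holds one integer label per sample
print(iris.target.shape)            # (150,)

# target_names maps each integer label to a species name
for label, name in enumerate(iris.target_names):
    print(label, name)

# Count how many samples carry each label
counts = np.bincount(iris.target)
print(counts)                       # [50 50 50]
```

Each species has the same number of samples, so the ground truth is evenly balanced across the three groups we will try to recover by clustering.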
It is always a good idea to read the descriptions of the data:
print(iris.DESCR)
Read through the printed description and answer:
- When was the Iris dataset published?
- What is the unit of measurement?