Before we implement the K-means algorithm, let’s find a dataset. The
sklearn package embeds some datasets and sample images. One of them is the Iris dataset.
The Iris dataset consists of measurements of sepals and petals of 3 different plant species:
- Iris setosa
- Iris versicolor
- Iris virginica
The sepal is the part that encases and protects the flower when it is in the bud stage. A petal is a leaflike part that is often colorful.
From the sklearn library, import the datasets module:
from sklearn import datasets
To load the Iris dataset:
iris = datasets.load_iris()
The Iris dataset looks like:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 ...
 [5.9 3.  5.1 1.8]]
We call each piece of data a sample. For example, each flower is one sample.
Each characteristic we are interested in is a feature. For example, petal length is a feature of this dataset.
The features of the dataset are:
- Column 0: Sepal length
- Column 1: Sepal width
- Column 2: Petal length
- Column 3: Petal width
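The column layout above can be checked against the dataset itself. The following is a minimal sketch, assuming scikit-learn is installed; the variable name petal_length is only illustrative and not part of the lesson.

```python
from sklearn import datasets

iris = datasets.load_iris()

# feature_names lists the columns in the same order as the list above
print(iris.feature_names)

# Column 2 holds petal length; slicing selects it for every sample
petal_length = iris.data[:, 2]  # illustrative variable name
print(petal_length[:4])
```

Selecting a single column this way is handy later when plotting one feature against another.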
The 3 species of Iris plants are what we are going to cluster later in this lesson.
Import the datasets module and load the Iris data.
Every dataset from sklearn comes with a bunch of different information (not just the data) and is stored in a similar fashion.
First, let’s take a look at the most important thing, the sample data, which is stored in iris.data. Each row is a plant!
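As a rough sketch of this step, assuming scikit-learn is installed, the loaded object can be inspected like a dictionary before printing the sample data:

```python
from sklearn import datasets

iris = datasets.load_iris()

# The loaded object behaves like a dictionary; list what it holds
print(list(iris.keys()))

# The sample data is a NumPy array with one row per plant
print(iris.data.shape)  # (150, 4)
print(iris.data[:4])    # first four plants
```

The shape shows 150 samples (rows) and 4 features (columns).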
Since the datasets in sklearn.datasets are used for practice, they come with the answers (target values) in the target attribute.
Take a look at the target values. The iris.target values give the ground truth for the Iris dataset. Ground truth, in this case, is the number (0, 1, or 2) that identifies each flower’s species, which is what we are trying to learn.
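To see how the target numbers relate to species names, one possible sketch (assuming scikit-learn and NumPy are installed; the variable name counts is illustrative) is:

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# One integer label per sample
print(iris.target)

# target_names[i] is the species name for label i
print(iris.target_names)

# Count how many samples carry each label
counts = np.bincount(iris.target)  # illustrative variable name
print(counts)
```

Each of the three species has the same number of samples, so the dataset is balanced.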
It is always a good idea to read the description of the data, which is available as iris.DESCR. Use it to answer:
- When was the Iris dataset published?
- What is the unit of measurement?
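The description can be printed in full and searched for the answers to the questions above. This is a minimal sketch, assuming scikit-learn is installed; the variable name description is illustrative.

```python
from sklearn import datasets

iris = datasets.load_iris()

# DESCR is a plain string describing the dataset's origin and contents
description = iris.DESCR  # illustrative variable name
print(description)
```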