K-Nearest Neighbors (KNN) is a classification algorithm. The central idea is that data points with similar attributes tend to fall into similar categories.
Consider the image to the right. This image is complicated, but for now, let’s just focus on where the data points are being placed. Every data point, whether its color is red, green, or white, has an x value and a y value. As a result, it can be plotted on this two-dimensional graph.
Next, let’s consider the color of the data. The color represents the class that the K-Nearest Neighbor algorithm is trying to classify. In this image, data points can either have the class green or the class red. If a data point is white, this means that it doesn’t have a class yet. The purpose of the algorithm is to classify these unknown points.
Finally, consider the expanding circle around the white point. This circle is finding the k nearest neighbors to the white point. When k = 3, the circle is fairly small. Two of the three nearest neighbors are green, and one is red. So in this case, the algorithm would classify the white point as green. However, when we increase k to 5, the circle expands, and the classification changes. Three of the five nearest neighbors are red and two are green, so now the white point will be classified as red.
This is the central idea behind the K-Nearest Neighbor algorithm. If you have a dataset of points where the class of each point is known, you can take a new point with an unknown class, find its k nearest neighbors, and assign it the class that is most common among them.
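To make the idea concrete, here is a minimal Python sketch of the voting step. It is an illustration under stated assumptions, not a reference implementation: the function name classify, the sample coordinates, and the choice of straight-line (Euclidean) distance are all made up for this example and are not taken from the image. The points were picked so that, as in the description above, the answer is green when k = 3 and red when k = 5.

```python
import math
from collections import Counter


def classify(unknown, labeled_points, k):
    """Return the most common class among the k labeled points nearest to `unknown`."""
    # Sort the labeled points by Euclidean distance to the unknown point.
    by_distance = sorted(
        labeled_points,
        key=lambda item: math.dist(unknown, item[0]),
    )
    # Count the classes of the k closest points and take the majority.
    # Ties are not handled specially in this sketch.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical data: ((x, y), class) pairs chosen for illustration only.
points = [
    ((1.0, 0.0), "green"),
    ((0.0, 1.0), "green"),
    ((1.5, 0.0), "red"),
    ((0.0, 2.0), "red"),
    ((2.0, 0.5), "red"),
    ((5.0, 5.0), "green"),
]
white_point = (0.0, 0.0)

print(classify(white_point, points, 3))  # green: 2 green vs. 1 red
print(classify(white_point, points, 5))  # red: 3 red vs. 2 green
```

Using an odd k with two classes, as here, avoids tied votes.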
Before moving on to the next exercise, consider the image below:
If k = 1, what would the class of the question mark be?
If k = 5, what would it be?
Note that rather than using colors, in this image, the class is denoted by the shape of each point.