# Classification: K-Nearest Neighbors

K-Nearest Neighbors is a supervised machine learning algorithm for classification. You will implement and test this algorithm on several datasets.

Start## Key Concepts

Review core concepts you need to learn to master this subject

K-Nearest Neighbors

KNN of Unknown Data Point

Elbow Curve Validation Technique in K-Nearest Neighbor Algorithm

K-Nearest Neighbors Underfitting and Overfitting

KNN Classification Algorithm in Scikit Learn

Euclidean Distance

Normalizing Data

K-Nearest Neighbors

K-Nearest Neighbors

The K-Nearest Neighbors algorithm is a supervised machine learning algorithm for labeling an unknown data point given existing labeled data.

The nearness of points is typically determined by using distance algorithms such as the Euclidean distance formula based on parameters of the data. The algorithm will classify a point based on the labels of the K nearest neighbor points, where the value of K can be specified.

KNN of Unknown Data Point

KNN of Unknown Data Point

To classify the unknown data point using the KNN (K-Nearest Neighbor) algorithm:

- Normalize the numeric data
- Find the distance between the unknown data point and all training data points
- Sort the distance and find the nearest k data points
- Classify the unknown data point based on the most instances of nearest k points

Elbow Curve Validation Technique in K-Nearest Neighbor Algorithm

Elbow Curve Validation Technique in K-Nearest Neighbor Algorithm

Choosing an optimal `k`

value in KNN determines the number of neighbors we look at when we assign a value to any new observation.

For a very low value of `k`

(suppose k=1), the model overfits on the training data, which leads to a high error rate on the validation set. On the other hand, for a high value of `k`

, the model performs poorly on both train and validation set. When `k`

increases, validation error decreases and then starts increasing in a “U” shape. An optimal value of `k`

can be determined from the elbow curve of the validation error.

K-Nearest Neighbors Underfitting and Overfitting

K-Nearest Neighbors Underfitting and Overfitting

The value of k in the KNN algorithm is related to the error rate of the model. A small value of k could lead to overfitting as well as a big value of k can lead to underfitting. Overfitting imply that the model is well on the training data but has poor performance when new data is coming. Underfitting refers to a model that is not good on the training data and also cannot be generalized to predict new data.

KNN Classification Algorithm in Scikit Learn

KNN Classification Algorithm in Scikit Learn

Scikit-learn is a very popular Machine Learning library in Python which provides a `KNeighborsClassifier`

object which performs the KNN classification. The `n_neighbors`

parameter passed to the `KNeighborsClassifier`

object sets the desired `k`

value that checks the `k`

closest neighbors for each unclassified point.

The object provides a `.fit()`

method which takes in training data and a `.predict()`

method which returns the classification of a set of data points.

Euclidean Distance

Euclidean Distance

The *Euclidean Distance* between two points can be computed, knowing the coordinates of those points.

On a 2-D plane, the distance between two points **p** and **q** is the square-root of the sum of the squares of the difference between their **x** and **y** components. Remember the Pythagorean Theorem: **a^2 + b^2 = c^2** ?

We can write a function to compute this distance. Let’s assume that points are represented by tuples of the form `(x_coord, y_coord)`

. Also remember that computing the square-root of some value **n** can be done in a couple of ways: `math.sqrt(n)`

, using the math library, or `n ** 0.5`

(**n** raised to the power of 1/2).

```
def distance(p1, p2):
x_diff_squared = (p1[0] - p2[0]) ** 2
y_diff_squared = (p1[1] - p2[1]) ** 2
return (x_diff_squared + y_diff_squared) ** 0.5
distance( (0, 0), (3, 4) ) # => 5.0
```

Normalizing Data

Normalizing Data

Normalization is a process of converting the numeric columns in the dataset to a common scale while retaining the underlying differences in the range of values.

For example, Min-max normalization converts each value of the numeric column to a value between 0 and 1 using the formula
`Normalized value = (NumericValue - MinValue) / (MaxValue - MinValue)`

. A downside of Min-max Normalization is that it does not handle outliers very well.

- 1In this lesson, you will learn three different ways to define the distance between two points: 1. Euclidean Distance 2. Manhattan Distance 3. Hamming Distance Before diving into the distance form…
- 2
**Euclidean Distance**is the most commonly used distance formula. To find the Euclidean distance between two points, we first calculate the squared distance between each dimension. If we add up al… - 3
**Manhattan Distance**is extremely similar to Euclidean distance. Rather than summing the squared difference between each dimension, we instead sum the absolute value of the difference between eac… - 4
**Hamming Distance**is another slightly different variation on the distance formula. Instead of finding the difference of each dimension, Hamming distance only cares about whether the dimensions a… - 5Now that you’ve written these three distance formulas yourself, let’s look at how to use them using Python’s SciPy library: - Euclidean Distance .euclidean() - Manhattan Distance .cityblock() - …

- 1
**K-Nearest Neighbors (KNN)**is a classification algorithm. The central idea is that data points with similar attributes tend to fall into similar categories. Consider the image to the right. Th… - 2Before diving into the K-Nearest Neighbors algorithm, let’s first take a minute to think about an example. Consider a dataset of movies. Let’s brainstorm some features of a movie data point. A fe…
- 3In the first exercise, we were able to visualize the dataset and estimate the k nearest neighbors of an unknown point. But a computer isn’t going to be able to do that! We need to define what it m…
- 4Making a movie rating predictor based on just the length and release date of movies is pretty limited. There are so many more interesting pieces of data about movies that we could use! So let’s add…
- 5In the next three lessons, we’ll implement the three steps of the K-Nearest Neighbor Algorithm: 1.
**Normalize the data**2. Find the k nearest neighbors 3. Classify the new point based on those n… - 6The K-Nearest Neighbor Algorithm: 1. Normalize the data 2.
**Find the k nearest neighbors**3. Classify the new point based on those neighbors — Now that our data has been normalized and we kn… - 7The K-Nearest Neighbor Algorithm: 1. Normalize the data 2. Find the k nearest neighbors 3.
**Classify the new point based on those neighbors**— We’ve now found the k nearest neighbors, and ha… - 8Nice work! Your classifier is now able to predict whether a movie will be good or bad. So far, we’ve only tested this on a completely random point [.4, .2, .9]. In this exercise we’re going to pick…
- 9You’ve now built your first K Nearest Neighbors algorithm capable of classification. You can feed your program a never-before-seen movie and it can predict whether its IMDb rating was above or belo…
- 10In the previous exercise, we found that our classifier got one point in the training set correct. Now we can test
*every*point to calculate the validation accuracy. The validation accuracy change… - 11The graph to the right shows the validation accuracy of our movie classifier as k increases. When k is small, overfitting occurs and the accuracy is relatively low. On the other hand, when k gets t…
- 12You’ve now written your own K-Nearest Neighbor classifier from scratch! However, rather than writing your own classifier every time, you can use Python’s sklearn library. sklearn is a Python librar…
- 13Congratulations! You just implemented your very own classifier from scratch and used Python’s sklearn library. In this lesson, you learned some techniques very specific to the K-Nearest Neighbor al…

## What you'll create

Portfolio projects that showcase your new skills

## How you'll master it

Stress-test your knowledge with quizzes that help commit syntax to memory