Key Concepts

Review core concepts you need to learn to master this subject

K-Nearest Neighbors

The K-Nearest Neighbors algorithm is a supervised machine learning algorithm for labeling an unknown data point given existing labeled data.

The nearness of points is typically determined by using distance algorithms such as the Euclidean distance formula based on parameters of the data. The algorithm will classify a point based on the labels of the K nearest neighbor points, where the value of K can be specified.

KNN of Unknown Data Point

To classify the unknown data point using the KNN (K-Nearest Neighbor) algorithm:

  • Normalize the numeric data
  • Find the distance between the unknown data point and all training data points
  • Sort the distance and find the nearest k data points
  • Classify the unknown data point based on the most instances of nearest k points

Elbow Curve Validation Technique in K-Nearest Neighbor Algorithm

Choosing an optimal k value in KNN determines the number of neighbors we look at when we assign a value to any new observation.

For a very low value of k (suppose k=1), the model overfits on the training data, which leads to a high error rate on the validation set. On the other hand, for a high value of k, the model performs poorly on both train and validation set. When k increases, validation error decreases and then starts increasing in a “U” shape. An optimal value of k can be determined from the elbow curve of the validation error.

K-Nearest Neighbors Underfitting and Overfitting

The value of k in the KNN algorithm is related to the error rate of the model. A small value of k could lead to overfitting as well as a big value of k can lead to underfitting. Overfitting imply that the model is well on the training data but has poor performance when new data is coming. Underfitting refers to a model that is not good on the training data and also cannot be generalized to predict new data.

KNN Classification Algorithm in Scikit Learn

Scikit-learn is a very popular Machine Learning library in Python which provides a KNeighborsClassifier object which performs the KNN classification. The n_neighbors parameter passed to the KNeighborsClassifier object sets the desired k value that checks the k closest neighbors for each unclassified point.

The object provides a .fit() method which takes in training data and a .predict() method which returns the classification of a set of data points.

Euclidean Distance

The Euclidean Distance between two points can be computed, knowing the coordinates of those points.

On a 2-D plane, the distance between two points p and q is the square-root of the sum of the squares of the difference between their x and y components. Remember the Pythagorean Theorem: a^2 + b^2 = c^2 ?

We can write a function to compute this distance. Let’s assume that points are represented by tuples of the form (x_coord, y_coord). Also remember that computing the square-root of some value n can be done in a couple of ways: math.sqrt(n), using the math library, or n ** 0.5 (n raised to the power of 1/2).

def distance(p1, p2): x_diff_squared = (p1[0] - p2[0]) ** 2 y_diff_squared = (p1[1] - p2[1]) ** 2 return (x_diff_squared + y_diff_squared) ** 0.5 distance( (0, 0), (3, 4) ) # => 5.0

Normalizing Data

Normalization is a process of converting the numeric columns in the dataset to a common scale while retaining the underlying differences in the range of values.

For example, Min-max normalization converts each value of the numeric column to a value between 0 and 1 using the formula Normalized value = (NumericValue - MinValue) / (MaxValue - MinValue). A downside of Min-max Normalization is that it does not handle outliers very well.

  1. 1
    In this lesson, you will learn three different ways to define the distance between two points: 1. Euclidean Distance 2. Manhattan Distance 3. Hamming Distance Before diving into the distance form…
  2. 2
    Euclidean Distance is the most commonly used distance formula. To find the Euclidean distance between two points, we first calculate the squared distance between each dimension. If we add up al…
  3. 3
    Manhattan Distance is extremely similar to Euclidean distance. Rather than summing the squared difference between each dimension, we instead sum the absolute value of the difference between eac…
  4. 4
    Hamming Distance is another slightly different variation on the distance formula. Instead of finding the difference of each dimension, Hamming distance only cares about whether the dimensions a…
  5. 5
    Now that you’ve written these three distance formulas yourself, let’s look at how to use them using Python’s SciPy library: - Euclidean Distance .euclidean() - Manhattan Distance .cityblock() - …
  1. 1
    K-Nearest Neighbors (KNN) is a classification algorithm. The central idea is that data points with similar attributes tend to fall into similar categories. Consider the image to the right. Th…
  2. 2
    Before diving into the K-Nearest Neighbors algorithm, let’s first take a minute to think about an example. Consider a dataset of movies. Let’s brainstorm some features of a movie data point. A fe…
  3. 3
    In the first exercise, we were able to visualize the dataset and estimate the k nearest neighbors of an unknown point. But a computer isn’t going to be able to do that! We need to define what it m…
  4. 4
    Making a movie rating predictor based on just the length and release date of movies is pretty limited. There are so many more interesting pieces of data about movies that we could use! So let’s add…
  5. 5
    In the next three lessons, we’ll implement the three steps of the K-Nearest Neighbor Algorithm: 1. Normalize the data 2. Find the k nearest neighbors 3. Classify the new point based on those n…
  6. 6
    The K-Nearest Neighbor Algorithm: 1. Normalize the data 2. Find the k nearest neighbors 3. Classify the new point based on those neighbors — Now that our data has been normalized and we kn…
  7. 7
    The K-Nearest Neighbor Algorithm: 1. Normalize the data 2. Find the k nearest neighbors 3. Classify the new point based on those neighbors — We’ve now found the k nearest neighbors, and ha…
  8. 8
    Nice work! Your classifier is now able to predict whether a movie will be good or bad. So far, we’ve only tested this on a completely random point [.4, .2, .9]. In this exercise we’re going to pick…
  9. 9
    You’ve now built your first K Nearest Neighbors algorithm capable of classification. You can feed your program a never-before-seen movie and it can predict whether its IMDb rating was above or belo…
  10. 10
    In the previous exercise, we found that our classifier got one point in the training set correct. Now we can test every point to calculate the validation accuracy. The validation accuracy change…
  11. 11
    The graph to the right shows the validation accuracy of our movie classifier as k increases. When k is small, overfitting occurs and the accuracy is relatively low. On the other hand, when k gets t…
  12. 12
    You’ve now written your own K-Nearest Neighbor classifier from scratch! However, rather than writing your own classifier every time, you can use Python’s sklearn library. sklearn is a Python librar…
  13. 13
    Congratulations! You just implemented your very own classifier from scratch and used Python’s sklearn library. In this lesson, you learned some techniques very specific to the K-Nearest Neighbor al…

What you'll create

Portfolio projects that showcase your new skills

Pro Logo

How you'll master it

Stress-test your knowledge with quizzes that help commit syntax to memory

Pro Logo