The key idea at the heart of word embeddings is distance. Before we explain why, let’s dive into how the distance between vectors can be measured.
There are a variety of ways to find the distance between vectors, and here we will cover three. The first is called Manhattan distance.
In Manhattan distance, also known as city block distance, the distance is defined as the sum of the absolute differences across each individual dimension of the vectors. Consider the vectors [1,2,3] and [2,4,6]. We can calculate the Manhattan distance between them as shown below:
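Manhattan distance = |1 - 2| + |2 - 4| + |3 - 6| = 1 + 2 + 3 = 6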
Another common distance metric is called the Euclidean distance, also known as straight line distance. With this distance metric, we take the square root of the sum of the squares of the differences in each dimension.
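For the same vectors [1,2,3] and [2,4,6], the Euclidean distance works out to:

Euclidean distance = √((1 - 2)² + (2 - 4)² + (3 - 6)²) = √(1 + 4 + 9) = √14 ≈ 3.74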
The final distance we will consider is the cosine distance. Cosine distance is concerned with the angle between two vectors, rather than the distance between the points, or ends, of the vectors. Two vectors that point in the same direction have no angle between them and have a cosine distance of 0. Two perpendicular vectors have a cosine distance of 1, and two vectors that point in opposite directions have a cosine distance of 2. We would show you the calculation, but we don’t want to scare you away! For the mathematically adventurous, you can read up on the calculation here.
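If you would rather see it in code, here is a minimal sketch of the idea using NumPy: cosine distance is 1 minus the cosine similarity, where the cosine similarity is the dot product of the two vectors divided by the product of their lengths (this is the same quantity SciPy’s cosine function returns).

import numpy as np

vector_a = np.array([1, 2, 3])
vector_b = np.array([2, 4, 6])

# Cosine similarity: dot product divided by the product of the vector lengths
cosine_similarity = np.dot(vector_a, vector_b) / (np.linalg.norm(vector_a) * np.linalg.norm(vector_b))

# Cosine distance is 1 minus the cosine similarity
cosine_distance = 1 - cosine_similarity  # 0.0, since the vectors point in the same direction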
We can easily calculate the Manhattan, Euclidean, and cosine distances between vectors using helper functions from SciPy:
from scipy.spatial.distance import cityblock, euclidean, cosine
import numpy as np

vector_a = np.array([1, 2, 3])
vector_b = np.array([2, 4, 6])

# Manhattan distance:
manhattan_d = cityblock(vector_a, vector_b)  # 6

# Euclidean distance:
euclidean_d = euclidean(vector_a, vector_b)  # 3.74

# Cosine distance:
cosine_d = cosine(vector_a, vector_b)  # 0.0
When working with vectors that have a large number of dimensions, such as word embeddings, the Manhattan and Euclidean distances can grow very large simply because there are so many dimensions to sum over. For this reason, cosine distance is often preferred when comparing word embeddings.
Instructions
Provided in script.py are the three vectors from the previous exercise: happy_vec, sad_vec, and angry_vec. Use SciPy to compute the Manhattan distance for the following:
- between happy_vec and sad_vec, storing the result in a variable man_happy_sad
- between sad_vec and angry_vec, storing the result in a variable man_sad_angry

Print man_happy_sad and man_sad_angry to the terminal.
Which word embeddings are a greater distance apart according to Manhattan distance?
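One way this checkpoint might look in script.py (a sketch, assuming happy_vec, sad_vec, and angry_vec are already defined there as NumPy arrays):

from scipy.spatial.distance import cityblock

# Manhattan distance between each pair of word embeddings
man_happy_sad = cityblock(happy_vec, sad_vec)
man_sad_angry = cityblock(sad_vec, angry_vec)

print(man_happy_sad)
print(man_sad_angry)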
Now use SciPy to compute the Euclidean distance between happy_vec and sad_vec, storing the result in a variable euc_happy_sad, as well as the Euclidean distance between sad_vec and angry_vec, storing the result in a variable euc_sad_angry.

Print euc_happy_sad and euc_sad_angry to the terminal.
Which word embeddings are a greater distance apart according to Euclidean distance?
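A sketch of this step, again assuming the three vectors are already defined in script.py:

from scipy.spatial.distance import euclidean

# Euclidean distance between each pair of word embeddings
euc_happy_sad = euclidean(happy_vec, sad_vec)
euc_sad_angry = euclidean(sad_vec, angry_vec)

print(euc_happy_sad)
print(euc_sad_angry)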
Next stop, cosine city! Use SciPy to compute the cosine distance between happy_vec and sad_vec, storing the result in a variable cos_happy_sad, as well as the cosine distance between sad_vec and angry_vec, storing the result in a variable cos_sad_angry.

Print cos_happy_sad and cos_sad_angry to the terminal.
Which word embeddings are further apart according to cosine distance? What else do you notice about the different distance metrics? Are the values similar between the different techniques on each pairing of vectors?
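And a sketch for this final step, under the same assumption that the three vectors are defined in script.py:

from scipy.spatial.distance import cosine

# Cosine distance between each pair of word embeddings
cos_happy_sad = cosine(happy_vec, sad_vec)
cos_sad_angry = cosine(sad_vec, angry_vec)

print(cos_happy_sad)
print(cos_sad_angry)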