Word Embeddings
Word Embeddings Are All About Distance

The idea behind word embeddings is a theory known as the distributional hypothesis. This hypothesis states that words that co-occur in the same contexts tend to have similar meanings. With word embeddings, we map words that appear in similar contexts to nearby points in our vector space (math-speak for the space in which our vectors live).

The numeric values assigned to a word's vector representation are not important in their own right; they gain meaning from how close to or far from each other words are in the vector space.

Thus, the cosine distance between words that appear in similar contexts will be small, and the cosine distance between words that appear in very different contexts will be large.
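To make this concrete, here is a minimal sketch of computing cosine distance with NumPy. The vectors below are tiny made-up "embeddings" for illustration only; real embeddings typically have dozens to hundreds of dimensions.

```python
import numpy as np

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up vectors: "cat" and "kitten" point in similar directions,
# "car" points in a different direction.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_distance(cat, kitten))  # small: similar contexts
print(cosine_distance(cat, car))     # larger: different contexts
```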

We gain value from word embeddings by comparing word vectors and seeing how similar or different they are. Encoded in these comparisons is latent information about how the words are used.



In script.py we have loaded a list of the most common 1,000 words in the English language, most_common_words, and their corresponding vector representations as word embeddings, vector_list.

Inspect these lists by printing the word at index 347 in most_common_words and its corresponding word embedding in vector_list to the terminal.
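The inspection step looks like the following sketch. Since `most_common_words` and `vector_list` are loaded for you in script.py, small stand-in lists (and a smaller index) are used here so the example runs on its own.

```python
# Stand-in data; in script.py these lists hold 1,000 words and
# their embeddings, and the lesson asks for index 347.
most_common_words = ["the", "of", "and", "tree"]
vector_list = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.9, 0.8]]

index = 3  # use 347 with the full lists in script.py
print(most_common_words[index])  # the word
print(vector_list[index])        # its embedding
```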


Also given in script.py is a function find_closest_words(). This function accepts the following as arguments:

  • a list of words
  • their corresponding vector representations
  • a target word

The function returns the words that have the smallest cosine distance between their vector representations and the target word.
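One way a helper like this might be implemented is sketched below; the actual `find_closest_words()` in script.py may differ. This version ranks every word by cosine distance to the target's vector and returns the nearest ones, excluding the target itself. The `n` parameter and the toy data are assumptions for illustration.

```python
import numpy as np

def find_closest_words(word_list, vector_list, word_to_check, n=10):
    # Look up the target word's vector.
    target = np.asarray(vector_list[word_list.index(word_to_check)], dtype=float)

    def cosine_distance(v):
        v = np.asarray(v, dtype=float)
        return 1.0 - np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v))

    # Sort all words by distance to the target, drop the target itself.
    ranked = sorted(zip(word_list, vector_list), key=lambda pair: cosine_distance(pair[1]))
    return [word for word, _ in ranked if word != word_to_check][:n]

# Toy data -- the lesson's lists hold 1,000 words with real embeddings.
words = ["tree", "forest", "car", "leaf"]
vectors = [[1.0, 0.9], [0.9, 1.0], [0.1, -0.8], [0.8, 0.95]]
print(find_closest_words(words, vectors, "tree", n=2))
```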

For example, we could find the most common words closest to “tree” like this:

closest_to_tree = find_closest_words(most_common_words, vector_list, "tree")

Call find_closest_words() with most_common_words, vector_list, and "food" as arguments and save the result to close_to_food. Print close_to_food to the terminal to see the result.

Which words does the function return? Are you surprised?


Again call find_closest_words() with most_common_words and vector_list as arguments, but this time change the last argument to "summer". Save the result to close_to_summer, and print close_to_summer to the terminal.

Which words does the function return? Any surprises this time around?

Feel free to experiment by calling find_closest_words() with a different target word and seeing what results you get!
