
Word Embeddings

Vectors in NLP

In natural language processing, vectors are essential. A vector is an ordered list of numbers, each of which records the magnitude of some feature of the data being represented.

The dimension, or size, of a vector is the number of values it holds; increasing it allows more information to be stored in the vector. The figure described below shows three one-dimensional vectors.

In Python, you can represent a vector with a NumPy array. The following creates a vector of even numbers from 2 to 10.

import numpy as np
even_nums = np.array([2, 4, 6, 8, 10])
Figure: a number line that starts at 0 and is labeled 'X'. Three vectors lie on it, each beginning at 0. The first is green, labeled 'cat', and ends at 3. The second is orange, labeled 'scrabble', and ends at 8. The third is purple, labeled 'antidisestablishmentarianism', and ends at 28.
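As a small sketch of the figure, the three vectors can be written as NumPy arrays. The values 3, 8, and 28 happen to equal each word's length in letters, so the code below assumes that is what the numbers mean; the names `words`, `vectors`, and `cat_2d` are made up for this illustration.

```python
import numpy as np

words = ["cat", "scrabble", "antidisestablishmentarianism"]

# One-dimensional vectors: a single number per word.
# Assumption: the number is the word's length in letters (3, 8, 28).
vectors = {word: np.array([len(word)]) for word in words}

print(vectors["cat"])        # [3]
print(vectors["cat"].shape)  # (1,)

# A second component makes the vector two-dimensional,
# so it can carry one more piece of information per word.
cat_2d = np.array([3, 1])
print(cat_2d.shape)          # (2,)
```

The `.shape` attribute reports the dimension of each vector, which is the "size" discussed above.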

Word Embeddings

Word embeddings are key to natural language processing. Each embedding is a vector of real numbers that represents a specific word. Information about the contexts in which the word appears is encoded in the vector's values.

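A pre-trained English model can be loaded in Python using the spaCy library, which gives access to embeddings for English words. In spaCy 3 and later the old 'en' shortcut is no longer accepted, so the full package name is needed. en_core_web_md is one English model that ships with word vectors; it has to be downloaded first with python -m spacy download en_core_web_md.

import spacy
nlp = spacy.load('en_core_web_md')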

Call the model with the desired word as an argument and access the .vector attribute:

nlp('peace').vector

The result is a NumPy array of floats, for example (the exact values depend on the model and its version):

[5.2907305, -4.20267, 1.6989858, -1.422668, -1.500128, ...]

Distance Between Vectors

Measuring distances between word embedding vectors allows us to look at the similarities and differences between words. Three common choices are Manhattan, Euclidean, and cosine distance.

For high-dimensional vectors, cosine distance is usually preferred. Manhattan and Euclidean distances measure how far apart the vector endpoints are, so they grow with the vectors' magnitudes and with the number of dimensions. Cosine distance depends only on the angle between the two vectors, so it stays in a fixed range, from 0 (same direction) to 2 (opposite directions).

The Python library SciPy provides a function for each, given two vectors a and b. The values in the comments are example outputs for one particular pair of vectors.

from scipy.spatial.distance import cityblock, euclidean, cosine
manhattan_d = cityblock(a,b) # 129.62
euclidean_d = euclidean(a,b) # 16.45
cosine_d = cosine(a,b) # 0.25
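To make the three definitions concrete, here is a minimal NumPy sketch of the standard formulas. The helper names (`manhattan`, `euclidean_dist`, `cosine_dist`) and the example vectors are made up for this illustration and are not SciPy's API.

```python
import numpy as np

def manhattan(a, b):
    # Sum of the absolute differences of the coordinates.
    return np.sum(np.abs(a - b))

def euclidean_dist(a, b):
    # Straight-line distance between the two endpoints.
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_dist(a, b):
    # One minus the cosine of the angle between the vectors.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice as long

print(manhattan(a, b))       # 6.0
print(euclidean_dist(a, b))  # sqrt(14), about 3.74
print(cosine_dist(a, b))     # approximately 0: the vectors point the same way
```

Because b is just a scaled copy of a, the Manhattan and Euclidean distances are nonzero while the cosine distance is essentially zero, which illustrates why cosine distance ignores magnitude.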

Word Contexts in Vector Space

Words used in similar contexts have similar word embeddings, meaning that they are mapped to similar areas in the vector space.

Words in similar contexts have a small cosine distance, while words in different contexts have a large cosine distance.

The vectors for “cat” and “scrabble” point to similar areas of the vector space, and “antidisestablishmentarianism” points to a very different area.

“cat” and “scrabble” would have a small cosine distance, and “cat” and “antidisestablishmentarianism” would have a large one.

Figure: an example graph of word contexts in vector space.

The x-axis is labeled at 3, 8, and 28. The y-axis is labeled at 1, 2, and 11.

The graph has three vectors, all beginning at the origin, (0, 0). The green vector, labeled 'cat', ends at (3, 1). The orange vector, labeled 'scrabble', ends at (8, 2). The purple vector, labeled 'antidisestablishmentarianism', ends at (28, 11).

Word2Vec Algorithm

Before word embedding vectors can be compared, they have to be created. Word2vec is a statistical learning algorithm that learns embeddings from written text (a corpus). It can use one of two architectures:

  • Continuous Bag of Words: the algorithm goes through each word in the training corpus, in order, and predicts the word at each position from a bag-of-words representation of the surrounding words. The order of the surrounding words does not matter!

  • Continuous Skip-Grams: the algorithm looks at pairs of words separated by up to some specified distance, rather than only at groups of n consecutive words in a text (n-grams). Each word is used to predict the words around it. The order of context is taken into consideration: context words closer to the current word are given more weight.
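The difference between the two architectures is easiest to see in the training examples each one builds from a sentence. The sketch below is a simplified illustration, not the actual word2vec implementation: the function names and the `window` parameter are hypothetical, and real training adds steps such as subsampling and negative sampling.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, target_word) pairs: context predicts target."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        examples.append((left + right, target))
    return examples

def skipgram_examples(tokens, window=2):
    """Return (center_word, context_word) pairs: center predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["the", "cat", "sat", "down"]

print(cbow_examples(tokens, window=1))
# [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat', 'down'], 'sat'), (['sat'], 'down')]

print(skipgram_examples(tokens, window=1))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat'), ('sat', 'down'), ('down', 'sat')]
```

CBOW produces one example per position, with all surrounding words pooled together, while skip-gram produces one example per (word, neighbor) pair.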

A graph of word embedding vectors.

Gensim Embeddings Creation

Instead of using pre-trained word embeddings from spaCy, you can create your own! The Python library gensim allows a word2vec model to be trained on any corpus of text. The embeddings can have any specified dimension and reflect the contexts found in that particular corpus.

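Model arguments:

  • corpus: A list of lists. Each inner list is a document in the corpus, and each element of an inner list is a word token.
  • vector_size: The number of dimensions of the embeddings. This argument was named size before gensim 4.0.
  • The remaining arguments (window, min_count, workers, sg) can be left at the values shown below.

Model attributes (gensim 4.0 and later):

  • .wv.key_to_index – vocabulary of the model, as a mapping from word to index (.wv.vocab before gensim 4.0)
  • .wv.most_similar("cat", topn=10) – 10 most similar words to “cat”
  • .wv.doesnt_match(["mouse", "cat", "europe"]) – identify the word least like the others

import gensim
# vector_size was called size before gensim 4.0; sg=1 selects skip-gram (sg=0 selects CBOW)
model = gensim.models.Word2Vec(corpus, vector_size=100, window=5, min_count=1, workers=2, sg=1)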
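The corpus passed to the model has to be in the list-of-lists shape described above. One simple way to get there from raw strings is sketched below; the example sentences and the `tokenize` helper are made up for illustration, and a real project would more likely use a library tokenizer.

```python
import re

# Hypothetical raw documents, one string each.
documents = [
    "The cat sat on the mat.",
    "A dog chased the cat!",
]

def tokenize(text):
    # Lowercase the text, then keep runs of letters and apostrophes as tokens.
    return re.findall(r"[a-z']+", text.lower())

# One inner list of word tokens per document.
corpus = [tokenize(doc) for doc in documents]

print(corpus)
# [['the', 'cat', 'sat', 'on', 'the', 'mat'], ['a', 'dog', 'chased', 'the', 'cat']]
```

A list built this way can be passed as the first argument to gensim.models.Word2Vec.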