For each sentence, Keras expects a NumPy matrix containing one-hot vectors for each token. What’s a one-hot vector? In a one-hot vector, every token in our set is represented by a 0 except for the current token, which is represented by a 1. For example, given the vocabulary ["the", "dog", "licked", "me"], a one-hot vector for “dog” would look like [0, 1, 0, 0].
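To make this concrete, here’s a minimal sketch of one-hot encoding using the toy vocabulary above (the helper name `one_hot` is ours, not part of the lesson):

```python
import numpy as np

vocabulary = ["the", "dog", "licked", "me"]

def one_hot(token, vocabulary):
    """Return a one-hot vector: 1 at the token's index, 0 everywhere else."""
    vector = np.zeros(len(vocabulary), dtype="float32")
    vector[vocabulary.index(token)] = 1.0
    return vector

print(one_hot("dog", vocabulary))  # [0. 1. 0. 0.]
```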
In order to vectorize our data and later translate it from vectors, it’s helpful to have a features dictionary (and a reverse features dictionary) to easily translate between all the 1s and 0s and actual words. We’ll build out the following:
- a features dictionary for English
- a features dictionary for Spanish
- a reverse features dictionary for English (where the keys and values are swapped)
- a reverse features dictionary for Spanish
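The four dictionaries above can be sketched like this (the variable names and the tiny token set are illustrative; the lesson’s preprocess.py may use different ones):

```python
# A sorted set of tokens for one language (here, a toy English vocabulary):
input_tokens = sorted(["the", "dog", "licked", "me"])

# Features dictionary: token -> index
input_features_dict = {token: i for i, token in enumerate(input_tokens)}

# Reverse features dictionary: index -> token (keys and values swapped)
reverse_input_features_dict = {
    i: token for token, i in input_features_dict.items()}

print(input_features_dict["dog"])      # 0 (first token alphabetically)
print(reverse_input_features_dict[0])  # dog
```

The Spanish dictionaries would be built the same way from the set of target tokens.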
Once we have all of our features dictionaries set up, it’s time to vectorize the data! We’re going to need vectors to input into our encoder and decoder, as well as a vector of target data we can use to train the decoder.
Because each matrix is almost all zeros, we’ll use numpy.zeros() from the NumPy library to build them out.

```python
import numpy as np

encoder_input_data = np.zeros(
    (len(input_docs), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
```
Let’s break this down:
We defined a NumPy matrix of zeros called encoder_input_data with two arguments:
- the shape of the matrix — in our case the number of documents (or sentences) by the maximum token sequence length (the longest sentence we want to see) by the number of unique tokens (or words)
- the data type we want (in our case NumPy’s float32, which can speed up our processing a bit)
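As a quick sanity check, here’s what those two arguments produce with made-up sizes (the numbers below are illustrative, not the lesson’s actual data):

```python
import numpy as np

# Illustrative sizes standing in for the lesson's variables:
num_docs = 2                # len(input_docs)
max_encoder_seq_length = 4  # longest input sentence, in tokens
num_encoder_tokens = 5      # size of the input vocabulary

encoder_input_data = np.zeros(
    (num_docs, max_encoder_seq_length, num_encoder_tokens),
    dtype="float32")

print(encoder_input_data.shape)  # (2, 4, 5)
print(encoder_input_data.dtype)  # float32
```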
Hang on… where did all that code go from the previous exercise? Don’t worry, it’s still there; we just moved it over to preprocess.py (all the necessary variables are imported at the top of script.py) to make some room for the new influx of code!
Take a look at the new code. You’ll see we’ve defined a features dictionary for our input vocabulary called input_features_dict, and a target_features_dict built the same way, but using the set of target tokens instead of input tokens. We’ve also built out reverse_input_features_dict, which just swaps the keys and values of input_features_dict, and reverse_target_features_dict, which reverses the key-value pairs of target_features_dict in the same way.
We’ve already built the encoder_input_data NumPy matrix for you.
Your task is to create the following NumPy matrices with the same arguments as encoder_input_data, except they should use the max sequence length for decoder sentences instead of encoder sentences, and the number of decoder tokens instead of encoder tokens:
- decoder_input_data: a matrix for the data we’ll pass into the decoder
- decoder_target_data: a matrix for the data we expect the decoder to produce
(The two new matrices you create should be identical for now.)
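A sketch of what those two matrices look like, with illustrative sizes (in the lesson, the lengths and token counts come from preprocess.py):

```python
import numpy as np

# Illustrative sizes standing in for the lesson's variables:
num_docs = 2                # len(input_docs)
max_decoder_seq_length = 6  # longest target sentence, in tokens
num_decoder_tokens = 7      # size of the target vocabulary

decoder_input_data = np.zeros(
    (num_docs, max_decoder_seq_length, num_decoder_tokens),
    dtype="float32")
decoder_target_data = np.zeros(
    (num_docs, max_decoder_seq_length, num_decoder_tokens),
    dtype="float32")

# Identical for now: both are all zeros with the same shape.
print(np.array_equal(decoder_input_data, decoder_target_data))  # True
```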