At this point we need to fill out the 1s in each vector. We can loop over each English-Spanish pair in our training sample, using the feature dictionaries to add a 1 for the token in question. For example, the dog sentence (["the", "dog", "licked", "me"]) would be split into the following matrix of vectors:
```python
[
  [1, 0, 0, 0],  # timestep 0 => "the"
  [0, 1, 0, 0],  # timestep 1 => "dog"
  [0, 0, 1, 0],  # timestep 2 => "licked"
  [0, 0, 0, 1]   # timestep 3 => "me"
]
```
You’ll notice the vectors have timesteps — we use these to track where in a given document (sentence) we are.
To build out a three-dimensional NumPy matrix of one-hot vectors, we can assign a value of 1 for a given word at a given timestep in a given line:
matrix_name[line, timestep, features_dict[token]] = 1.
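As a minimal sketch of that assignment, here is the dog sentence filled into a one-line matrix. The names `features_dict` and `matrix` are placeholders standing in for whatever dictionary and array you have already built:

```python
import numpy as np

# Hypothetical feature dictionary mapping each token to a feature index
features_dict = {"the": 0, "dog": 1, "licked": 2, "me": 3}

sentence = ["the", "dog", "licked", "me"]

# One line, four timesteps, four features (the vocabulary size)
matrix = np.zeros((1, len(sentence), len(features_dict)))

line = 0
for timestep, token in enumerate(sentence):
    # Flip on the 1 for this token at this timestep in this line
    matrix[line, timestep, features_dict[token]] = 1.
```

Each row of `matrix[0]` is now the one-hot vector for one timestep, matching the matrix shown above.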
Keras will fit — or train — the seq2seq model using these matrices of one-hot vectors:
- the encoder input data
- the decoder input data
- the decoder target data
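Before any loop runs, these three matrices are typically pre-allocated as zeros. A sketch, with hypothetical sizes standing in for the counts your training data would give you:

```python
import numpy as np

# Hypothetical sizes -- in practice these come from your training data
num_lines = 10             # number of sentence pairs
max_encoder_timesteps = 8  # longest English sentence
max_decoder_timesteps = 12 # longest Spanish sentence
num_encoder_features = 50  # English vocabulary size
num_decoder_features = 60  # Spanish vocabulary size

encoder_input_data = np.zeros(
    (num_lines, max_encoder_timesteps, num_encoder_features), dtype="float32")
decoder_input_data = np.zeros(
    (num_lines, max_decoder_timesteps, num_decoder_features), dtype="float32")
decoder_target_data = np.zeros(
    (num_lines, max_decoder_timesteps, num_decoder_features), dtype="float32")
```

Each matrix is three-dimensional: lines by timesteps by features, which is why a single 1 needs all three indices to place it.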
Hang on a second, why build two matrices of decoder data? Aren’t we just encoding and decoding?
The reason has to do with a technique known as teacher forcing, which most seq2seq models employ during training. The idea: at each timestep, we feed the decoder the Spanish target token from the previous timestep as input, which helps the model learn to predict the current timestep's target token.
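In other words, the decoder target sequence is just the decoder input sequence shifted one timestep ahead. A toy illustration using token lists rather than one-hot vectors (the Spanish tokens here are hypothetical examples):

```python
decoder_input = ["<START>", "el", "perro", "me", "lamió", "<END>"]

# The target at timestep t is the input token at timestep t + 1
decoder_target = decoder_input[1:]
```

So when the decoder sees `"<START>"` as input at timestep 0, its target is `"el"`; when it sees `"el"` at timestep 1, its target is `"perro"`, and so on.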
Instructions

1. Inside the first nested `for` loop, assign `1.` for the current `line`, `timestep`, and `token` in `encoder_input_data`.

2. Inside the second nested `for` loop, assign `1.` for the current `line`, `timestep`, and `token` in `decoder_input_data`.

3. Inside the second nested `for` loop, assign `1.` for the current `line`, the previous timestep (at `timestep - 1`), and `token` in `decoder_target_data`, if `timestep` is greater than `0`.
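Putting the three instructions together, the fill loops might look like the following sketch. The setup names (`training_set`, `english_features_dict`, `spanish_features_dict`) are assumptions standing in for whatever the lesson defined earlier, shown here so the example is self-contained:

```python
import numpy as np

# Hypothetical setup standing in for earlier lesson code
training_set = [
    (["the", "dog", "licked", "me"],
     ["<START>", "el", "perro", "me", "lamió", "<END>"]),
]
english_features_dict = {"the": 0, "dog": 1, "licked": 2, "me": 3}
spanish_features_dict = {
    "<START>": 0, "el": 1, "perro": 2, "me": 3, "lamió": 4, "<END>": 5}

max_encoder_timesteps = max(len(eng) for eng, spa in training_set)
max_decoder_timesteps = max(len(spa) for eng, spa in training_set)

encoder_input_data = np.zeros(
    (len(training_set), max_encoder_timesteps, len(english_features_dict)))
decoder_input_data = np.zeros(
    (len(training_set), max_decoder_timesteps, len(spanish_features_dict)))
decoder_target_data = np.zeros(
    (len(training_set), max_decoder_timesteps, len(spanish_features_dict)))

for line, (english_sentence, spanish_sentence) in enumerate(training_set):
    # First nested loop: one-hot encode the English (encoder) tokens
    for timestep, token in enumerate(english_sentence):
        encoder_input_data[line, timestep, english_features_dict[token]] = 1.
    # Second nested loop: decoder input, plus the target shifted back one step
    for timestep, token in enumerate(spanish_sentence):
        decoder_input_data[line, timestep, spanish_features_dict[token]] = 1.
        if timestep > 0:
            # Teacher forcing: the token at this timestep is the *target*
            # for the previous timestep
            decoder_target_data[
                line, timestep - 1, spanish_features_dict[token]] = 1.
```

Note the `timestep > 0` guard: the `"<START>"` token is decoder input only, never a target, so the target matrix ends up one timestep "ahead" of the input matrix.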