If you’re feeling a bit nervous about building this all on your own, never fear. You don’t need to start from scratch — there are a few neural network libraries at your disposal. In our case, we’ll be using TensorFlow with the Keras API to build a pretty limited English-to-Spanish translator (we’ll explain this later and you’ll get an opportunity to improve it).
We can import Keras from TensorFlow like this:
from tensorflow import keras
Also, do not worry about memorizing anything we cover here. The purpose of this lesson is for you to make sense of what each part of the code does and how you can modify it to suit your own needs. In fact, the code we’ll be using is mostly derived from Keras’s own tutorial on the seq2seq model.
First things first: preprocessing the text data. How much noise removal you do depends on your use case: do you care about casing or punctuation? For many translation tasks, they aren't important enough to justify the extra processing, and preprocessing is the time to strip them out.
We’ll need the following for our Keras implementation:
- vocabulary sets for both our input (English) and target (Spanish) data
- the total number of unique word tokens we have for each set
- the maximum sentence length we’re using for each language
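The three ingredients above can be sketched as follows. This is a minimal, illustrative version using whitespace tokenization and a toy list of sentence pairs in place of the real data file; the pair list and the `max_*_seq_length` names are assumptions for the sketch, while `num_encoder_tokens` and `num_decoder_tokens` match the names we'll use later.

```python
# Toy sentence pairs standing in for the real English-Spanish data file.
pairs = [
    ("I am happy.", "Estoy feliz."),
    ("You are tired.", "Estás cansado."),
]

# Vocabulary sets for the input (English) and target (Spanish) data:
input_tokens, target_tokens = set(), set()
# Maximum sentence length (in tokens) seen for each language:
max_encoder_seq_length = 0
max_decoder_seq_length = 0

for english, spanish in pairs:
    eng_words = english.split()
    spa_words = spanish.split()
    input_tokens.update(eng_words)
    target_tokens.update(spa_words)
    max_encoder_seq_length = max(max_encoder_seq_length, len(eng_words))
    max_decoder_seq_length = max(max_decoder_seq_length, len(spa_words))

# Total number of unique word tokens for each set:
num_encoder_tokens = len(input_tokens)
num_decoder_tokens = len(target_tokens)
```

Real preprocessing would read the pairs from the data file and use a proper tokenizer, but the bookkeeping is the same.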
We also need to mark the start and end of each document (sentence) in the target samples so that the model recognizes where to begin and end its text generation (no book-long sentences for us!). One way to do this is to add "<START>" at the beginning and "<END>" at the end of each target document (in our case, our Spanish sentences). For example, "Estoy feliz." becomes "<START> Estoy feliz. <END>".
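In code, that marking is just string concatenation; here is the idea on a single sentence (the variable name `target_doc` follows the naming used in this lesson):

```python
# Wrap one target sentence with start/end markers via concatenation.
target_doc = "Estoy feliz."
target_doc = "<START> " + target_doc + " <END>"
print(target_doc)  # <START> Estoy feliz. <END>
```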
Before you dig into the instructions, read through the existing code in script.py and try to make sense of each line.
If you take a look at span-eng.txt, you’ll see that we’re working with a very tiny data set right now, which will make this a very terrible translator indeed. This is because we don’t want codecademy.com to crash on you! When you build your own translator later, you’ll be using a much larger data set, which will require a great deal more time to process everything.
Use string concatenation to reassign each target_doc to the value of target_doc surrounded by "<START> " and " <END>".
Then extend the for loop by adding each token to the corresponding (input or target) tokens set if it hasn’t already been added.
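A sketch of what that loop body might look like, assuming input_docs and target_docs are the preprocessed sentence lists (the toy data here is illustrative):

```python
# Toy preprocessed documents standing in for the real data.
input_docs = ["I am happy."]
target_docs = ["<START> Estoy feliz. <END>"]

input_tokens, target_tokens = set(), set()

for input_doc, target_doc in zip(input_docs, target_docs):
    # Add each token to the corresponding set if not already present.
    for token in input_doc.split():
        if token not in input_tokens:
            input_tokens.add(token)
    for token in target_doc.split():
        if token not in target_tokens:
            target_tokens.add(token)
```

The membership check mirrors the instruction's wording; since sets ignore duplicates anyway, a plain `add()` without the check would behave the same.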
Create two new variables:
- num_encoder_tokens: the length of the input tokens set
- num_decoder_tokens: the length of the target tokens set
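Both are one-liners with `len()`; the token sets below are toy placeholders standing in for the sets built by the loop:

```python
# Placeholder vocabulary sets; in the lesson these come from the for loop.
input_tokens = {"I", "am", "happy."}
target_tokens = {"<START>", "Estoy", "feliz.", "<END>"}

num_encoder_tokens = len(input_tokens)   # unique tokens in the input vocabulary
num_decoder_tokens = len(target_tokens)  # unique tokens in the target vocabulary
```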