Now that you know what a bag-of-words vector looks like, you can create a function that builds them!
First, we need a way of generating a features dictionary from a list of training documents. We can build a Python function to do that for us…
Define a function `create_features_dictionary()` that takes one argument, `documents`. This will be the list of string documents that we pass in (like `["All the cool fish love to fly high.", "Nobody knows why the fish fly so high.", "Those cool fish sure are spry."]`).
Inside the function, set `features_dictionary` equal to an empty dictionary. This is where we'll map all of our terms to index numbers. For now, return `features_dictionary` from the function.
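At this point, a minimal sketch of the function might look like this (the name and signature follow the instructions above; everything else is still a stub):

```python
def create_features_dictionary(documents):
    # This dictionary will map each unique term to a vector index.
    features_dictionary = {}
    return features_dictionary
```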
Above the return statement, merge the `documents` into a single string joined together by spaces and assign the result to `merged`.

Now that the documents are all in a single string, tokenize `merged` and assign the result to `tokens`. Return `tokens` from the function in addition to `features_dictionary`.
Above the return statement, assign `index` a value of `0`. This will correspond to the first word's vector index.
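After these steps, the function might look like the sketch below. A plain whitespace `split()` stands in here for whatever tokenizer your lesson provides, so your tokens may differ slightly (for example, punctuation handling):

```python
def create_features_dictionary(documents):
    features_dictionary = {}
    # Join every document into one space-separated string.
    merged = " ".join(documents)
    # Tokenize the merged string; a simple whitespace split is assumed here.
    tokens = merged.split()
    # index will be the vector position assigned to the next new word.
    index = 0
    return features_dictionary, tokens
```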
The words are prepared, the empty dictionary is prepared, and we have an index number we can use; it’s time to get the words into the dictionary and link each to a vector index number!
- Above the `return`, loop through each `token` in `tokens`.
- In the loop, check if `token` is NOT in `features_dictionary`.
- If it's a new word, add `token` as a key to `features_dictionary` with a value of `index`, then increment `index` by 1 so that each new word has its own index.
Uncomment the print statement to test out the function!
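Putting all the steps together, a complete sketch could look like this. The whitespace `split()` again stands in for the lesson's tokenizer, and the final `print` is an assumed example of the one you'd uncomment:

```python
def create_features_dictionary(documents):
    features_dictionary = {}
    # Merge all documents into one space-separated string.
    merged = " ".join(documents)
    # Tokenize; a plain whitespace split is assumed here.
    tokens = merged.split()
    index = 0
    for token in tokens:
        if token not in features_dictionary:
            # New word: record its vector index, then advance the counter.
            features_dictionary[token] = index
            index += 1
    return features_dictionary, tokens

training_documents = ["All the cool fish love to fly high.",
                      "Nobody knows why the fish fly so high.",
                      "Those cool fish sure are spry."]
print(create_features_dictionary(training_documents)[0])
```

Each word keeps the index it was given the first time it appeared, so repeated words (like `fish`) never get a second entry.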