One of the most common ways to implement the BoW model in Python is as a dictionary with each key set to a word and each value set to the number of times that word appears. Take the example below:
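The original example isn't reproduced here, but a minimal sketch of the idea (with a sentence of our own choosing) might look like this:

```python
# A hypothetical example: count how often each word appears in a sentence
sentence = "the quick brown fox jumps over the lazy dog the fox"

bow = {}
for token in sentence.split():
    # dict.get returns the current count, or 0 if the word is new
    bow[token] = bow.get(token, 0) + 1

print(bow)
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'jumps': 1, 'over': 1, 'lazy': 1, 'dog': 1}
```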
The words from the sentence go into the bag-of-words and come out as a dictionary of words with their corresponding counts. For statistical models, we call the text that we use to build the model our training data. Usually, we need to prepare our text data by breaking it up into
documents (shorter strings of text, generally sentences).
Let’s build a function that converts a given training text into a bag-of-words!
Define a function text_to_bow() that accepts some_text as a parameter. Inside the function, set bow_dictionary equal to an empty dictionary and return it from the function. This is where we’ll be collecting the words and their counts.
Above the return statement, call the preprocess_text() function we created for you on some_text and assign the result to the variable tokens. Text preprocessing allows us to count words like “game” and “Games” as the same word token.
Still above the return, iterate over each token in tokens and check if the token is already in bow_dictionary.
- If it is, increment that token’s count by 1. (Remember that each token’s count is its corresponding value within bow_dictionary.)
- Otherwise, set the count equal to 1, because this is the first time the model has seen that word token.
Uncomment the print statement and run the code to see your bag-of-words function in action!
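Putting the steps above together, a sketch of the finished function might look like the following. Note that preprocess_text() here is a simplified stand-in of our own (lowercasing and splitting into word tokens); the version provided by the lesson may do more, such as stemming:

```python
import re

def preprocess_text(some_text):
    # Simplified stand-in for the lesson's preprocess_text():
    # lowercase the text and pull out word tokens, dropping punctuation
    return re.findall(r"[a-z']+", some_text.lower())

def text_to_bow(some_text):
    # This is where we'll be collecting the words and their counts
    bow_dictionary = {}
    tokens = preprocess_text(some_text)
    for token in tokens:
        if token in bow_dictionary:
            # Seen before: increment that token's count by 1
            bow_dictionary[token] += 1
        else:
            # First time the model has seen this word token
            bow_dictionary[token] = 1
    return bow_dictionary

print(text_to_bow("Game over. The game is over."))
# {'game': 2, 'over': 2, 'the': 1, 'is': 1}
```

Because the stand-in lowercases the text, “Game” and “game” are counted as the same token here.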