Bag-of-words (BoW) is a statistical language model based on word count. Say what?
Let’s start with that first part: a statistical language model is a way for computers to make sense of language based on probability. For example, let’s say we have the text:
“Five fantastic fish flew off to find faraway functions. Maybe find another five fantastic fish?”
A statistical language model focused on the starting letter for words might take this text and predict that words are most likely to start with the letter “f” because 11 out of 15 words begin that way. A different statistical model that pays attention to word order might tell us that the word “fish” tends to follow the word “fantastic.”
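To make those two toy models concrete, here is a minimal Python sketch. The tokenization (lowercasing and keeping only runs of letters) and the variable names are assumptions made for illustration, not part of any particular library.

```python
import re
from collections import Counter

text = ("Five fantastic fish flew off to find faraway functions. "
        "Maybe find another five fantastic fish?")

# Assumed tokenization: lowercase, then keep runs of letters as words.
words = re.findall(r"[a-z]+", text.lower())

# Toy model 1: how often does each starting letter occur?
first_letters = Counter(word[0] for word in words)
f_share = first_letters["f"] / len(words)

# Toy model 2: which words come right after "fantastic"?
followers = Counter(
    nxt for prev, nxt in zip(words, words[1:]) if prev == "fantastic"
)

print(f"{first_letters['f']} of {len(words)} words start with 'f'")
print("Most common word after 'fantastic':", followers.most_common(1))
```

The first count looks only at starting letters and the second only at neighboring words; neither one is what bag-of-words measures.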
Bag-of-words does not give a flying fish about word starts or word order though; its sole concern is word count — how many times each word appears in a document.
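As a rough sketch of what that looks like in code, the snippet below builds a bag of words with Python's `collections.Counter`. The helper name `bag_of_words` and the simple letters-only tokenization are assumptions for this example; real tokenizers handle apostrophes, numbers, and other details this one ignores.

```python
import re
from collections import Counter

def bag_of_words(document):
    """Map each word in the document to the number of times it appears."""
    # Assumed tokenization: lowercase, then keep runs of letters as words.
    words = re.findall(r"[a-z]+", document.lower())
    return Counter(words)

text = ("Five fantastic fish flew off to find faraway functions. "
        "Maybe find another five fantastic fish?")

bow = bag_of_words(text)
print(bow["fish"])   # "fish" appears twice
print(bow["flew"])   # "flew" appears once

# Word order is ignored: shuffled text gives the same bag.
print(bag_of_words("fantastic fish") == bag_of_words("fish fantastic"))
```

Because the result is just a mapping from words to counts, two documents with the same words in a different order produce identical bags.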
If you’re already familiar with statistical language models, you may also have heard BoW referred to as the unigram model. It’s technically a special case of another statistical model, the n-gram model, with n (the number of words in a sequence) set to 1.
If you have no idea what n-grams are, don’t worry — we’ll dive deeper into them in another lesson.
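For the curious, here is a small preview sketch of how the unigram case relates to plain word counts. The function name `ngram_counts` and the tokenization are assumptions for illustration only.

```python
import re
from collections import Counter

def ngram_counts(words, n):
    """Count every run of n consecutive words, stored as tuples."""
    return Counter(
        tuple(words[i:i + n]) for i in range(len(words) - n + 1)
    )

text = ("Five fantastic fish flew off to find faraway functions. "
        "Maybe find another five fantastic fish?")
# Assumed tokenization: lowercase, then keep runs of letters as words.
words = re.findall(r"[a-z]+", text.lower())

unigrams = ngram_counts(words, 1)  # sequences of one word
bigrams = ngram_counts(words, 2)   # sequences of two words

# With n = 1, the counts match a plain word count of the same text.
same_as_word_count = {k[0]: v for k, v in unigrams.items()} == Counter(words)
print(same_as_word_count)
print(bigrams[("fantastic", "fish")])
```

With n set to 1 every "sequence" is a single word, so the counts carry no information about order; larger values of n start to capture which words appear next to each other.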
Try out a few word combinations in the applet to see how the bag-of-words model works. What happens if you add the same word a few times?