Stopwords are words that we remove during preprocessing when we don’t care about sentence structure. They are usually the most common words in a language and carry little information about the topic or tone of a statement. They include words such as “a”, “an”, and “the”.
NLTK provides a built-in list of these words as its stopwords corpus (if the corpus is not installed yet, run nltk.download('stopwords') once). You can import them using the following statements:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
We store the stop words in a set because checking whether a word is in a set is fast, and we will run that check once for every token below.
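As a rough illustration of why a set is convenient here, the sketch below builds a set from a short stand-in list and runs two membership checks. The names sample_stopwords_list and sample_stopwords_set and the six words are assumptions made for this example, not NLTK's actual list.

```python
# Stand-in stopword list for illustration only (an assumption, not NLTK's full list).
sample_stopwords_list = ["a", "an", "the", "in", "it", "was"]

# Converting to a set removes duplicates and gives hash-based lookups,
# so a membership check does not have to scan every element.
sample_stopwords_set = set(sample_stopwords_list)

is_stopword = "the" in sample_stopwords_set          # "the" is in the set
is_content_word = "network" in sample_stopwords_set  # "network" is not

print(is_stopword, is_content_word)  # True False
```

The same `in` check works on a list, but a list is searched element by element, which adds up when every token in a long text is checked.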
Now that we have the words saved to stop_words, we can use tokenization and a list comprehension to remove them from a sentence:
from nltk.tokenize import word_tokenize

nbc_statement = "NBC was founded in 1926 making it the oldest major broadcast network in the USA"
word_tokens = word_tokenize(nbc_statement)  # tokenize nbc_statement
# keep only the tokens that are not stop words
statement_no_stop = [word for word in word_tokens if word not in stop_words]
print(statement_no_stop)
# ['NBC', 'founded', '1926', 'making', 'oldest', 'major', 'broadcast', 'network', 'USA']
In this code, we first tokenized our string, nbc_statement, then used a list comprehension to return a list with all of the stopwords removed.
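One detail worth keeping in mind is that the membership check is case-sensitive, and NLTK's English stopword entries are lowercase, so a capitalized "The" at the start of a sentence would pass through the filter. The sketch below shows one possible way to handle this by lowercasing each token before the check. It uses a small stand-in set and plain whitespace splitting so it runs without NLTK; sample_stop_words and remove_stop_words are hypothetical names introduced for this example.

```python
# Stand-in stopword set for illustration only (an assumption, not NLTK's list).
sample_stop_words = {"the", "is", "in", "a", "it", "was"}


def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [token for token in tokens if token.lower() not in stop_words]


# Naive whitespace tokenization, used here only to keep the example self-contained.
tokens = "The network is based in New York".split()

# Case-sensitive check: the capitalized "The" is not recognized as a stop word.
case_sensitive = [token for token in tokens if token not in sample_stop_words]

# Case-insensitive check: tokens are compared in lowercase but returned unchanged.
case_insensitive = remove_stop_words(tokens, sample_stop_words)

print(case_sensitive)    # ['The', 'network', 'based', 'New', 'York']
print(case_insensitive)  # ['network', 'based', 'New', 'York']
```

Whether to lowercase is a design choice: it catches sentence-initial stop words, but the comparison no longer distinguishes an acronym such as "IT" from the word "it".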
At the top of your script, import stopwords from NLTK. Save all English stopwords, as a set, to a variable called
Tokenize the text in survey_text and save the result to
Remove stop words from tokenized_survey and save the result to