Stopword Removal

Stopwords are words that we remove during preprocessing when we don’t care about sentence structure. They are usually the most common words in a language and provide little information about the tone of a statement. Examples include “a”, “an”, and “the”.

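NLTK provides a built-in corpus of these words. If the corpus is not yet installed locally, nltk.download('stopwords') fetches it. You can then load the English stopwords using the following statements: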

    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))

We store the stopwords in a set because checking whether a word is in a set is fast, and we will run that check once for every token in the list comprehension below.
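To illustrate the membership check on its own, here is a minimal sketch that does not use NLTK. The names sample_stop_words and stop_word_set are hypothetical, and the short hand-written list only stands in for NLTK's much longer English list.

```python
# Hypothetical miniature stopword list; NLTK's English list is much longer.
sample_stop_words = ["a", "an", "the", "in", "was", "it"]

# Converting the list to a set keeps the same words and turns each
# `in` check into a hash lookup instead of a scan through the list.
stop_word_set = set(sample_stop_words)

print("the" in stop_word_set)      # True
print("network" in stop_word_set)  # False
```

For a handful of words the difference is negligible, but it can matter once the check runs for every token of a long document.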

Now that we have the words saved to stop_words, we can use tokenization and a list comprehension to remove them from a sentence:

    from nltk.tokenize import word_tokenize

    nbc_statement = "NBC was founded in 1926 making it the oldest major broadcast network in the USA"
    word_tokens = word_tokenize(nbc_statement)  # tokenize nbc_statement
    # keep only the tokens that are not in the stop_words set
    statement_no_stop = [word for word in word_tokens if word not in stop_words]

    print(statement_no_stop)
    # ['NBC', 'founded', '1926', 'making', 'oldest', 'major', 'broadcast', 'network', 'USA']

In this code, we first tokenized our string, nbc_statement, then used a list comprehension to return a list with all of the stopwords removed.
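One detail worth keeping in mind: the check in the list comprehension is case-sensitive, and NLTK's English stopwords are all lowercase, so a capitalized token such as "The" would not be removed. The sketch below shows one possible way to handle that by lowercasing each token only for the comparison. It is an illustration under assumptions, not part of the lesson's code: the helper remove_stopwords and the set SAMPLE_STOP_WORDS are hypothetical names, and str.split() stands in for word_tokenize so the example runs without NLTK.

```python
# Hypothetical, hand-written stopword set standing in for NLTK's English list.
SAMPLE_STOP_WORDS = {"a", "an", "the", "in", "it", "was"}


def remove_stopwords(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [token for token in tokens if token.lower() not in stop_words]


# str.split() is a simple whitespace tokenizer used here in place of word_tokenize.
tokens = "The network was founded in 1926".split()
filtered = remove_stopwords(tokens, SAMPLE_STOP_WORDS)
print(filtered)  # ['network', 'founded', '1926']
```

The original casing of the kept tokens is preserved, since only the comparison uses the lowercase form.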

Instructions

1. At the top of your script, import stopwords from NLTK. Save all English stopwords, as a set, to a variable called stop_words.

2. Tokenize the text in survey_text and save the result to tokenized_survey.

3. Remove stop words from tokenized_survey and save the result to text_no_stops.
