In natural language processing, stemming is the text preprocessing normalization task concerned with bluntly removing word affixes (prefixes and suffixes). For example, stemming would cast the word “going” to “go”. This is a common technique used by search engines to improve matching between a user’s query and relevant documents.
NLTK has a built-in stemmer called PorterStemmer. You can use it with a list comprehension to stem each word in a tokenized list of words.
First, you must import and initialize the stemmer:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
Now that we have our stemmer, we can apply it to each word in a list using a list comprehension:
tokenized = ['NBC', 'was', 'founded', 'in', '1926', '.', 'This', 'makes', 'NBC', 'the', 'oldest', 'major', 'broadcast', 'network', '.']

stemmed = [stemmer.stem(token) for token in tokenized]
print(stemmed)
# ['nbc', 'wa', 'found', 'in', '1926', '.', 'thi', 'make', 'nbc', 'the', 'oldest', 'major', 'broadcast', 'network', '.']
Notice that words like ‘was’ and ‘founded’ became ‘wa’ and ‘found’, respectively. Reducing words to a common stem like this is useful for many language processing applications, since it lets different forms of the same word match one another. However, you need to be careful when stemming strings: because the Porter algorithm strips suffixes blindly, words are often reduced to something unrecognizable, like ‘wa’.
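To see both sides of this trade-off, here is a small sketch (the word lists are my own examples) showing how the Porter stemmer usefully collapses related words to one stem, but also chops common words into non-words:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Related words collapse onto a single shared stem,
# which helps matching even though the stem is not a real word:
related = ['universe', 'university', 'universal']
print([stemmer.stem(word) for word in related])
# → ['univers', 'univers', 'univers']

# But ordinary words can be reduced to something unrecognizable:
common = ['was', 'this', 'organization']
print([stemmer.stem(word) for word in common])
# → ['wa', 'thi', 'organ']
```

Note that ‘universe’ and ‘university’ end up with the same stem even though they mean different things, which is exactly the kind of over-stemming you need to watch for.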
1. At the top of script.py, import PorterStemmer, then initialize an instance of it and save the object to a variable called stemmer.
2. Tokenize populated_island and save the result to island_tokenized.
3. Use a list comprehension to stem each word in island_tokenized. Save the result to a variable called stemmed.
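A sketch of what the finished script.py might look like. The populated_island text here is a placeholder I invented, and I use a plain whitespace split in place of whatever tokenizer the exercise supplies, so the sketch runs without any downloaded tokenizer data:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Placeholder text; the real exercise provides its own populated_island string.
populated_island = "Java is an Indonesian island home to many people"

# A simple whitespace split stands in for the exercise's tokenizer.
island_tokenized = populated_island.split()

# Stem each token with a list comprehension.
stemmed = [stemmer.stem(token) for token in island_tokenized]
print(stemmed)
```

As in the NBC example above, the stems are lowercased and some of them (like ‘peopl’) are no longer real words.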