Lemmatization is a method for casting words to their root forms, called lemmas. It is a more involved process than stemming because, to choose the correct lemma, it needs to know each word's part of speech; this extra lookup also makes it slower than stemming.
In the next exercise, we will consider how to tag each word with a part of speech. In the meantime, let’s see how to use NLTK’s lemmatize operation.
We can use NLTK’s WordNetLemmatizer to lemmatize text:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
Once we have the lemmatizer initialized, we can use a list comprehension to apply the lemmatize operation to each word in a list:
tokenized = ["NBC", "was", "founded", "in", "1926"]

lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]

print(lemmatized)
# ["NBC", "wa", "founded", "in", "1926"]
The result, saved to lemmatized, shows that 'was' was reduced to 'wa', while the rest of the words remain the same. Not too useful. This happened because lemmatize() treats every word as a noun by default. To take advantage of the power of lemmatization, we need to tag each word in our text with its most likely part of speech. We’ll do that in the next exercise.
At the top of script.py, import WordNetLemmatizer, then initialize an instance of it and save the result to lemmatizer.
Tokenize the string saved to populated_island. Save the result to tokenized_string.
Use a list comprehension to lemmatize every word in tokenized_string. Save the result to a new variable.