While it is useful to match and search for patterns of individual characters in a text, you can often find more meaning by analyzing text on a word-by-word basis, focusing on the part of speech of each word in a sentence. This process of identifying and labeling the part of speech of words is known as part-of-speech tagging!
It may have been a while since you’ve been in English class, so let’s review the nine parts of speech with an example:
Wow! Ramona and her class are happily studying the new textbook she has on NLP.
- Noun: the name of a person (Ramona, class), place, thing (textbook), or idea (NLP)
- Pronoun: a word used in place of a noun (her, she)
- Determiner: a word that introduces, or “determines”, a noun (the)
- Verb: expresses action (studying) or being (are, has)
- Adjective: modifies or describes a noun or pronoun (new)
- Adverb: modifies or describes a verb, an adjective, or another adverb (happily)
- Preposition: a word placed before a noun or pronoun to form a phrase modifying another word in the sentence (on)
- Conjunction: a word that joins words, phrases, or clauses (and)
- Interjection: a word used to express emotion (Wow)
You can automate the part-of-speech tagging process with nltk’s pos_tag() function! The function takes one required argument, a list of words in the order they appear in a sentence, and returns a list of tuples, where the first entry in each tuple is a word and the second is its part-of-speech tag.
Given the sentence split into a list of words below:
word_sentence = ['do', 'you', 'suppose', 'oz', 'could', 'give', 'me', 'a', 'heart', '?']
you can tag the parts of speech as follows:
from nltk import pos_tag
part_of_speech_tagged_sentence = pos_tag(word_sentence)
The call to pos_tag() will return the following:
[('do', 'VB'), ('you', 'PRP'), ('suppose', 'VB'), ('oz', 'NNS'), ('could', 'MD'), ('give', 'VB'), ('me', 'PRP'), ('a', 'DT'), ('heart', 'NN'), ('?', '.')]
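Because the result is an ordinary list of (word, tag) tuples, it can be processed with plain Python. The following is a minimal sketch that reuses the output above as a literal, so it runs without nltk installed. The names words and tags are illustrative and not part of the lesson.

```python
# Output of pos_tag() copied from the example above
part_of_speech_tagged_sentence = [
    ('do', 'VB'), ('you', 'PRP'), ('suppose', 'VB'), ('oz', 'NNS'),
    ('could', 'MD'), ('give', 'VB'), ('me', 'PRP'), ('a', 'DT'),
    ('heart', 'NN'), ('?', '.'),
]

# Each tuple unpacks into a word and its tag
for word, tag in part_of_speech_tagged_sentence:
    print(word, tag)

# Split the pairs into two parallel lists (illustrative names)
words = [word for word, _ in part_of_speech_tagged_sentence]
tags = [tag for _, tag in part_of_speech_tagged_sentence]
```

Keeping the word and tag together in one tuple means the order of the sentence is preserved, so position i in words always lines up with position i in tags.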
Abbreviations are given instead of the full part-of-speech name. Some common abbreviations include: NN for nouns, VB for verbs, RB for adverbs, JJ for adjectives, and DT for determiners. These abbreviations come from the Penn Treebank tag set, which documents the complete list of part-of-speech tags.
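To make the abbreviations easier to read, you could translate them back into names. The sketch below covers only the five abbreviations listed above. It assumes that related tags share the same two-letter prefix (for example, NNS for plural nouns), which holds for these five families but not for every tag. TAG_NAMES and describe_tag are hypothetical helpers, not part of nltk.

```python
# Hypothetical lookup table covering only the abbreviations named in the text
TAG_NAMES = {
    'NN': 'noun',
    'VB': 'verb',
    'RB': 'adverb',
    'JJ': 'adjective',
    'DT': 'determiner',
}

def describe_tag(tag):
    """Return a readable name based on the tag's first two characters."""
    return TAG_NAMES.get(tag[:2], 'other')

# Tagged sentence copied from the example output above
tagged_sentence = [
    ('do', 'VB'), ('you', 'PRP'), ('suppose', 'VB'), ('oz', 'NNS'),
    ('could', 'MD'), ('give', 'VB'), ('me', 'PRP'), ('a', 'DT'),
    ('heart', 'NN'), ('?', '.'),
]

described = [(word, describe_tag(tag)) for word, tag in tagged_sentence]
print(described)
```

Tags outside the table, such as PRP, MD, and the punctuation tag, fall through to 'other' rather than raising an error.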
Instructions
Provided to you in the workspace is the text of The Wonderful Wizard of Oz, broken down into individual words on a sentence-by-sentence basis in a process known as tokenization. The resulting word-tokenized sentences are stored in word_tokenized_oz.
Save the value stored at index 100 of word_tokenized_oz to a variable named witches_fate, and print it. You should see a sentence from the novel, split into individual words, printed to the terminal.
Since the text has been broken down into individual words at the sentence level, you can now part-of-speech tag each word-tokenized sentence in The Wonderful Wizard of Oz! Begin by creating an empty list named pos_tagged_oz to hold the part-of-speech tagged sentences.
Create a for-loop that iterates over each word-tokenized sentence in word_tokenized_oz. Within the for-loop, part-of-speech tag each word-tokenized sentence and append the result to pos_tagged_oz.
Save the part-of-speech tagged sentence at index 100 of pos_tagged_oz to a variable named witches_fate_pos, and print it.
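The loop described in these steps follows a common pattern: start with an empty list, tag one sentence at a time, and append each result. The sketch below shows that pattern in isolation. To keep it runnable without downloading nltk's tagger model, it passes in toy_tagger, a hypothetical stand-in that labels every token NN. The helper tag_sentences and the second sample sentence are also made up for illustration.

```python
def tag_sentences(tokenized_sentences, tagger):
    """Apply tagger to every word-tokenized sentence and collect the results."""
    pos_tagged = []
    for sentence in tokenized_sentences:
        pos_tagged.append(tagger(sentence))
    return pos_tagged

def toy_tagger(words):
    # Hypothetical stand-in for a real tagger: labels every token 'NN'
    return [(word, 'NN') for word in words]

# Small stand-in for word_tokenized_oz; the second sentence is invented
sample_sentences = [
    ['do', 'you', 'suppose', 'oz', 'could', 'give', 'me', 'a', 'heart', '?'],
    ['toto', 'barked'],
]

pos_tagged_sample = tag_sentences(sample_sentences, toy_tagger)

# Indexing the result retrieves one tagged sentence, as in the final step
print(pos_tagged_sample[1])
```

In the lesson workspace, the same pattern would presumably use nltk's real tagger, for example tag_sentences(word_tokenized_oz, pos_tag) after from nltk import pos_tag, assuming the tagger model is available there.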