While it is useful to match and search for patterns of individual characters in a text, you can often find more meaning by analyzing text on a word-by-word basis, focusing on the part of speech of each word in a sentence. This process of identifying and labeling the part of speech of words is known as part-of-speech tagging!
It may have been a while since you’ve been in English class, so let’s review the nine parts of speech with an example:
Wow! Ramona and her class are happily studying the new textbook she has on NLP.
- Noun: the name of a person (Ramona, class), place, thing (textbook), or idea (NLP)
- Pronoun: a word used in place of a noun (she)
- Determiner: a word that introduces, or “determines”, a noun (the)
- Verb: expresses action (studying) or being (are)
- Adjective: modifies or describes a noun or pronoun (new)
- Adverb: modifies or describes a verb, an adjective, or another adverb (happily)
- Preposition: a word placed before a noun or pronoun to form a phrase modifying another word in the sentence (on)
- Conjunction: a word that joins words, phrases, or clauses (and)
- Interjection: a word used to express emotion (Wow)
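To make the idea of a tagged sentence concrete, here is a minimal sketch that labels the example sentence by hand and groups its words by part of speech. The labels are a manual reading of the sentence, not the output of any tagger, and the variable names are made up for illustration. Note that "her" is labeled a pronoun in the traditional sense; some grammars treat it as a determiner.

```python
from collections import defaultdict

# Hand-assigned (word, part of speech) pairs for the example sentence.
# These labels are an illustrative assumption, not tagger output.
hand_tagged_sentence = [
    ("Wow", "interjection"),
    ("Ramona", "noun"),
    ("and", "conjunction"),
    ("her", "pronoun"),  # possessive; some grammars call this a determiner
    ("class", "noun"),
    ("are", "verb"),
    ("happily", "adverb"),
    ("studying", "verb"),
    ("the", "determiner"),
    ("new", "adjective"),
    ("textbook", "noun"),
    ("she", "pronoun"),
    ("has", "verb"),
    ("on", "preposition"),
    ("NLP", "noun"),
]

# Group the words under each part of speech, keeping sentence order.
words_by_part_of_speech = defaultdict(list)
for word, part_of_speech in hand_tagged_sentence:
    words_by_part_of_speech[part_of_speech].append(word)

for part_of_speech, words in words_by_part_of_speech.items():
    print(f"{part_of_speech}: {', '.join(words)}")
```

A list of (word, label) pairs like this is the same shape of data that an automatic tagger produces, only with abbreviated labels.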
You can automate the part-of-speech tagging process with NLTK’s pos_tag() function! The function takes one argument, a list of words in the order they appear in a sentence, and returns a list of tuples, where the first entry in each tuple is a word and the second is its part-of-speech tag.
Given the sentence split into a list of words below:
word_sentence = ['do', 'you', 'suppose', 'oz', 'could', 'give', 'me', 'a', 'heart', '?']
you can tag the parts of speech as follows:
from nltk import pos_tag
part_of_speech_tagged_sentence = pos_tag(word_sentence)
The call to pos_tag() will return the following:
[('do', 'VB'), ('you', 'PRP'), ('suppose', 'VB'), ('oz', 'NNS'), ('could', 'MD'), ('give', 'VB'), ('me', 'PRP'), ('a', 'DT'), ('heart', 'NN'), ('?', '.')]
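Because the result is an ordinary list of (word, tag) tuples, you can process it with plain Python. The sketch below copies the output shown above into a literal, so it runs without calling the tagger. The names `verbs`, `words`, and `tags` are illustrative only.

```python
# The tagged sentence shown above, copied as a literal so no tagger is needed.
part_of_speech_tagged_sentence = [
    ('do', 'VB'), ('you', 'PRP'), ('suppose', 'VB'), ('oz', 'NNS'),
    ('could', 'MD'), ('give', 'VB'), ('me', 'PRP'), ('a', 'DT'),
    ('heart', 'NN'), ('?', '.'),
]

# Each entry unpacks into a word and its tag.
for word, tag in part_of_speech_tagged_sentence:
    print(f"{word:10} {tag}")

# Keep only the words whose tag marks them as a verb.
verbs = [word for word, tag in part_of_speech_tagged_sentence
         if tag.startswith('VB')]

# Split the pairs back into two parallel sequences.
words, tags = zip(*part_of_speech_tagged_sentence)
```

Using `startswith('VB')` rather than an equality check also catches longer verb tags that begin with the same letters.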
Abbreviations are given instead of the full part-of-speech name. Some common abbreviations include NN for nouns, VB for verbs, RB for adverbs, JJ for adjectives, and DT for determiners. The tags follow the Penn Treebank tag set, whose documentation gives the complete list of part-of-speech tags and their abbreviations.
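As a rough illustration of reading these abbreviations, the sketch below maps the five tags listed above to their full names and counts how often each appears in the tagged sentence from earlier. It assumes the tags follow the Penn Treebank convention, in which longer tags such as NNS extend a two-letter base tag. The helper `describe_tag` and the "other" bucket are hypothetical names introduced for this example.

```python
from collections import Counter

# Only the five abbreviations named in the text; everything else is "other".
TAG_NAMES = {
    "NN": "noun",
    "VB": "verb",
    "RB": "adverb",
    "JJ": "adjective",
    "DT": "determiner",
}

def describe_tag(tag):
    """Return a full name for a tag, matching on its first two characters."""
    # Assumption: tags such as NNS or VBD share the base tag's first two letters.
    return TAG_NAMES.get(tag[:2], "other")

tagged_sentence = [
    ('do', 'VB'), ('you', 'PRP'), ('suppose', 'VB'), ('oz', 'NNS'),
    ('could', 'MD'), ('give', 'VB'), ('me', 'PRP'), ('a', 'DT'),
    ('heart', 'NN'), ('?', '.'),
]

# Count how many words fall under each full part-of-speech name.
counts = Counter(describe_tag(tag) for _, tag in tagged_sentence)
print(counts)
```

Tags outside the five listed ones, such as PRP, MD, and the punctuation tag, land in the "other" bucket here; a fuller table would name them too.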
Provided to you in the workspace is the text of The Wonderful Wizard of Oz, broken down into individual words on a sentence-by-sentence basis in a process known as tokenization. These sentences are called word tokenized sentences, which are stored in word_tokenized_oz.
Save the value stored at index 100 in word_tokenized_oz to a variable named witches_fate, and print it. You should see a sentence from the novel, split into individual words, print to the terminal.
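To show what a list of word tokenized sentences looks like and how indexing into it works, here is a simplified sketch. It uses a naive regular-expression tokenizer, which is an assumption for illustration and not the tokenizer used to build word_tokenized_oz. The function `word_tokenize_sentences` and the sample text are hypothetical.

```python
import re

def word_tokenize_sentences(text):
    """Naively split text into sentences, then each sentence into tokens."""
    # Break wherever ., ! or ? is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # \w+ grabs a run of letters or digits; [^\w\s] grabs one punctuation mark.
    return [re.findall(r"\w+|[^\w\s]", sentence)
            for sentence in sentences if sentence]

sample_text = (
    "Wow! Ramona and her class are happily studying the new textbook "
    "she has on NLP. Do you suppose Oz could give me a heart?"
)

word_tokenized_sample = word_tokenize_sentences(sample_text)

# Each element is one sentence, stored as a list of word tokens.
third_sentence = word_tokenized_sample[2]
print(third_sentence)
```

Indexing the outer list selects a whole sentence, which is why a single index is enough to pull one tokenized sentence out of the novel.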
Since the text has been broken down to individual words on a sentence by sentence level, you now can part-of-speech tag each word tokenized sentence in The Wonderful Wizard of Oz! Begin by creating an empty list named pos_tagged_oz to hold the part-of-speech tagged sentences.
Create a for-loop through each word tokenized sentence in word_tokenized_oz. Within the for-loop, part-of-speech tag each word tokenized sentence and append the result to pos_tagged_oz.
Save the part-of-speech tagged sentence at index 100 to a variable named witches_fate_pos, and print it.
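The loop-and-append pattern from these steps can be sketched as follows. So that the example runs without the workspace data or a tagger installed, it takes the tagger as a parameter and uses a tiny stand-in. `tag_all_sentences`, `toy_tagger`, and `TOY_TAGS` are hypothetical names, and the "UNK" tag is invented for words the stand-in does not know. In the exercise you would presumably pass NLTK's pos_tag and word_tokenized_oz instead.

```python
def tag_all_sentences(word_tokenized_sentences, tagger):
    """Tag every tokenized sentence and collect the results in order."""
    pos_tagged = []  # starts empty, one tagged sentence is appended per loop
    for sentence in word_tokenized_sentences:
        pos_tagged.append(tagger(sentence))
    return pos_tagged

# Stand-in tagger (assumption): looks words up in a small fixed table.
TOY_TAGS = {"give": "VB", "me": "PRP", "a": "DT", "heart": "NN"}

def toy_tagger(words):
    # Returns (word, tag) tuples, the same shape pos_tag() is described as returning.
    return [(word, TOY_TAGS.get(word, "UNK")) for word in words]

word_tokenized_sample = [["give", "me", "a", "heart"], ["a", "heart"]]
pos_tagged_sample = tag_all_sentences(word_tokenized_sample, toy_tagger)

# Indexing the result retrieves one tagged sentence, as in the final step.
tagged_second_sentence = pos_tagged_sample[1]
print(tagged_second_sentence)
```

Because the tagged list is built in the same order as the input, index 100 of the tagged list corresponds to index 100 of the tokenized list.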