And there you go! Now you have the toolkit to dig into any piece of text data and perform natural language parsing with regular expressions. What insights will you gain, or what bias may you uncover? Let’s review what you have learned:
- The `re` module’s `.compile()` and `.match()` methods allow you to enter any regex pattern and look for a single match at the beginning of a piece of text
- The `re` module’s `.search()` method lets you find a single match to a regex pattern anywhere in a string, while the `.findall()` method finds all the matches of a regex pattern in a string
- Part-of-speech tagging identifies and labels the part of speech of words in a sentence, and can be performed in `nltk` using the `pos_tag()` function
- Chunking groups together patterns of words by their part-of-speech tag. Chunking can be performed in `nltk` by defining a piece of chunk grammar using regular expression syntax and calling a `RegexpParser`’s `.parse()` method on a word-tokenized, part-of-speech-tagged sentence
- NP-chunking chunks together an optional determiner `DT`, any number of adjectives `JJ`, and a noun `NN` to form a noun phrase. The frequency of different NP-chunks can identify important topics in a text or demonstrate how an author describes different subjects
- VP-chunking chunks together a verb `VB`, a noun phrase, and an optional adverb `RB` to form a verb phrase. The frequency of different VP-chunks can give insight into what kind of action different subjects take or how the actions that different subjects take are described by an author, potentially indicating bias
- Chunk filtering provides an alternative means of chunking by specifying what parts of speech you do not want in a chunk and removing them
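
As a quick refresher on the first two points, here is a minimal sketch of how the three matching methods differ. The sample sentence and variable names are made up for illustration and are not taken from the lesson's workspace.

```python
import re

# Hypothetical sample text, used only for this illustration
text = "Toto ran to Dorothy, and Dorothy ran to the Scarecrow."

# .compile() turns a regex string into a reusable pattern object
pattern = re.compile(r"Dorothy")

# .match() only looks at the very beginning of the text,
# so it finds nothing here because the text starts with "Toto"
match_at_start = pattern.match(text)

# .search() returns the first match found anywhere in the text
first_match = pattern.search(text)

# .findall() returns every non-overlapping match as a list of strings
all_matches = pattern.findall(text)

print(match_at_start)
print(first_match.group())
print(all_matches)
```

Note that `.match()` and `.search()` return a match object (or `None`), while `.findall()` returns a plain list.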
Instructions
The code in the workspace is set up to perform natural language parsing on The Wonderful Wizard of Oz. However, the chunk grammar is empty! Instead of finding NP-chunks or VP-chunks, define your own chunk grammar using regular expressions in between the curly braces `{}`. Feel free to add any chunk filtering in between the inverted braces `}{` if you so desire!
Run the code and observe the frequencies of the chunks. What insights or knowledge do the chunk frequencies give you? Have you come to any different conclusions than from analyzing the NP-chunks and VP-chunks?
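
If you are unsure where to start, the sketch below shows one possible shape for a custom chunk grammar with chunk filtering. It is only an illustration: the chunk label `Custom`, the hand-tagged sentence, and the variable names are assumptions, and the workspace code may be organized differently. The sentence is tagged by hand so the example runs without downloading a tagger model.

```python
from collections import Counter

from nltk import RegexpParser

# Hypothetical, hand-tagged sentence standing in for pos_tag() output
tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("girl", "NN"),
    ("followed", "VBD"),
    ("the", "DT"), ("yellow", "JJ"), ("brick", "NN"), ("road", "NN"),
    ("quickly", "RB"),
]

# {...} chunks every tag into one group; }...{ then filters out
# runs of verbs and adverbs, splitting or trimming the chunk
chunk_grammar = r"""
Custom: {<.*>+}
        }<VB.?|RB>+{
"""

parser = RegexpParser(chunk_grammar)
tree = parser.parse(tagged_sentence)

# Count each chunk by the words it contains
chunk_counts = Counter(
    tuple(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "Custom")
)

for chunk, count in chunk_counts.most_common():
    print(count, " ".join(chunk))
```

Running the same counting step over every sentence in a text is one way to compare how often each custom chunk appears.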