Natural Language Parsing with Regular Expressions
Review

And there you go! Now you have the toolkit to dig into any piece of text data and perform natural language parsing with regular expressions. What insights will you gain, or what bias may you uncover? Let’s review what you have learned:

  • The re module’s .compile() function and a compiled pattern’s .match() method let you define any regex pattern and look for a single match at the beginning of a piece of text
  • The re module’s .search() method lets you find a single match to a regex pattern anywhere in a string, while the .findall() method finds all the matches of a regex pattern in a string
  • Part-of-speech tagging identifies and labels the part of speech of words in a sentence, and can be performed in nltk using the pos_tag() function
  • Chunking groups together patterns of words by their part-of-speech tag. Chunking can be performed in nltk by defining a piece of chunk grammar using regular expression syntax and calling a RegexpParser’s .parse() method on a part-of-speech tagged sentence
  • NP-chunking chunks together an optional determiner DT, any number of adjectives JJ, and a noun NN to form a noun phrase. The frequency of different NP-chunks can identify important topics in a text or demonstrate how an author describes different subjects
  • VP-chunking chunks together a verb VB, a noun phrase, and an optional adverb RB to form a verb phrase. The frequency of different VP-chunks can give insight into what kinds of actions different subjects take, or into how an author describes those actions, potentially indicating bias
  • Chunk filtering provides an alternative means of chunking by specifying what parts of speech you do not want in a chunk and removing them
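
To make the difference between the three matching approaches concrete, here is a minimal sketch using only the standard re module. The sample sentence is invented for illustration and is not taken from the workspace text.

```python
import re

# Illustrative sample text (an assumption, not from the lesson's data)
text = "Dorothy lived in Kansas. Dorothy had a dog named Toto."

# .compile() builds a reusable pattern object
dorothy_pattern = re.compile(r"Dorothy")

# .match() only succeeds if the pattern occurs at the very start of the string
match_at_start = dorothy_pattern.match(text)

# "Toto" appears later in the text, so .match() returns None
no_match = re.compile(r"Toto").match(text)

# .search() finds the first occurrence anywhere in the string
first_anywhere = re.search(r"Toto", text)

# .findall() returns every non-overlapping match as a list of strings
all_names = re.findall(r"Dorothy|Toto", text)

print(match_at_start.group())
print(no_match)
print(first_anywhere.group())
print(all_names)
```

Note that .match() and .search() return a match object (or None), while .findall() returns plain strings.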

Instructions

The code in the workspace is set up to perform natural language parsing on The Wonderful Wizard of Oz. However, the chunk grammar is empty! Instead of finding NP-chunks or VP-chunks, define your own chunk grammar using regular expressions in between the curly braces {}. Feel free to add any chunk filtering in between the inverted braces }{ if you so desire!
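
As a rough sketch of what such a grammar might look like, the example below chunks everything and then filters out past-tense verbs and prepositions. The label Chunk, the variable names, and the hand-tagged sentence are all assumptions made for illustration; the workspace code may be organized differently. The sentence is tagged by hand so that no tagger data needs to be downloaded.

```python
from nltk import RegexpParser

# Hand-tagged (word, tag) pairs; in practice these would come from pos_tag()
tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

# First rule, inside {}: chunk every run of tagged words together.
# Second rule, inside }{: remove past-tense verbs (VBD) and prepositions (IN)
# from those chunks, splitting them where needed.
chunk_grammar = r"""
Chunk: {<.*>+}
       }<VBD|IN>+{
"""

chunk_parser = RegexpParser(chunk_grammar)
chunked = chunk_parser.parse(tagged_sentence)

# Collect the words of each subtree labeled "Chunk"
chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in chunked.subtrees(filter=lambda t: t.label() == "Chunk")
]
print(chunks)
```

With this grammar, the verb and preposition are left outside any chunk, so the sentence splits into two chunks on either side of them.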

Run the code and observe the frequencies of the chunks. What insights or knowledge do the chunk frequencies give you? Have you come to any different conclusions than from analyzing the NP-chunks and VP-chunks?
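
If you want to tally chunk frequencies yourself, one possible approach is collections.Counter. The sketch below assumes each chunk has already been extracted and stored as a tuple of (word, tag) pairs; the sample chunks are invented for illustration and are not results from the book.

```python
from collections import Counter

# Hypothetical chunks, stored as tuples so they can be used as Counter keys
scarecrow = (("the", "DT"), ("scarecrow", "NN"))
lion = (("the", "DT"), ("cowardly", "JJ"), ("lion", "NN"))
witch = (("the", "DT"), ("wicked", "JJ"), ("witch", "NN"))

chunks = [scarecrow, lion, scarecrow, witch, lion, scarecrow]

# Count how often each distinct chunk appears
chunk_counts = Counter(chunks)

# The two most frequent chunks, each paired with its count
most_common = chunk_counts.most_common(2)

for chunk, count in most_common:
    words = " ".join(word for word, tag in chunk)
    print(words, count)
```

Sorting chunks by frequency this way makes it easier to see which subjects or descriptions dominate a text.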
