Natural Language Parsing with Regular Expressions
Introduction to Chunking

You have made it to the juicy stuff! Given your part-of-speech tagged text, you can now use regular expressions to find patterns in sentence structure that give insight into the meaning of a text. This technique of grouping words by their part-of-speech tag is called chunking.

With chunking in nltk, you can define a pattern of part-of-speech tags using a modified notation of regular expressions. You can then find non-overlapping matches, or chunks of words, in the part-of-speech tagged sentences of a text.

The regular expression you build to find chunks is called a chunk grammar. A piece of chunk grammar can be written as follows:

chunk_grammar = "AN: {<JJ><NN>}"
  • AN is a user-defined name for the kind of chunk you are searching for. You can use whatever name makes sense given your chunk grammar. In this case AN stands for adjective-noun.
  • A pair of curly braces {} surrounds the actual chunk grammar
  • <JJ> operates similarly to a regex character class, matching any single word tagged JJ (an adjective)
  • <NN> matches any single word tagged NN (a singular or mass noun). Plural nouns are tagged NNS and are not matched by <NN>

The chunk grammar above will thus match any adjective that is immediately followed by a noun.
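To see why a pattern like <JJ><NN> behaves like a regular expression over tags rather than over characters, here is a minimal sketch that uses only Python's built-in re module. It is an illustration of the idea, not nltk's actual implementation, and the helper name find_an_chunks is hypothetical.

```python
import re

def find_an_chunks(tagged):
    """Return (start, end) token index pairs where a JJ tag is immediately followed by an NN tag."""
    # Encode the tag sequence as one string, e.g. "<WRB><VBZ><DT><JJ><NN><.>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    spans = []
    for match in re.finditer(r"<JJ><NN>", tag_string):
        # Each token contributes exactly one "<", so counting them before
        # the match converts the character offset back into a token index.
        start = tag_string.count("<", 0, match.start())
        spans.append((start, start + 2))
    return spans

sentence = [('where', 'WRB'), ('is', 'VBZ'), ('the', 'DT'),
            ('emerald', 'JJ'), ('city', 'NN'), ('?', '.')]

spans = find_an_chunks(sentence)
chunks = [[word for word, _ in sentence[s:e]] for s, e in spans]
print(chunks)  # [['emerald', 'city']]
```

Because the closing > is part of the pattern, a tag such as NNS does not match <NN> in this sketch, which mirrors the exact-tag behavior described in the list above.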

To use the chunk grammar defined, you must create an nltk RegexpParser object and give it a piece of chunk grammar as an argument.

from nltk import RegexpParser
chunk_parser = RegexpParser(chunk_grammar)

You can then use the RegexpParser object's .parse() method, which takes a list of part-of-speech tagged words as an argument and returns an nltk Tree that marks where such chunks occur in the sentence!

Consider the part-of-speech tagged sentence below:

pos_tagged_sentence = [('where', 'WRB'), ('is', 'VBZ'), ('the', 'DT'), ('emerald', 'JJ'), ('city', 'NN'), ('?', '.')]

You can chunk the sentence to find any adjectives followed by a noun with the following:

chunked = chunk_parser.parse(pos_tagged_sentence)
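Putting the pieces above together, a minimal end-to-end sketch might look like the following. It assumes nltk is installed. The an_chunks variable and the way the AN subtrees are collected are illustrative additions, not part of the lesson.

```python
from nltk import RegexpParser, Tree

# Chunk grammar: an adjective (JJ) immediately followed by a noun (NN).
chunk_grammar = "AN: {<JJ><NN>}"
chunk_parser = RegexpParser(chunk_grammar)

pos_tagged_sentence = [('where', 'WRB'), ('is', 'VBZ'), ('the', 'DT'),
                       ('emerald', 'JJ'), ('city', 'NN'), ('?', '.')]

# .parse() returns a Tree whose AN subtrees hold the matched chunks.
chunked = chunk_parser.parse(pos_tagged_sentence)
print(chunked)

# Collect the words of every AN chunk (illustrative helper logic).
an_chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in chunked.subtrees(filter=lambda t: t.label() == "AN")
]
print(an_chunks)  # [['emerald', 'city']]
```

Words that are not part of any chunk stay as plain (word, tag) tuples directly under the root of the tree, so the original sentence order is preserved.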

Instructions

1.

Define a piece of chunk grammar named chunk_grammar that will chunk a single adjective followed by a single noun. Name the chunk AN.

2.

Create a RegexpParser object called chunk_parser using chunk_grammar as an argument.

3.

The part-of-speech tagged novel pos_tagged_oz from the previous exercise has been given to you in the workspace.

Chunk the part-of-speech tagged sentence stored at index 282 in pos_tagged_oz using chunk_parser's .parse() method. Save the result to a variable named scaredy_cat, and print it. The chunked sequences of an adjective followed by a noun will be indicated with an AN, the chunk name you defined earlier.

4.

nltk allows you to better visualize a chunked sentence with the .pretty_print() method. Uncomment the last line in the workspace and run the code to view the chunked sentence. Expand the output terminal all the way to the left to get a better view!
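As a rough sketch of what that last step does, the snippet below calls .pretty_print() on a small chunked sentence. It assumes nltk is installed and uses the short example sentence from earlier in place of the workspace data, which is not reproduced here.

```python
from nltk import RegexpParser

chunk_parser = RegexpParser("AN: {<JJ><NN>}")

# Stand-in sentence; the workspace uses a sentence from pos_tagged_oz instead.
tagged = [('where', 'WRB'), ('is', 'VBZ'), ('the', 'DT'),
          ('emerald', 'JJ'), ('city', 'NN'), ('?', '.')]

chunked = chunk_parser.parse(tagged)

# Draws the tree as text: the root S on top, with AN chunks as branches.
chunked.pretty_print()
```

The drawing can get wide for long sentences, which is why a larger output area helps.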
