In this course you will learn how to perform Natural Language Parsing with Regular Expressions to discover insights in text data!

Discovering new code words in declassified CIA documents may seem like a mission for a foreign intelligence service, and detecting gender biases in the Harry Potter novels a task for a literature professor. Yet by utilizing _**natural language parsing with regular expressions**_, the power to perform such analyses is in your own hands!

While you may not put much explicit thought into the structure of your sentences as you write, the syntax choices you make are critical in ensuring your writing has meaning. Analyzing such sentence structure as well as word choice can not only provide insights into the connotation of a piece of text, but can also <a href="https://medium.com/agatha-codes/a-bossy-sort-of-voice-3c3a18de3093" target="_blank" rel="noopener noreferrer">highlight the biases of its author</a> or <a href="https://twitter.com/hexadecim8/status/1068215227274137605" target="_blank" rel="noopener noreferrer">uncover additional insights that even a deep, rigorous reading of the text might not reveal</a>.

By using <a href="https://docs.python.org/3/library/re.html" target="_blank" rel="noopener noreferrer">_Python's regular expression module `re`_</a> and the <a href="https://www.nltk.org/" target="_blank" rel="noopener noreferrer">_Natural Language Toolkit_</a>, known as NLTK, you can find keywords of interest, discover where and how often they are used, and discern the parts-of-speech patterns in which they appear to understand the sometimes hidden meaning in a piece of writing. Let's get started!


Introduction

Before you dive into more complex syntax parsing, you'll begin with basic regular expressions in Python using the <a href="https://docs.python.org/3/library/re.html" target="_blank" rel="noopener noreferrer">_`re` module_</a> as a regex refresher.

The first method you will explore is _**`.compile()`**_. This method takes a regular expression pattern as an argument and compiles the pattern into a regular expression object, which you can later use to find matching text. The regular expression object below matches exactly `4` uppercase or lowercase letters.

```py
import re
regular_expression_object = re.compile("[A-Za-z]{4}")
```

Regular expression objects have a _**`.match()`**_ method that takes a string of text as an argument and looks for a _single_ match to the regular expression that _starts at the beginning_ of the string. To see if your regular expression matches the string `"Toto"` you can do the following:

```py
result = regular_expression_object.match("Toto")
```

If `.match()` finds a match that starts at the beginning of the string, it will return a match object. The match object lets you know what piece of text the regular expression matched, and at what index the match begins and ends. If there is no match, `.match()` will return `None`.

With the match object stored in `result`, you can access the matched text by calling `result.group(0)`. If you use a regex containing capture groups, you can access these groups by calling `.group()` with the appropriately numbered capture group as an argument.
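As a minimal sketch of working with a match object, the snippet below reuses the four-letter pattern from above. The second pattern with two capture groups and the strings `"42 Toto"` and `"Toto Oz"` are assumptions made for illustration only.

```python
import re

# Compile the four-letter pattern from the lesson and match it against "Toto".
regular_expression_object = re.compile("[A-Za-z]{4}")
result = regular_expression_object.match("Toto")

matched_text = result.group(0)                # the full matched text
match_span = (result.start(), result.end())   # where the match begins and ends

# .match() only looks at the beginning of the string, so this returns None.
no_match = regular_expression_object.match("42 Toto")

# Illustrative pattern (an assumption for this sketch) with two capture groups.
grouped = re.match("([A-Za-z]{4}) ([A-Za-z]{2})", "Toto Oz")
first_group = grouped.group(1)    # text captured by the first pair of parentheses
second_group = grouped.group(2)   # text captured by the second pair of parentheses
```

Because `.match()` can return `None`, it is usually safest to check the result before calling `.group()` on it.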

Instead of compiling the regular expression first and then looking for a match in separate lines of code, you can simplify your match to one line:

```py
result = re.match("[A-Za-z]{4}","Toto")
```

With this syntax, `re`'s `.match()` method takes a regular expression pattern as the first argument and a string as the second argument.


Compiling and Matching

You can make your regular expression matches even more dynamic with the help of the `.search()` method. Unlike `.match()`, which only finds matches at the start of a string, _**`.search()`**_ will look left to right through an entire piece of text and return a match object for the first match to the regular expression given. If no match is found, `.search()` will return `None`. For example, to search for a sequence of `8` word characters in the string `Are you a Munchkin?`:

```py
result = re.search(r"\w{8}", "Are you a Munchkin?")
```

Using `.search()` on the string above will find a match of `"Munchkin"`, while using `.match()` on the same string would return `None`!
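The difference between the two methods can be checked side by side. This is a small sketch using the same pattern and string as above; the variable names are chosen for this example.

```python
import re

sentence = "Are you a Munchkin?"

# .search() scans the whole string and returns the first match it finds.
search_result = re.search(r"\w{8}", sentence)

# .match() only tries at index 0, where "Are" is just three word characters.
match_result = re.match(r"\w{8}", sentence)

# Guard against None before reading details off the match object.
found_text = search_result.group(0) if search_result else None
found_at = search_result.span() if search_result else None
```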

So far you have used methods that only return one piece of matching text. What if you want to find all the occurrences of a word or keyword in a piece of text to determine a frequency count? Step in the `.findall()` method!

Given a regular expression as its first argument and a string as its second argument, _**`.findall()`**_ will return a list of all _non-overlapping_ matches of the regular expression in the string. Consider the below piece of text:

```py
text = "Everything is green here, while in the country of the Munchkins blue was the favorite color. But the people do not seem to be as friendly as the Munchkins, and I'm afraid we shall be unable to find a place to pass the night."
```

To find all _non-overlapping_ sequences of `8` word characters in the sentence you can do the following:

```py
list_of_matches = re.findall(r"\w{8}", text)
```

`.findall()` will thus return the list `['Everythi', 'Munchkin', 'favorite', 'friendly', 'Munchkin']`.
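To turn a list of matches into the frequency count mentioned earlier, one option is to pass it to `collections.Counter`. The sketch below is one possible approach, not the only one; the names `match_counts` and `munchkins_count` are assumptions for this example.

```python
import re
from collections import Counter

text = "Everything is green here, while in the country of the Munchkins blue was the favorite color. But the people do not seem to be as friendly as the Munchkins, and I'm afraid we shall be unable to find a place to pass the night."

# All non-overlapping runs of exactly 8 word characters, scanning left to right.
list_of_matches = re.findall(r"\w{8}", text)

# Tally how often each matched piece of text occurs.
match_counts = Counter(list_of_matches)

# A literal pattern counts the occurrences of one specific keyword.
munchkins_count = len(re.findall("Munchkins", text))
```

Note that `\w{8}` cuts longer words short (`Everything` yields `Everythi`), so counts built this way describe eight-character runs and not whole words.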


Searching and Finding

While it is useful to match and search for patterns of individual characters in a text, you can often find more meaning by analyzing text on a word-by-word basis, focusing on the part of speech of each word in a sentence. This process of identifying and labeling the part of speech of words is known as _**part-of-speech tagging**_!

It may have been a while since you've been in English class, so let's review <a href="https://content.codecademy.com/courses/nlp-regex-parsing/nlp_regex_parsing_part_of_speech_table.pdf" rel="noopener noreferrer">the nine parts of speech</a> with an example:

`Wow! Ramona and her class are happily studying the new textbook she has on NLP.`

* **Noun:** the name of a person (`Ramona`,`class`), place, thing (`textbook`), or idea (`NLP`)
* **Pronoun:** a word used in place of a noun (`her`,`she`)
* **Determiner:** a word that introduces, or "determines", a noun (`the`)
* **Verb:** expresses action (`studying`) or being (`are`,`has`)
* **Adjective:** modifies or describes a noun or pronoun (`new`)
* **Adverb:** modifies or describes a verb, an adjective, or another adverb (`happily`)
* **Preposition:** a word placed before a noun or pronoun to form a phrase modifying another word in the sentence (`on`)
* **Conjunction:** a word that joins words, phrases, or clauses (`and`)
* **Interjection:** a word used to express emotion (`Wow`)

You can automate the part-of-speech tagging process with `nltk`'s `pos_tag()` function! The function takes one argument, a list of words in the order they appear in a sentence, and returns a list of tuples, where the first entry in the tuple is a word and the second is the part-of-speech tag.

Given the sentence split into a list of words below:

```py
word_sentence = ['do', 'you', 'suppose', 'oz', 'could', 'give', 'me', 'a', 'heart', '?']
```

you can tag the parts of speech as follows:

```py
from nltk import pos_tag
part_of_speech_tagged_sentence = pos_tag(word_sentence)
```

The call to `pos_tag()` will return the following:

```py
[('do', 'VB'), ('you', 'PRP'), ('suppose', 'VB'), ('oz', 'NNS'), ('could', 'MD'), ('give', 'VB'), ('me', 'PRP'), ('a', 'DT'), ('heart', 'NN'), ('?', '.')]
```

Abbreviations are given instead of the full part of speech name. Some common abbreviations include: `NN` for nouns, `VB` for verbs, `RB` for adverbs, `JJ` for adjectives, and `DT` for determiners. A <a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html" target="_blank" rel="noopener noreferrer">complete list of part-of-speech tags and their abbreviations can be found here</a>.
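Since `pos_tag()` returns plain tuples, ordinary Python is enough to start exploring the tags. The sketch below hard-codes the tagged sentence shown above, so it runs without NLTK or its tagger data; the variable names are assumptions for this example.

```python
from collections import Counter

# The tagged sentence from the lesson, copied in as a literal list of tuples.
part_of_speech_tagged_sentence = [
    ('do', 'VB'), ('you', 'PRP'), ('suppose', 'VB'), ('oz', 'NNS'),
    ('could', 'MD'), ('give', 'VB'), ('me', 'PRP'), ('a', 'DT'),
    ('heart', 'NN'), ('?', '.'),
]

# Count how many times each part-of-speech tag appears.
tag_counts = Counter(tag for word, tag in part_of_speech_tagged_sentence)

# Collect every word whose tag starts with "VB", i.e. any verb form.
verbs = [word for word, tag in part_of_speech_tagged_sentence if tag.startswith("VB")]
```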


Part-of-Speech Tagging

You have made it to the juicy stuff! Given your part-of-speech tagged text, you can now use regular expressions to find patterns in sentence structure that give insight into the meaning of a text. This technique of grouping words by their part-of-speech tag is called _**chunking**_.

With chunking in `nltk`, you can define a pattern of parts-of-speech tags using a modified notation of regular expressions. You can then find non-overlapping matches, or _chunks_ of words, in the part-of-speech tagged sentences of a text.

The regular expression you build to find chunks is called _chunk grammar_. A piece of chunk grammar can be written as follows:

```py
chunk_grammar = "AN: {<JJ><NN>}"
```
* `AN` is a user-defined name for the kind of chunk you are searching for. You can use whatever name makes sense given your chunk grammar. In this case `AN` stands for adjective-noun.
* A pair of curly braces `{}` surrounds the actual chunk grammar
* `<JJ>` operates similarly to a regex character class, matching any word tagged `JJ` (an adjective)
* `<NN>` matches any word tagged `NN`, a singular or mass noun. Plural nouns carry the separate tag `NNS` and are not matched by `<NN>`

The chunk grammar above will thus match any adjective that is followed by a noun.

To use the chunk grammar defined, you must create an `nltk` `RegexpParser` object and give it a piece of chunk grammar as an argument.

```py
from nltk import RegexpParser
chunk_parser = RegexpParser(chunk_grammar)
```

You can then use the `RegexpParser` object's `.parse()` method, which takes a list of part-of-speech tagged words as an argument, and identifies where such chunks occur in the sentence!

Consider the part-of-speech tagged sentence below:

```py
pos_tagged_sentence = [('where', 'WRB'), ('is', 'VBZ'), ('the', 'DT'), ('emerald', 'JJ'), ('city', 'NN'), ('?', '.')]
```

You can chunk the sentence to find any adjectives followed by a noun with the following:

```py
chunked = chunk_parser.parse(pos_tagged_sentence)
```
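Putting the steps together, the following sketch parses the sentence and pulls the `AN` chunks back out of the result. It assumes `nltk` is installed; the name `an_chunks` is chosen for this example. `.parse()` returns an `nltk` `Tree`, and its `.subtrees()` and `.leaves()` methods are used here to read the chunks.

```python
from nltk import RegexpParser

# Chunk grammar and tagged sentence as defined in the lesson.
chunk_grammar = "AN: {<JJ><NN>}"
chunk_parser = RegexpParser(chunk_grammar)
pos_tagged_sentence = [('where', 'WRB'), ('is', 'VBZ'), ('the', 'DT'),
                       ('emerald', 'JJ'), ('city', 'NN'), ('?', '.')]

# .parse() returns a Tree: words outside any chunk stay as (word, tag) tuples,
# while each chunk becomes a subtree labeled with the chunk name.
chunked = chunk_parser.parse(pos_tagged_sentence)

# Keep only the subtrees labeled "AN" and read their (word, tag) leaves.
an_chunks = [
    subtree.leaves()
    for subtree in chunked.subtrees(filter=lambda tree: tree.label() == "AN")
]
```

Printing `chunked` shows the whole sentence with the chunk nested inside it, which is a quick way to check that a grammar matches what you expect.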

Introduction to Chunking

While you are able to chunk any sequence of parts of speech that you like, there are certain types of chunking that are linguistically helpful for determining meaning and bias in a piece of text. One such type of chunking is _NP-chunking_, or _**noun phrase chunking**_. A _noun phrase_ is a phrase that contains a noun and operates, as a unit, as a noun.

A popular form of noun phrase begins with a _determiner_ `DT`, which specifies the noun being referenced, followed by any number of _adjectives_ `JJ`, which describe the noun, and ends with a noun `NN`.

Consider the part-of-speech tagged sentence below:

```py
[('we', 'PRP'), ('are', 'VBP'), ('so', 'RB'), ('grateful', 'JJ'), ('to', 'TO'), ('you', 'PRP'), ('for', 'IN'), ('having', 'VBG'), ('killed', 'VBN'), ('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN'), ('of', 'IN'), ('the', 'DT'), ('east', 'NN'), (',', ','), ('and', 'CC'), ('for', 'IN'), ('setting', 'VBG'), ('our', 'PRP$'), ('people', 'NNS'), ('free', 'VBP'), ('from', 'IN'), ('bondage', 'NN'), ('.', '.')]
```

Can you spot the three noun phrases of the form described above? They are:
* `(('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN'))`
* `(('the', 'DT'), ('east', 'NN'))`
* `(('bondage', 'NN'))`

With the help of a regular expression defined chunk grammar, you can easily find all the _non-overlapping_ noun phrases in a piece of text! Just like in normal regular expressions, you can use quantifiers to indicate how many of each part of speech you want to match.

The chunk grammar for a noun phrase can be written as follows:

```py
chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"
```
* `NP` is the user-defined name of the chunk you are searching for. In this case `NP` stands for noun phrase
* `<DT>` matches any determiner
* `?` is an _optional quantifier_, matching either `0` or `1` determiners
* `<JJ>` matches any adjective
* `*` is the _Kleene star_ quantifier, matching `0` or more occurrences of an adjective
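* `<NN>` matches any word tagged `NN`, a singular or mass noun. Plural nouns are tagged `NNS`, which is why `people` is not among the noun phrases found above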
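As a sketch of NP-chunking in practice, the code below applies this grammar to the tagged sentence shown earlier and joins the words of each chunk into a string. It assumes `nltk` is installed; `np_chunk_parser` and `noun_phrases` are names chosen for this example.

```python
from nltk import RegexpParser

pos_tagged_sentence = [
    ('we', 'PRP'), ('are', 'VBP'), ('so', 'RB'), ('grateful', 'JJ'), ('to', 'TO'),
    ('you', 'PRP'), ('for', 'IN'), ('having', 'VBG'), ('killed', 'VBN'),
    ('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN'), ('of', 'IN'), ('the', 'DT'),
    ('east', 'NN'), (',', ','), ('and', 'CC'), ('for', 'IN'), ('setting', 'VBG'),
    ('our', 'PRP$'), ('people', 'NNS'), ('free', 'VBP'), ('from', 'IN'),
    ('bondage', 'NN'), ('.', '.'),
]

# Optional determiner, any number of adjectives, then a noun tagged NN.
np_chunk_parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
np_chunked = np_chunk_parser.parse(pos_tagged_sentence)

# Join the words of every NP subtree, in the order the chunks appear.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in np_chunked.subtrees(filter=lambda tree: tree.label() == "NP")
]
```

If plural nouns should count as well, a pattern such as `<NN.?>` in place of `<NN>` would also accept the `NNS` tag.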

By finding all the NP-chunks in a text, you can perform a frequency analysis and identify important, recurring noun phrases. You can also use these NP-chunks as pseudo-topics and tag articles and documents by their highest count NP-chunks! Or perhaps your analysis has you looking at the adjective choices an author makes for different nouns. 

It is ultimately up to you, with your knowledge of the text you are working with, to interpret the meaning and use-case of the NP-chunks and their frequency of occurrence.


Chunking Noun Phrases

Another popular type of chunking is _VP-chunking_, or _**verb phrase chunking**_. A _verb phrase_ is a phrase that contains a verb and its complements, objects, or modifiers.

Verb phrases can take a variety of structures, and here you will consider two. The first structure begins with a verb `VB` of any tense, followed by a noun phrase, and ends with an optional adverb `RB` of any form. The second structure switches the order of the verb and the noun phrase, but also ends with an optional adverb.

Both structures are considered because verb phrases of each form are essentially the same in meaning. For example, consider the part-of-speech tagged verb phrases given below:

* `(('said', 'VBD'), ('the', 'DT'), ('cowardly', 'JJ'), ('lion', 'NN'))`
* `(('the', 'DT'), ('cowardly', 'JJ'), ('lion', 'NN'), ('said', 'VBD'))`

The chunk grammar to find the first form of verb phrase is given below:

```py
chunk_grammar = "VP: {<VB.*><DT>?<JJ>*<NN><RB.?>?}"
```
* `VP` is the user-defined name of the chunk you are searching for. In this case `VP` stands for verb phrase
* `<VB.*>` matches any verb using the `.` as a wildcard and the `*` quantifier to match `0` or more occurrences of any character. This ensures matching verbs of any tense or form (e.g. `VB` for the base form, `VBD` for past tense, or `VBN` for past participle)
* `<DT>?<JJ>*<NN>` matches any noun phrase
* `<RB.?>` matches any adverb using the `.` as a wildcard and the _optional quantifier_ to match `0` or `1` occurrence of any character. This ensures matching any form of adverb (regular `RB`, comparative `RBR`, or superlative `RBS`)
* `?` is an _optional quantifier_, matching either `0` or `1` adverbs

The chunk grammar for the second form of verb phrase is given below:

```py
chunk_grammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"
```
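Both grammars can be tried on the two example phrases from above. This is a minimal sketch that assumes `nltk` is installed; the helper `vp_chunks()` and the variable names are hypothetical and exist only for this example.

```python
from nltk import RegexpParser

verb_first = [('said', 'VBD'), ('the', 'DT'), ('cowardly', 'JJ'), ('lion', 'NN')]
noun_first = [('the', 'DT'), ('cowardly', 'JJ'), ('lion', 'NN'), ('said', 'VBD')]

# One parser per verb phrase structure described in the lesson.
verb_first_parser = RegexpParser("VP: {<VB.*><DT>?<JJ>*<NN><RB.?>?}")
noun_first_parser = RegexpParser("VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}")


def vp_chunks(parser, tagged_sentence):
    """Return the words of each VP chunk found by parser, joined into a string."""
    tree = parser.parse(tagged_sentence)
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "VP")
    ]


first_form = vp_chunks(verb_first_parser, verb_first)
second_form = vp_chunks(noun_first_parser, noun_first)

# The verb-first grammar finds nothing when the verb comes last.
crossed = vp_chunks(verb_first_parser, noun_first)
```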

Just like with NP-chunks, you can find all the VP-chunks in a text and perform a frequency analysis to identify important, recurring verb phrases. These verb phrases can give insight into what kind of action different characters take or how the actions that characters take are described by the author.

Once again, this is the part of the analysis where you get to be creative and use your own knowledge about the text you are working with to find interesting insights!

Chunking Verb Phrases

Another option you have to find chunks in your text is _chunk filtering_. _**Chunk filtering**_ lets you define what parts of speech you _do not want_ in a chunk and remove them. NLTK's own documentation refers to this technique as _chinking_.

A popular method for performing chunk filtering is to chunk an entire sentence together and then indicate which parts of speech are to be filtered out. If the filtered parts of speech are in the middle of a chunk, removing them splits the chunk into two separate chunks! The chunk grammar you can use to perform chunk filtering is given below:

```py
chunk_grammar = """NP: {<.*>+}
                       }<VB.?|IN>+{"""
```
* `NP` is the user-defined name of the chunk you are searching for. In this case `NP` stands for noun phrase
* The curly braces `{}` indicate what parts of speech you are chunking. `<.*>+` matches every part of speech in the sentence
* The inverted curly braces `}{` indicate which parts of speech you want to filter from the chunk. `<VB.?|IN>+` will filter out any verbs or prepositions
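To see the splitting behavior, this sketch applies the grammar to the short tagged sentence used in the chunking introduction. It assumes `nltk` is installed; `filtered` and `chunks` are names chosen for this example.

```python
from nltk import RegexpParser

# First chunk everything, then filter verbs and prepositions back out.
chunk_grammar = """NP: {<.*>+}
                       }<VB.?|IN>+{"""
chunk_parser = RegexpParser(chunk_grammar)

pos_tagged_sentence = [('where', 'WRB'), ('is', 'VBZ'), ('the', 'DT'),
                       ('emerald', 'JJ'), ('city', 'NN'), ('?', '.')]
filtered = chunk_parser.parse(pos_tagged_sentence)

# The verb "is" sits in the middle of the sentence-wide chunk, so removing it
# leaves one chunk on each side.
chunks = [
    subtree.leaves()
    for subtree in filtered.subtrees(filter=lambda tree: tree.label() == "NP")
]
```

Here the chunk is still named `NP`, but it contains everything that is not a verb or preposition, including the question word and the punctuation.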

Chunk filtering provides an alternate way for you to search through a text and find the chunks of information useful for your analysis!


Chunk Filtering

And there you go! Now you have the toolkit to dig into any piece of text data and perform natural language parsing with regular expressions. What insights will you gain, or what bias may you uncover? Let's review what you have learned:
* The `re` module's `.compile()` and `.match()` methods allow you to enter any regex pattern and look for a single match at the beginning of a piece of text
* The `re` module's `.search()` method lets you find a single match to a regex pattern anywhere in a string, while the `.findall()` method finds all the matches of a regex pattern in a string
* _Part-of-speech tagging_ identifies and labels the part of speech of words in a sentence, and can be performed in `nltk` using the `pos_tag()` function
* _Chunking_ groups together patterns of words by their part-of-speech tag. Chunking can be performed in `nltk` by defining a piece of chunk grammar using regular expression syntax and calling a `RegexpParser`'s `.parse()` method on a part-of-speech tagged sentence
* _NP-chunking_ chunks together an optional determiner `DT`, any number of adjectives `JJ`, and a noun `NN` to form a noun phrase. The frequency of different NP-chunks can identify important topics in a text or demonstrate how an author describes different subjects
* _VP-chunking_ chunks together a verb `VB`, a noun phrase, and an optional adverb `RB` to form a verb phrase. The frequency of different VP-chunks can give insight into what kind of action different subjects take or how the actions that different subjects take are described by an author, potentially indicating bias
* _Chunk filtering_ provides an alternative means of chunking by specifying what parts of speech you do not want in a chunk and removing them

Review

Natural Language Parsing with Regular Expressions

Apply regular expressions (regex) and other natural language parsing tactics to find meaning and insights in the texts you read every day.

Language Parsing

match a regex pattern anywhere in a string

match a regex pattern at the beginning of a string

find all matches to a regex pattern in a string

check to see if two regular expressions are equivalent

import re

text = "We are what we pretend to be, so we must be careful about what we pretend to be."
result = re.search(r'\w{7}', text)
matched_text = result.group(0)

[('Everything', 'NN'), ('was', 'VBD'), ('beautiful', 'JJ'), ('and', 'CC'), ('nothing', 'NN'), ('hurt', 'VBN'), ('.', '.')]

[('Everything', 'NN'), ('was', 'JJ'), ('beautiful', 'VBD'), ('and', 'CC'), ('nothing', 'NN'), ('hurt', 'VBN'), ('.', '.')]

[('Everything', 'JJ'), ('was', 'VBD'), ('beautiful', 'NN'), ('and', 'CC'), ('nothing', 'NN'), ('hurt', 'VBN'), ('.', '.')]

[('Everything', 'NN'), ('was', 'CC'), ('beautiful', 'JJ'), ('and', 'NN'), ('nothing', 'NN'), ('hurt', 'VBN'), ('.', '.')]

The `.parse()` method uses the `RegexpParser`'s chunk grammar to chunk a sentence that is passed to it.

The `.parse()` creates a piece of chunk grammar that can be used to chunk a sentence.

The `.parse()` method uses the `RegexpParser`'s chunk grammar to define a noun phrase.

The `.parse()` creates a match object that stores the matched chunks.

A noun phrase can consist of an optional determiner `DT`, any number of adjectives `JJ`, and a noun `NN`.

A noun phrase can be chunked with the following chunk grammar: `chunk_grammar = "NP: {<VB.*><DT>?<JJ>*<NN><RB.?>?}"`

A noun phrase can only consist of a noun `NN` and an adjective `JJ`.

A noun phrase cannot be found with chunk filtering.

Practice what you've learned about Natural Language Parsing with Regular Expressions in this multiple choice quiz!

Perform a natural language parsing analysis with regular expressions to gain insights into Oscar Wilde's _The Picture of Dorian Gray_ or Homer's _The Iliad!_

Given to you in the code editor are text files for _The Picture of Dorian Gray_, named `dorian_gray.txt`, and _The Iliad_, named `the_iliad.txt`, sourced from <a href="https://www.gutenberg.org/" target="_blank" rel="noopener noreferrer">Project Gutenberg</a>. Import the text of your choosing, convert it to lowercase, and name it `text` using the following line of code.

```py
text = open("_______.txt",encoding='utf-8').read().lower()
```

Replace the blank with the name of the text file for the novel you choose to analyze!

With the text imported, now you need to split the text into individual sentences and then individual words. This allows you to perform a sentence-by-sentence parsing analysis!

Provided to you in the code editor is a customized function `word_sentence_tokenize()` that will sentence tokenize a text and then word tokenize each sentence, returning a list of word tokenized sentences. Call the function with `text` as an argument and save the result to a variable named `word_tokenized_text`.

Save any word tokenized sentence in `word_tokenized_text` to a variable named `single_word_tokenized_sentence`. Print `single_word_tokenized_sentence` as a check to visualize what you have done so far!

Next you can part-of-speech tag each sentence to allow for syntax parsing! Begin by creating a list named `pos_tagged_text` that will hold each part-of-speech tagged sentence from the novel.

Loop through each word tokenized sentence in `word_tokenized_text` and part-of-speech tag each sentence using `nltk`'s `pos_tag()` function. Append the result to `pos_tagged_text`.

Save any part-of-speech tagged sentence in `pos_tagged_text` to a variable named `single_pos_sentence`. Print `single_pos_sentence` as a check to visualize what you have done so far!

Now that you have part-of-speech tagged your text, you can move on to syntax parsing!

Begin by defining a piece of chunk grammar `np_chunk_grammar` that will chunk a noun phrase. Remember, a noun phrase consists of an optional determiner `DT`, followed by any number of adjectives `JJ`, followed by a noun `NN`.

Create an `nltk` `RegexpParser` object named `np_chunk_parser` using the noun phrase chunk grammar you defined as an argument.

Define a piece of chunk grammar named `vp_chunk_grammar` that will chunk a verb phrase of the following form: noun phrase, followed by a verb `VB`, followed by an optional adverb `RB`.

Create an `nltk` `RegexpParser` object named `vp_chunk_parser` using the verb phrase chunk grammar you defined as an argument.

Create two empty lists `np_chunked_text` and `vp_chunked_text` that will hold the chunked sentences from your text.

Loop through each part-of-speech tagged sentence in `pos_tagged_text` and noun phrase chunk each sentence using your `RegexpParser`'s `.parse()` method. Append the result to `np_chunked_text`.

Within the same loop you defined in the previous task, verb phrase chunk each part-of-speech tagged sentence using your `RegexpParser`'s `.parse()` method. Append the result to `vp_chunked_text`.

Now that you have chunked your novel, you can analyze the chunk frequencies to gain insights!

A function `np_chunk_counter()` that returns the `30` most common NP-chunks from a list of chunked sentences has been imported to the workspace for you. Call `np_chunk_counter()` with `np_chunked_text` as an argument and save the result to a variable named `most_common_np_chunks`. Print `most_common_np_chunks`. What sticks out to you about the most common noun phrase chunks? Are you surprised by anything? Open the hint to see our analysis.

Want to see how `np_chunk_counter()` works? Use the file navigator to open `chunk_counters.py` and inspect `np_chunk_counter()`.

A function `vp_chunk_counter()` that returns the `30` most common VP-chunks from a list of chunked sentences has been imported to the workspace for you. Call `vp_chunk_counter()` with `vp_chunked_text` as an argument and save the result to a variable named `most_common_vp_chunks`. Print `most_common_vp_chunks`. What sticks out to you about the most common verb phrase chunks? Are you surprised by anything? Open the hint to see our analysis.

Want to see how `vp_chunk_counter()` works? Use the file navigator to open `chunk_counters.py` and inspect `vp_chunk_counter()`.

Amazing! You have performed a syntax parsing analysis on a novel and gained insight into both the meaning of the text and how the author thinks, without reading a page!

Now's your chance to get creative. Is there a different pattern of parts-of-speech you want to identify and count in the novel you selected? Add a new piece of chunk grammar and repeat the process of chunking. What do you find?

Not the biggest fan of _The Picture of Dorian Gray_ or _The Iliad?_ No worries! Included in the file navigator is a blank text file named `my_text.txt`. Open the file and copy any text of your choice (novel, script, article, etc.) into the file. Save the file and then return to `script.py`. Update the opened text file to `my_text.txt` and rerun `script.py` to perform a syntax parsing analysis on your text! What insights or deeper meanings did you discover?

Discover Insights into Classic Texts

`(hamsters|mice) are cuter than (gerbils|guinea pigs)`

`\w are cuter than (gerbils|guinea pigs)`

Practice what you've learned about regular expressions with this multiple choice quiz!

Introduction to Regular Expressions

In this lesson you will learn the syntax of regular expressions and how you can utilize them to match and search text data!

When registering an account for a new social media app or completing an order for a gift online, nearly every piece of information you enter into a web form is validated. Did you enter a properly formatted email including an `@` symbol? Did you enter a phone number `10` digits long, with or without `-`s and parentheses? And then there's the king of them all — did your new password meet the seemingly growing number of requirements for inclusion (and exclusion) of symbols, digits, and both upper and lower case letters?

While correcting each field in our digital lives for proper format can be a pain, it's integral to ensuring that our accounts are secure, our packages are successfully delivered, and that we can be contacted by phone and email.

The technology that fuels this verification system on nearly every website and application is the ever reliable, often quirky language of **_regular expressions_**, commonly shortened to regex, as we will use here, or regexp (<a href="https://english.stackexchange.com/questions/94371/what-is-the-correct-pronunciation-of-regex" target="_blank" rel="noopener noreferrer">pronunciation is up for debate</a>). A _regular expression_ is a special sequence of characters that describes a pattern of text that should be found, or matched, in a string or document. By matching text, we can identify how often and where certain pieces of text occur, as well as have the opportunity to replace or update these pieces of text if needed.

Regular expressions have a variety of use cases including:
* validating user input in HTML forms
* verifying and parsing text in files, code and applications
* examining test results
* finding keywords in emails and web pages

While there are [a variety of implementations of regular expressions across platforms](https://en.wikipedia.org/wiki/Regular_expression#History), in this lesson you will learn the basics that apply everywhere. By the lesson's end, you'll be empowered to use them in your own projects (<a href="https://xkcd.com/208/" target="_blank" rel="noopener noreferrer">and become a regex superhero</a>)!


Humorous XKCD comic about regular expressions.

The simplest patterns we can write with regular expressions are **_literals_**. This is where our regular expression contains the exact text that we want to match. The regex `a`, for example, will match the text `a`, and the regex `bananas` will match the text `bananas`.

We can additionally match just part of a piece of text. Perhaps we are searching a document to see if the word `monkey` occurs, since we love monkeys. We could use the regex `monkey` to match `monkey` in the piece of text `The monkeys like to eat bananas.`.

Not only are we able to match alphabetical characters — digits work as well! The regex `3` will match the `3` in the piece of text `34`, and the regex `5 gibbons` will completely match the text `5 gibbons`!

Regular expressions operate by moving character by character, from left to right, through a piece of text. When the regular expression finds a character that matches the first piece of the expression, it looks to find a continuous sequence of matching characters.
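The literal examples above can be tried out in code. This lesson is not tied to any one language, so the use of Python's `re` module here is our assumption; `re.search()` finds a match anywhere in the text, and `re.fullmatch()` succeeds only when the regex matches the text completely.

```python
import re

# A literal regex can match just part of a longer piece of text.
monkey_match = re.search("monkey", "The monkeys like to eat bananas.")
monkey_text = monkey_match.group(0)
monkey_span = monkey_match.span()   # start and end index of the match

# Digits are matched literally too: the regex 3 matches the first character of 34.
digit_span = re.search("3", "34").span()

# fullmatch() checks that the regex matches the text from start to finish.
complete = re.fullmatch("5 gibbons", "5 gibbons") is not None
partial_only = re.fullmatch("3", "34") is None
```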

Literals

Do you love baboons and gorillas? You can find either of them with the same regular expression using **_alternation!_** Alternation, performed in regular expressions with the pipe symbol, `|`, allows us to match either the characters preceding the `|` OR the characters after the `|`. The regex `baboons|gorillas` will match `baboons` in the text `I love baboons`, but will also match `gorillas` in the text `I love gorillas`.

Are you thinking about how to match the whole piece of text `I love baboons` or `I love gorillas`? We will get to that later on!
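As a quick sketch of alternation, again assuming Python's `re` module, the same regex finds whichever animal is present. The third sentence about lemurs is made up for this example to show what happens when neither option occurs.

```python
import re

pattern = "baboons|gorillas"

# The regex matches whichever alternative appears in the text.
first = re.search(pattern, "I love baboons").group(0)
second = re.search(pattern, "I love gorillas").group(0)

# When neither alternative is present, search() returns None.
neither = re.search(pattern, "I love lemurs")
```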

Alternation

Spelling tests may seem like a distant memory from grade school, but we ultimately take them every day while typing. It's easy to make mistakes on commonly misspelled words like `consensus`, and on top of that, there are sometimes alternate spellings for the same word. 

**_Character sets_**, denoted by a pair of brackets `[]`, let us match one character from a series of characters, allowing for matches with incorrect or different spellings.

The regex `con[sc]en[sc]us` will match `consensus`, the correct spelling of the word, but also match the following three incorrect spellings: `concensus`, `consencus`, and `concencus`. The letters inside the first brackets, `s` and `c`, are the different possibilities for the character that comes after `con` and before `en`. Similarly for the second brackets, `s` and `c` are the different character possibilities to come after `en` and before `us`.

Note that a character set matches exactly one character: the regex `[cat]` will match the character `c`, `a`, _or_ `t`, but not the text `cat`.

The beauty of character sets (and alternation) is that they allow our regular expressions to become more flexible and less rigid than by just matching with literals!

We can make our character sets even more powerful with the help of the caret `^` symbol. Placed at the front of a character set, the `^` negates the set, matching any character that is _not_ stated. These are called _negated character sets_. Thus the regex `[^cat]` will match any character that is not `c`, `a`, _or_ `t`, and would completely match each character `d`, `o` _or_ `g`.
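Here is a small sketch of character sets and negated character sets, assuming Python's `re` module. The extra spelling `konsensus` and the test string `dogcat` are made up for this example.

```python
import re

# Each set accepts s or c, so four spellings match and "konsensus" does not.
spellings = ["consensus", "concensus", "consencus", "concencus", "konsensus"]
accepted = [word for word in spellings if re.fullmatch("con[sc]en[sc]us", word)]

# A character set matches ONE character, so it cannot match the whole text "cat".
set_vs_word = re.fullmatch("[cat]", "cat")
set_vs_char = re.fullmatch("[cat]", "a").group(0)

# The negated set matches any single character that is not c, a, or t.
not_cat = [char for char in "dogcat" if re.fullmatch("[^cat]", char)]
```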

Do we have a consensus that regular expressions are pretty cool?

Character Sets

Sometimes we don't care exactly WHAT characters are in a text, just that there are SOME characters. Enter the wildcard `.`! **_Wildcards_** will match any single character (letter, number, symbol or whitespace) in a piece of text. They are useful when we do not care about the specific value of a character, but only that a character exists!

Let's say we want to match any 9-character piece of text. The regex `.........` will completely match `orangutan` and `marsupial`! Similarly, the regex `I ate . bananas` will completely match both `I ate 3 bananas` and `I ate 8 bananas`!

What happens if we want to match an actual period, `.`? We can use the escape character, `\`, to escape the wildcard functionality of the `.` and match an actual period. The regex `Howler monkeys are really lazy\.` will completely match the text `Howler monkeys are really lazy.`.
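The sketch below tries the wildcard examples with Python's `re` module, which is our choice for illustration. The word `gibbon` and the text `lazy!` are added for this example to show a non-match and the effect of escaping.

```python
import re

# Nine wildcards completely match any nine-character text.
# (In Python, . skips newlines unless the re.DOTALL flag is passed.)
words = ["orangutan", "marsupial", "gibbon"]
nine_long = [word for word in words if re.fullmatch(".........", word)]

# A single wildcard stands in for any one character, such as a digit.
three = re.fullmatch("I ate . bananas", "I ate 3 bananas") is not None

# Escaped, the period only matches a literal period.
escaped = re.fullmatch(r"Howler monkeys are really lazy\.",
                       "Howler monkeys are really lazy.") is not None

# Unescaped . accepts the exclamation mark; escaped \. does not.
wildcard_accepts = re.fullmatch(r"lazy.", "lazy!") is not None
escaped_rejects = re.fullmatch(r"lazy\.", "lazy!") is None
```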

Wild for Wildcards

Character sets are great, but their true power isn't realized without ranges. **_Ranges_** allow us to specify a range of characters in which we can make a match without having to type out each individual character. The regex `[abc]`, which would match any character `a`, `b`, _or_ `c`, is equivalent to regex range `[a-c]`. The `-` character allows us to specify that we are interested in matching a range of characters.

The regex `I adopted [2-9] [b-h]ats` will match the text `I adopted 4 bats` as well as `I adopted 8 cats` and even `I adopted 5 hats`.

With ranges we can match any single capital letter with the regex `[A-Z]`, any lowercase letter with the regex `[a-z]`, and any digit with the regex `[0-9]`. We can even have multiple ranges in the same character set! To match any single capital _or_ lowercase alphabetical character, we can use the regex `[A-Za-z]`.

Remember, within any character set `[]` we only match _one_ character.
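Here is a small sketch of ranges in Python. The two non-matching sentences (`I adopted 1 bats` and `I adopted 4 rats`) are added here purely for contrast and are not from the lesson.

```python
import re

# [2-9] matches one digit from 2 to 9; [b-h] matches one letter from b to h.
adopted = re.compile(r"I adopted [2-9] [b-h]ats")

texts = [
    "I adopted 4 bats",
    "I adopted 8 cats",
    "I adopted 5 hats",
    "I adopted 1 bats",  # 1 is outside the range 2-9
    "I adopted 4 rats",  # r is outside the range b-h
]
range_results = [bool(adopted.fullmatch(t)) for t in texts]
print(range_results)  # [True, True, True, False, False]

# Multiple ranges in one set still match only ONE character.
letter = re.compile(r"[A-Za-z]")
letter_results = [bool(letter.fullmatch(s)) for s in ["q", "Q", "7", "qQ"]]
print(letter_results)  # [True, True, False, False]
```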

Ranges

While character ranges are extremely useful, they can be cumbersome to write out every single time you want to match common ranges such as those that designate alphabetical characters or digits. To alleviate this pain, there are **_shorthand character classes_** that represent common ranges, and they make writing regular expressions much simpler. These shorthand classes include:
* `\w`: the "word character" class represents the regex range `[A-Za-z0-9_]`, and it matches a single uppercase character, lowercase character, digit or underscore
* `\d`: the "digit character" class represents the regex range `[0-9]`, and it matches a single digit character
* `\s`: the "whitespace character" class represents the regex range `[ \t\r\n\f\v]`, matching a single space, tab, carriage return, line break, form feed, or vertical tab

For example, the regex `\d\s\w\w\w\w\w\w\w` matches a digit character, followed by a whitespace character, followed by 7 word characters. Thus the regex completely matches the text `3 monkeys`.

In addition to the shorthand character classes `\w`, `\d`, and `\s`, we also have access to the _negated shorthand character classes_! These shorthands will match any character that is NOT in the regular shorthand classes. These negated shorthand classes include:
* `\W`: the "non-word character" class represents the regex range `[^A-Za-z0-9_]`, matching any character that is not included in the range represented by `\w`
* `\D`: the "non-digit character" class represents the regex range `[^0-9]`, matching any character that is not included in the range represented by `\d`
* `\S`:  the "non-whitespace character" class represents the regex range `[^ \t\r\n\f\v]`, matching any character that is not included in the range represented by `\s`
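The shorthand classes and their negated counterparts can be tried out with a sketch like this one. The sample string `3 monkeys!` (with an added exclamation mark) is an assumption chosen so that each negated class has something to find.

```python
import re

# A digit, a whitespace character, then seven word characters.
shorthand = re.compile(r"\d\s\w\w\w\w\w\w\w")
shorthand_match = bool(shorthand.fullmatch("3 monkeys"))
print(shorthand_match)  # True

sample = "3 monkeys!"

# \W finds everything that is not a word character.
non_word = re.findall(r"\W", sample)
print(non_word)  # [' ', '!']

# \D finds everything that is not a digit.
non_digit = "".join(re.findall(r"\D", sample))
print(repr(non_digit))  # ' monkeys!'

# \S finds everything that is not whitespace.
non_space = "".join(re.findall(r"\S", sample))
print(non_space)  # 3monkeys!
```

One thing to keep in mind: with ordinary Python 3 string patterns, `\w`, `\d`, and `\s` also match the corresponding Unicode characters. Passing the `re.ASCII` flag restricts them to the ASCII ranges listed above.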

Shorthand Character Classes

Remember when we were in love with baboons and gorillas a few exercises ago? We were able to match either `baboons` or `gorillas` using the regex `baboons|gorillas`, taking advantage of the `|` symbol.

But what if we want to match the whole piece of text `I love baboons` and `I love gorillas` with the same regex? Your first guess might be to use the regex `I love baboons|gorillas`. This regex, while it would completely match the string `I love baboons`, would not match `I love gorillas`, and would instead match `gorillas`. This is because the `|` symbol matches the _entire_ expression before or after itself.

Grouping to the rescue! **_Grouping_**, denoted with the open parenthesis `(` and the closing parenthesis `)`, lets us group parts of a regular expression together, and allows us to limit alternation to part of the regex.

The regex `I love (baboons|gorillas)` will match the text `I love ` and _then_ match either `baboons` or `gorillas`, as the grouping limits the reach of the `|` to the text within the parentheses.

These groups are also called _capture groups_, as they have the power to select, or capture, a substring from our matched text.
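The difference between the ungrouped and grouped regexes can be sketched as follows. The variable names are illustrative; `group(1)` returns the text captured by the first pair of parentheses.

```python
import re

ungrouped = re.compile(r"I love baboons|gorillas")
grouped = re.compile(r"I love (baboons|gorillas)")

text = "I love gorillas"

# Without grouping, the alternatives are "I love baboons" and "gorillas",
# so the whole sentence is not matched, only the word "gorillas" inside it.
ungrouped_full = ungrouped.fullmatch(text)
ungrouped_found = ungrouped.search(text).group()
print(ungrouped_full)   # None
print(ungrouped_found)  # gorillas

# With grouping, the | only applies inside the parentheses.
grouped_full = grouped.fullmatch(text)
print(grouped_full.group())   # I love gorillas

# The capture group selects the substring matched inside the parentheses.
captured = grouped_full.group(1)
print(captured)  # gorillas
```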

Grouping

Here's where things start to get really interesting. So far we have only matched text on a character by character basis. But instead of writing the regex `\w\w\w\w\w\w\s\w\w\w\w\w\w`, which would match 6 word characters, followed by a whitespace character, and then followed by 6 more word characters, such as in the text `rhesus monkey`, is there a better way to denote the quantity of characters we want to match?

The answer is yes, with the help of quantifiers! **_Fixed quantifiers_**, denoted with curly braces `{}`, let us indicate the exact quantity of a character we wish to match, or allow us to provide a quantity range to match on.
* `\w{3}` will match _exactly_ 3 word characters
* `\w{4,7}` will match _at minimum_ 4 word characters and _at maximum_ 7 word characters

The regex `roa{3}r` will match the characters `ro` followed by `3` `a`s, and then the character `r`, such as in the text `roaaar`. The regex `roa{3,7}r` will match the characters `ro` followed by _at least_ `3` `a`s and _at most_ `7` `a`s, followed by an `r`, matching the strings `roaaar`, `roaaaaar` and `roaaaaaaar`.

An important note is that quantifiers are considered to be _greedy_. This means that they will match the greatest quantity of characters they possibly can. For example, the regex `mo{2,4}` will match the text `moooo` in the string `moooo`, and not return a match of `moo`, or `mooo`. This is because the fixed quantifier wants to match the largest number of `o`s possible, which is `4` in the string `moooo`.
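A brief sketch of fixed quantifiers and their greedy behavior in Python follows. The test strings are built programmatically here (an illustrative choice) so the number of `a`s is easy to see.

```python
import re

# \w{6}\s\w{6} is the compact form of the long character-by-character regex.
compact = re.compile(r"\w{6}\s\w{6}")
compact_match = bool(compact.fullmatch("rhesus monkey"))
print(compact_match)  # True

# Exactly three a's.
roar = re.compile(r"roa{3}r")
exact_results = [bool(roar.fullmatch(s)) for s in ["roaar", "roaaar", "roaaaar"]]
print(exact_results)  # [False, True, False]

# Between three and seven a's: try 2 through 8 a's.
roar_range = re.compile(r"roa{3,7}r")
range_results = {n: bool(roar_range.fullmatch("ro" + "a" * n + "r")) for n in range(2, 9)}
print(range_results)
# {2: False, 3: True, 4: True, 5: True, 6: True, 7: True, 8: False}

# Greedy: the quantifier takes as many o's as it is allowed to.
greedy = re.search(r"mo{2,4}", "moooo").group()
print(greedy)  # moooo
```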

Quantifiers - Fixed

You are working on a research project that summarizes the findings of primate behavioral scientists from around the world. Of particular interest to you are the scientists' observations of humor in chimpanzees, so you whip up some regex to find all occurrences of the word `humor` in the documents you have collected. To your dismay, your regex misses the observations of amusement written by scientists hailing from British English speaking countries, where the spelling of the word is `humour`. Optional quantifiers to the rescue!

**_Optional quantifiers_**, indicated by the question mark `?`, allow us to indicate a character in a regex is optional, or can appear either `0` times or `1` time. For example, the regex `humou?r` matches the characters `humo`, then either `0` occurrences or `1` occurrence of the letter `u`, and finally the letter `r`. Note the `?` _only_ applies to the character directly before it.

With all quantifiers, we can take advantage of grouping to make even more advanced regexes. The regex `The monkey ate a (rotten )?banana` will completely match both `The monkey ate a rotten banana` and `The monkey ate a banana`.

Since the `?` is a metacharacter, you need to use the escape character in your regex in order to match a question mark `?` in a piece of text. The regex `Aren't owl monkeys beautiful\?` will thus completely match the text `Aren't owl monkeys beautiful?`.
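The optional quantifier examples can be verified with a sketch like the one below. The misspelling `humouur` is added here only to show that `?` allows at most one `u`.

```python
import re

# The u is optional: zero or one occurrence.
humor = re.compile(r"humou?r")
humor_results = [bool(humor.fullmatch(w)) for w in ["humor", "humour", "humouur"]]
print(humor_results)  # [True, True, False]

# Applied to a group, ? makes the whole group optional.
banana = re.compile(r"The monkey ate a (rotten )?banana")
banana_results = [
    bool(banana.fullmatch("The monkey ate a rotten banana")),
    bool(banana.fullmatch("The monkey ate a banana")),
]
print(banana_results)  # [True, True]

# An escaped \? matches a literal question mark.
question = re.compile(r"Aren't owl monkeys beautiful\?")
question_match = bool(question.fullmatch("Aren't owl monkeys beautiful?"))
print(question_match)  # True
```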

Quantifiers - Optional

In 1951, mathematician Stephen Cole Kleene developed a system to match patterns in written language with mathematical notation. This notation is now known as regular expressions!

In his honor, the next piece of regular expressions syntax we will learn is known as the Kleene star. The **_Kleene star_**, denoted with the asterisk `*`, is also a quantifier, and matches the preceding character `0` or more times. This means that the character doesn't need to appear, can appear once, or can appear many many times.

The regex `meo*w` will match the characters `me`, followed by `0` or more `o`s, followed by a `w`. Thus the regex will match `mew`, `meow`, `meooow`, and `meoooooooooooow`.

Another useful quantifier is the **_Kleene plus_**, denoted by the plus `+`, which matches the preceding character `1` or more times. 

The regex `meo+w` will match the characters `me`, followed by `1` or more `o`s, followed by a `w`. Thus the regex will match `meow`, `meooow`, and `meoooooooooooow`, but not match  `mew`.

Like all the other metacharacters, in order to match the symbols `*` and `+`, you need to use the escape character in your regex. The regex `My cat is a \*` will completely match the text `My cat is a *`.
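To compare the Kleene star and the Kleene plus side by side, here is a minimal sketch that runs both regexes over the same list of words. The variable names are illustrative.

```python
import re

star = re.compile(r"meo*w")  # zero or more o's
plus = re.compile(r"meo+w")  # one or more o's

words = ["mew", "meow", "meooow", "meoooooooooooow"]

star_matches = [w for w in words if star.fullmatch(w)]
plus_matches = [w for w in words if plus.fullmatch(w)]

print(star_matches)  # ['mew', 'meow', 'meooow', 'meoooooooooooow']
print(plus_matches)  # ['meow', 'meooow', 'meoooooooooooow']

# An escaped \* matches a literal asterisk instead of acting as a quantifier.
literal_star = bool(re.fullmatch(r"My cat is a \*", "My cat is a *"))
print(literal_star)  # True
```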

Quantifiers - 0 or More, 1 or More

When writing regular expressions, it's useful to make the expression as specific as possible in order to ensure that we do not match unintended text. To aid in this mission of specificity, we can use the anchor metacharacters. The **_anchors_** hat `^` and dollar sign `$` are used to match text at the start and the end of a string, respectively.

The regex `^Monkeys: my mortal enemy$` will completely match the text `Monkeys: my mortal enemy` but not match `Spider Monkeys: my mortal enemy in the wild` or `Squirrel Monkeys: my mortal enemy in the wild`. The `^` ensures that the matched text begins with `Monkeys`, and the `$` ensures the matched text ends with `enemy`.

Without the anchors, the regex `Monkeys: my mortal enemy` will match the text `Monkeys: my mortal enemy` in both `Spider Monkeys: my mortal enemy in the wild` and `Squirrel Monkeys: my mortal enemy in the wild`.

Once again, as with all other metacharacters, in order to match the symbols `^` and `$`, you need to use the escape character in your regex. The regex `My spider monkey has \$10\^6 in the bank` will completely match the text `My spider monkey has $10^6 in the bank`.
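The effect of the anchors can be sketched in Python with `search()`, which looks for the pattern anywhere in a string. With the anchors in place, a match is only possible when the pattern spans the string from start to end.

```python
import re

anchored = re.compile(r"^Monkeys: my mortal enemy$")
unanchored = re.compile(r"Monkeys: my mortal enemy")

texts = [
    "Monkeys: my mortal enemy",
    "Spider Monkeys: my mortal enemy in the wild",
    "Squirrel Monkeys: my mortal enemy in the wild",
]

# search() scans the whole string, so the anchors do the restricting.
anchored_results = [bool(anchored.search(t)) for t in texts]
unanchored_results = [bool(unanchored.search(t)) for t in texts]

print(anchored_results)    # [True, False, False]
print(unanchored_results)  # [True, True, True]

# Escaped \$ and \^ match the literal symbols.
bank = re.compile(r"My spider monkey has \$10\^6 in the bank")
bank_match = bool(bank.fullmatch("My spider monkey has $10^6 in the bank"))
print(bank_match)  # True
```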


Anchors

Do you feel those regular expression superpowers coursing through your body? Do you just want to scream `ah+` really loud? Awesome! You are now ready to take these skills and use them out in the wild. Before beginning your adventures, let's review what we've learned.
* _Regular expressions_ are special sequences of characters that describe a pattern of text that is to be matched
* We can use _literals_ to match the exact characters that we desire
* _Alternation_, using the pipe symbol `|`, allows us to match the text preceding or following the `|`
* _Character sets_, denoted by a pair of brackets `[]`, let us match one character from a series of characters
* _Wildcards_, represented by the period or dot `.`, will match any single character (letter, number, symbol or whitespace)
* _Ranges_ allow us to specify a range of characters in which we can make a match
* _Shorthand character classes_ like `\w`, `\d` and `\s` represent the ranges for word characters, digit characters, and whitespace characters, respectively
* _Groupings_, denoted with parentheses `()`, group parts of a regular expression together, and allow us to limit alternation to part of a regex
* _Fixed quantifiers_, represented with curly braces `{}`, let us indicate the exact quantity or a range of quantities of a character we wish to match
* _Optional quantifiers_, indicated by the question mark `?`, allow us to indicate a character in a regex is optional, or can appear either `0` times or `1` time
* The _Kleene star_, denoted with the asterisk `*`, is a quantifier that matches the preceding character `0` or more times
* The _Kleene plus_, denoted by the plus `+`, matches the preceding character `1` or more times
* The _anchor_ symbols hat `^` and dollar sign `$` are used to match text at the start and end of a string, respectively

Have you ever wondered how computers break down language? This course introduces text parsing techniques using regular expressions (regex) and Part-of-Speech (POS) Tagging with NLTK.



Get a taste of regular expressions (regex), a powerful search pattern language to quickly find the text you're looking for.

Apply regular expressions (regex) and other natural language parsing tactics to find meaning and insights in the texts you read every day.