Delve into the exciting world of NLP with this overview of major topics in the field.

Look at the technologies around us:
- Spellcheck and autocorrect
- Auto-generated video captions
- Virtual assistants like Amazon's Alexa
- Autocomplete
- Your news site's suggested articles

What do they have in common?

All of these handy technologies exist because of **_natural language processing!_** Also known as **_NLP_**, the field is at the intersection of linguistics, artificial intelligence, and computer science. The goal? Enabling computers to interpret, analyze, and approximate the generation of human languages (like English or Spanish).

NLP got its start around 1950 with Alan Turing's test for artificial intelligence, which evaluates whether a computer can use language to fool humans into believing it's human.

But approximating human speech is only one of a wide range of applications for NLP! Applications from detecting spam emails or bias in tweets to improving accessibility for people with disabilities all rely heavily on natural language processing techniques.

NLP can be conducted in several programming languages. However, Python has some of the most extensive open-source NLP libraries, including the <a href="https://www.nltk.org/" target="_blank">Natural Language Toolkit</a> or **_NLTK_**. Because of this, you'll be using Python to get your first taste of NLP. 



Intro to NLP

<blockquote>"You never know what you have... until you clean your data."</blockquote>
<p style="text-align: right">~ Unknown (or possibly made up)</p>

Cleaning and preparation are crucial for many tasks, and NLP is no exception. **_Text preprocessing_** is usually the first step you'll take when faced with an NLP task. 

Without preprocessing, your computer interprets `"the"`, `"The"`, and `"<p>The"` as entirely different words. There is a LOT you can do here, depending on the formatting you need. Lucky for you, <a href="https://en.wikipedia.org/wiki/Regular_expression" target="_blank">Regex</a> and NLTK will do most of it for you! Common tasks include:

**Noise removal** — stripping text of formatting (e.g., HTML tags).

**Tokenization** — breaking text into individual words.

**Normalization** — cleaning text data in any other way:
  - **Stemming** is a blunt axe to chop off word prefixes and suffixes. "booing" and "booed" become "boo", but "computer" may become "comput" and "are" would remain "are."
  - **Lemmatization** is a scalpel to bring words down to their root forms. For example, NLTK's savvy lemmatizer knows "am" and "are" are related to "be."
  - Other common tasks include lowercasing, <a href="https://en.wikipedia.org/wiki/Stop_words" target="_blank">stopwords</a> removal, spelling correction, etc.
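The steps above can be sketched with nothing but Python's `re` module. This is a minimal illustration, not NLTK's implementation: `remove_noise`, `tokenize`, and `crude_stem` are hypothetical helper names, and `crude_stem` is a toy suffix chopper standing in for a real stemmer such as NLTK's.

```python
import re

def remove_noise(text):
    """Noise removal: strip HTML-style tags, e.g. '<p>The' becomes 'The'."""
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text):
    """Tokenization plus lowercasing: return a list of lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def crude_stem(word):
    """Toy stemmer (an assumption for illustration, not the Porter algorithm):
    chop one common suffix if at least three letters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

raw = "<p>The squids jumped out of the suitcases.</p>"
tokens = tokenize(remove_noise(raw))
stems = [crude_stem(token) for token in tokens]

print(tokens)
print(stems)
```

Note how the toy stemmer turns "suitcases" into "suitcas": that is the blunt axe at work, and it is the kind of result a lemmatizer is meant to avoid.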



Text Preprocessing

You now have a preprocessed, clean list of words. Now what? It may be helpful to know how the words relate to each other and the underlying syntax (grammar). **_Parsing_** is an NLP process concerned with segmenting text based on syntax.

You probably do not want to be doing any parsing by hand, and NLTK has a few tricks up its sleeve to help you out: 

**_Part-of-speech tagging (POS tagging)_** identifies parts of speech (verbs, nouns, adjectives, etc.). NLTK can do it faster (and maybe more accurately) than your grammar teacher.

**_Named entity recognition (NER)_** helps identify the proper nouns (e.g., "Natalia" or "Berlin") in a text. This can be a clue as to the topic of the text and NLTK captures many for you. 

**_Dependency grammar_** trees help you understand the relationship between the words in a sentence. Building these trees can be a tedious task for a human, so the Python library spaCy is at your service, even if it isn't always perfect. 

In English we leave a lot of ambiguity, so syntax can be tough, even for a computer program. Take a look at the following sentence:

<img src="https://content.codecademy.com/courses/NLP/parsing_syntactic_ambiguity.gif" alt="I saw a cow under a tree with binoculars." width="100%"/>

Do I have the binoculars? Does the cow have binoculars? Does the tree have binoculars?

**_Regex parsing_**, using Python's `re` library, allows for a bit more nuance. When coupled with POS tagging, you can identify specific phrase chunks. On its own, it can find you addresses, emails, and many other common patterns within large chunks of text.
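As a rough sketch of regex parsing on its own, the snippet below pulls email addresses out of a string with `re`. The pattern is a deliberately simple assumption and will not cover every valid address; the sample text is made up for illustration.

```python
import re

# Simplified email pattern (an assumption, not a full address validator):
# a local part, an "@", then a domain with at least one dot.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

text = "Write to natalia@example.com or berlin.office@example.org for details."
emails = EMAIL_PATTERN.findall(text)

print(emails)
```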



Parsing Text

How can we help a machine make sense of a bunch of word tokens? We can help computers make predictions about language by training a language model on a _corpus_ (a bunch of example text).

**_Language models_** are probabilistic computer models of language. We build and use these models to figure out the likelihood that a given sound, letter, word, or phrase will be used. Once a model has been trained, it can be tested out on new texts.

One of the most common language models is the unigram model, a statistical language model commonly known as **_bag-of-words_**. As its name suggests, bag-of-words does not have much order to its chaos! What it does have is a tally count of each instance for each word. Consider the following text example: 

<img src="https://content.codecademy.com/courses/NLP/bag-of-words.gif" alt="The squids jumped out of the suitcases." width="100%"/>

Provided some initial preprocessing, bag-of-words would result in a mapping like:
```py
{"the": 2, "squid": 1, "jump": 1, "out": 1, "of": 1, "suitcase": 1}
```
Now look at this sentence and mapping: "Why are your suitcases full of jumping squids?"
```py
{"why": 1, "be": 1, "your": 1, "suitcase": 1, "full": 1, "of": 1, "jump": 1, "squid": 1}
```
You can see how even with different word order and sentence structures, "jump," "squid," and "suitcase" are shared topics between the two examples. Bag-of-words can be an excellent way of looking at language when you want to make predictions concerning topic or sentiment of a text. When grammar and word order are irrelevant, this is probably a good model to use.  
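Here is one way the two mappings above could be built, assuming the text has already been preprocessed into lowercase, lemmatized tokens (the token lists are written out by hand here). `collections.Counter` does the tallying.

```python
from collections import Counter

# Tokens assumed to be already lowercased and lemmatized.
tokens_1 = ["the", "squid", "jump", "out", "of", "the", "suitcase"]
tokens_2 = ["why", "be", "your", "suitcase", "full", "of", "jump", "squid"]

bag_1 = Counter(tokens_1)
bag_2 = Counter(tokens_2)

# Words that appear in both bags, regardless of word order.
shared = sorted(set(bag_1) & set(bag_2))

print(dict(bag_1))
print(shared)
```

The shared words include "of" alongside "jump," "squid," and "suitcase," which hints at why stopword removal is often part of preprocessing.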



Language Models: Bag-of-Words

For parsing entire phrases or conducting language prediction, you will want to use a model that pays attention to each word's neighbors. Unlike bag-of-words, the **_n-gram_** model considers sequences of _n_ units (words or characters) and calculates the probability of each unit in a body of language given the _n_ - 1 units that precede it. Because of this, _n_-gram probabilities with larger _n_ values can be impressive at language prediction. 

Take a look at our revised squid example: "The squids jumped out of the suitcases. The squids were furious."

A bigram model (where _n_ is 2) might give us the following count frequencies:
```py
{('', 'the'): 2, ('the', 'squids'): 2, ('squids', 'jumped'): 1, ('jumped', 'out'): 1, ('out', 'of'): 1, ('of', 'the'): 1, ('the', 'suitcases'): 1, ('suitcases', ''): 1, ('squids', 'were'): 1, ('were', 'furious'): 1, ('furious', ''): 1}
```
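The bigram counts above could be produced along these lines. This is a sketch under simple assumptions: sentences are split on `.`, `!`, and `?`, words are lowercased, and an empty string marks the start and end of each sentence. `bigram_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs, padding each sentence with '' at both ends."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        if not words:
            continue  # skip empty pieces left over from splitting
        padded = [""] + words + [""]
        counts.update(zip(padded, padded[1:]))
    return counts

counts = bigram_counts("The squids jumped out of the suitcases. The squids were furious.")
print(counts.most_common(2))
```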
There are a couple of problems with the _n_-gram model:

1. How can your language model make sense of the sentence "The cat fell asleep in the mailbox" if it's never seen the word "mailbox" before? Once training is over, your model will probably come across words in test data that it never encountered during training (this issue also pertains to bag-of-words). A tactic known as _language smoothing_ can help adjust probabilities for unknown words, but it isn't always ideal. 

2. For a model that more accurately predicts human language patterns, you want _n_ (your sequence length) to be as large as possible. That way, you will have more natural sounding language, right? Well, as the sequence length grows, the number of examples of each sequence within your training corpus shrinks. With too few examples, you won't have enough data to make many predictions. 
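To make the first problem concrete, here is a sketch of one simple smoothing technique, add-one (Laplace) smoothing, chosen here as an illustration rather than as the only option. It pretends every possible bigram was seen one extra time, so an unseen word like "mailbox" gets a small non-zero probability. The function name and the tiny counts are assumptions, and the vocabulary is assumed to already contain the unseen word.

```python
from collections import Counter

def smoothed_bigram_probability(prev_word, word, bigram_counts, vocabulary):
    """Add-one estimate of P(word | prev_word)."""
    # Total number of bigrams in training that start with prev_word.
    prev_total = sum(
        count for (first, _), count in bigram_counts.items() if first == prev_word
    )
    seen = bigram_counts.get((prev_word, word), 0)
    return (seen + 1) / (prev_total + len(vocabulary))

# Made-up training counts; "mailbox" never follows "the" in training.
bigram_counts = Counter({("the", "squids"): 2, ("the", "suitcases"): 1})
vocabulary = {"the", "squids", "suitcases", "mailbox"}

p_seen = smoothed_bigram_probability("the", "squids", bigram_counts, vocabulary)
p_unseen = smoothed_bigram_probability("the", "mailbox", bigram_counts, vocabulary)

print(p_seen, p_unseen)
```

With these counts, the seen bigram gets 3/7 and the unseen one gets 1/7 instead of zero. The cost is that probability is taken away from sequences the model actually observed, which is one reason smoothing isn't always ideal.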

Enter **_neural language models (NLMs)!_** Much recent work within NLP has involved developing and training neural networks to approximate the approach our human brains take towards language. This deep learning approach allows computers a much more adaptive tack to processing human language. Common NLMs include LSTMs and transformer models.


Language Models: N-Gram and NLM

We've touched on the idea of finding topics within a body of language. But what if the text is long and the topics aren't obvious? 

**_Topic modeling_** is an area of NLP dedicated to uncovering latent, or hidden, topics within a body of language. For example, one Codecademy curriculum developer <a href="https://news.codecademy.com/taylor-swift-lyrics-machine-learning/" target="_blank">used topic modeling to discover patterns within Taylor Swift songs</a> related to love and heartbreak over time.

A common technique is to deprioritize the most common words and prioritize less frequently used terms as topics in a process known as **_term frequency-inverse document frequency (tf-idf)_**. Say what?! This may sound counter-intuitive at first. Why would you want to give more priority to less-used words? Well, when you're working with a lot of text, it makes a bit of sense if you don't want your topics filled with words like "the" and "is." The Python libraries `gensim` and `sklearn` have modules to handle tf-idf.
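A bare-bones version of the idea can be written with the standard library. This sketch assumes one common formulation (term frequency times the natural log of total documents over documents containing the term); libraries such as `sklearn` apply their own smoothing and normalization, so their numbers will differ. The documents are made-up token lists.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document (each document is a token list)."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return scores

docs = [
    ["the", "squid", "jump", "out", "of", "the", "suitcase"],
    ["the", "squid", "be", "furious"],
    ["the", "cat", "fall", "asleep"],
]
scores = tf_idf(docs)
print(scores[0])
```

Because "the" appears in every document, its score drops to zero, while a word unique to one document, like "suitcase", scores highest there.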

Whether you use your plain bag of words (which will give you term frequency) or run it through tf-idf, the next step in your topic modeling journey is often **_latent Dirichlet allocation (LDA)_**. LDA is a statistical model that takes your documents and determines which words keep popping up together in the same contexts (i.e., documents). We'll use `sklearn` to tackle this for us.

If you have any interest in visualizing your newly minted topics, **_word2vec_** is a great technique to have up your sleeve. word2vec can map out your topic model results spatially as vectors so that similarly used words are closer together. In the case of a language sample consisting of "The squids jumped out of the suitcases. The squids were furious. Why are your suitcases full of jumping squids?", we might see that "suitcase", "jump", and "squid" were words used within similar contexts. This word-to-vector mapping is known as a _word embedding_.



Topic Models

Most of us have a good autocorrect story. Our phone's messenger quietly swaps one letter for another as we type and suddenly the meaning of our message has changed (to our horror or pleasure). Amusing as these mishaps are, addressing **_text similarity_**, including spelling correction, is a major challenge within natural language processing. 

Addressing word similarity and misspelling for spellcheck or autocorrect often involves considering the **_Levenshtein distance_** or minimal edit distance between two words. The distance is calculated through the minimum number of insertions, deletions, and substitutions that would need to occur for one word to become another. For example, turning "bees" into "beans" would require one substitution ("a" for "e") and one insertion ("n"), so the Levenshtein distance would be two.
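The calculation can be sketched with a standard dynamic-programming approach that fills in the edit distance row by row. The function name `levenshtein` is just a label for this sketch; NLTK ships its own `edit_distance` function if you would rather not write one.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn source into target."""
    # previous[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("bees", "beans"))  # 2
```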

More advanced autocorrect and spelling correction technology additionally considers key distance on a keyboard and **_phonetic similarity_** (how much two words or phrases sound the same). Phonetic similarity is also a major challenge within speech recognition. English-speaking humans can easily tell from context whether someone said "euthanasia" or "youth in Asia," but it's a far more challenging task for a machine! 

It's also helpful to find out if texts are the same to guard against plagiarism, which we can identify through **_lexical similarity_** (the degree to which texts use the same vocabulary and phrases). Meanwhile,  **_semantic similarity_** (the degree to which documents contain similar meaning or topics) is useful when you want to find (or recommend) an article or book similar to one you recently finished. 



Text Similarity

How does your favorite search engine complete your search queries? How does your phone's keyboard know what you want to type next? **_Language prediction_** is an application of NLP concerned with predicting text given preceding text. Autosuggest, autocomplete, and suggested replies are common forms of language prediction.

Your first step to language prediction is picking a language model. Bag of words alone is generally not a great model for language prediction; no matter what the preceding word was, you will just get one of the most commonly used words from your training corpus. 

If you go the _n_-gram route, you will most likely rely on **_Markov chains_** to predict the statistical likelihood of each following word (or character) based on the training corpus. Markov chains are memory-less and make statistical predictions based entirely on the current _n_-gram on hand. 

For example, let's take a sentence beginning, "I ate so many grilled cheese". Using a trigram model (where _n_ is 3), a Markov chain would predict the following word as "sandwiches" based on the number of times the sequence "grilled cheese sandwiches" has appeared in the training data out of all the times "grilled cheese" has appeared in the training data.
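The trigram prediction described above might look like the following sketch. The training corpus is a made-up sentence, and `train_trigram_model` and `predict_next` are hypothetical helper names. The model only ever looks at the current pair of words, which is the memory-less property in action.

```python
from collections import Counter, defaultdict

def train_trigram_model(tokens):
    """Map each pair of consecutive words to a tally of the words that follow it."""
    model = defaultdict(Counter)
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        model[(first, second)][third] += 1
    return model

def predict_next(model, first, second):
    """Return (most likely next word, estimated probability), or None if unseen."""
    followers = model.get((first, second))
    if not followers:
        return None
    word, count = followers.most_common(1)[0]
    return word, count / sum(followers.values())

# Made-up training text for illustration.
corpus = (
    "i love grilled cheese sandwiches and grilled cheese sandwiches "
    "love me but grilled cheese alone is fine"
).split()

model = train_trigram_model(corpus)
prediction = predict_next(model, "grilled", "cheese")
print(prediction)
```

In this corpus "grilled cheese" appears three times and is followed by "sandwiches" twice, so the predicted word is "sandwiches" with an estimated probability of 2/3.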

A more advanced approach, using a neural language model, is the **_long short-term memory (LSTM)_** model. LSTMs use deep learning with a network of artificial "cells" that manage memory, making them better suited for text prediction than traditional neural networks.



Language Prediction & Text Generation

Believe it or not, you've just scratched the surface of natural language processing. There are a slew of advanced topics and applications of NLP, many of which rely on deep learning and neural networks.

- **_Naive Bayes classifiers_** are supervised machine learning algorithms that leverage a probabilistic theorem to make predictions and classifications. They are widely used for sentiment analysis (determining whether a given block of language expresses negative or positive feelings) and spam filtering.

- We've made enormous gains in **_machine translation_**, but even the most advanced translation software using neural networks and LSTM still has far to go in accurately translating between languages. 

- Some of the most life-altering applications of NLP are focused on improving **_language accessibility_** for people with disabilities. Text-to-speech functionality and speech recognition have improved rapidly thanks to neural language models, making digital spaces far more accessible places.

- NLP can also be used to detect bias in writing and speech. Feel like a political candidate, book, or news source is biased but can't put your finger on exactly how? Natural language processing <a href="https://medium.com/agatha-codes/a-bossy-sort-of-voice-3c3a18de3093" target="_blank">can help you identify the language at issue</a>.
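To give a feel for the first item in the list above, here is a tiny Naive Bayes sentiment classifier built from scratch. Treat it as a sketch under stated assumptions: the four training reviews are invented, add-one smoothing is used for word probabilities, and words never seen in training are simply skipped. Real projects typically reach for a library implementation instead.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Tally documents per label and words per label from (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def classify(tokens, label_counts, word_counts, vocab):
    """Pick the label with the highest log-probability under Bayes' theorem."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # prior for this label
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothed likelihood of the word given the label.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training data for illustration only.
training = [
    (["great", "fun", "movie"], "positive"),
    (["loved", "it", "great"], "positive"),
    (["terrible", "boring", "movie"], "negative"),
    (["hated", "it", "boring"], "negative"),
]
model = train_naive_bayes(training)

print(classify(["great", "movie"], *model))
print(classify(["boring", "it"], *model))
```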



Advanced NLP Topics


As you've seen, there is a vast array of applications for NLP. However, as they say, "with great language processing comes great responsibility" (or something along those lines). When working with NLP, we have several important considerations to take into account:
- Different NLP tasks may be more or less difficult in different languages. Because so many NLP tools are built by and for English speakers, these tools may lag behind in processing other languages. The tools may also be programmed with cultural and linguistic biases specific to English speakers.
- What if your Amazon Alexa could only understand wealthy men from coastal areas of the United States? English itself is not a homogeneous body. English varies by person, by dialect, and by many sociolinguistic factors. When we build and train NLP tools, are we only building them for one type of English speaker?
- You can have the best intentions and still inadvertently program a bigoted tool. While NLP can limit bias, it can also propagate bias. As an NLP developer, it's important to consider biases, both within your code and within the training corpus. A machine will learn the same biases you teach it, whether intentionally or unintentionally. 
- As you become someone who builds tools with natural language processing, it's vital to take into account your users' privacy. There are many powerful NLP tools that come head-to-head with privacy concerns. Who is collecting your data? How much data is being collected and what do those companies plan to do with your data? 



Challenges and Considerations

Check out how much you've learned about natural language processing!
- Natural language processing combines computer science, linguistics, and artificial intelligence to enable computers to process human languages.
- NLTK is a Python library used for NLP.
- Text preprocessing is a stage of NLP focused on cleaning and preparing text for other NLP tasks.
- Parsing is an NLP technique concerned with breaking up text based on syntax.
- Language models are probabilistic machine models of language used for NLP comprehension tasks. Common models include bag-of-words, _n_-gram models, and neural language modeling.
- Topic modeling is the NLP process by which hidden topics are identified given a body of text.
- Text similarity is a facet of NLP concerned with semblance between instances of language.
- Language prediction is an application of NLP concerned with predicting language given preceding language.
- There are many social and ethical considerations to take into account when designing NLP tools.

NLP Review

Getting Started with Natural Language Processing

Delve into the exciting world of Natural Language Processing (NLP) with this overview of major topics in the field.

UPDATE: We've moved NLP content to
https://www.codecademy.com/learn/paths/natural-language-processing

In this book, you will learn how a biased set of search algorithms privilege whiteness and discriminate against people of color, specifically women of color. This is helpful if you want to think about how to mitigate the impact of biased algorithms in your data science work.

Link to the book _Algorithms of Oppression: How Search Engines Reinforce Racism_.

Algorithms of Oppression: How Search Engines Reinforce Racism

text preprocessing topic focused on stemming and lemmatization

print(most_common_language_in_looking_glass)

# prints the following to console:
[(('of', 'the'), 101), (('said', 'the'), 98), (('in', 'a'), 97), (('in', 'the'), 90), (('as', 'she'), 82), (('you', 'know'), 72), (('a', 'little'), 68), (('the', 'queen'), 67), (('said', 'alice'), 67), (('to', 'the'), 66)]

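from nltk.metrics import edit_distance

def is_plagiarized(text1, text2):
  # Flag the texts as plagiarized when their edit distance is small
  # relative to their combined length (threshold: combined length / n).
  n = 7
  if edit_distance(text1.lower(), text2.lower()) > ((len(text1) + len(text2)) / n):
    return False
  return True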

Determining the underlying topics of a text

Using Cardi B lyrics to create a new song in her lyrical style.

A robot that nods every time someone says "right?"

A digital assistant for your phone or home.

Language varies by person, by dialect, and by many sociolinguistic factors.

There are many powerful tools built with natural language processing that come head-to-head with privacy concerns.

Different NLP tasks may be more or less difficult in different languages.

review = "it was aight"

classify_review(review)

# returns the following:
"This review is negative."

The virtual assistant might be trained without accounting for Caribbean Spanish dialects.

Your friend needs to learn Spanish from Spain or Colombia.

Your friend's privacy is being violated by the virtual assistant.

Either filtering the language model for stop words before performing LDA or using a tf-idf model before performing LDA.

Tokenizing the text before performing LDA.

Topics identified in text:

Topic #1: say put you there
Topic #2: there say place man
Topic #3: put man say when
Topic #4: had man one place
Topic #5: know say came see
Topic #6: came say place man
Topic #7: say could one there
Topic #8: place come say one
Topic #9: say man when case
Topic #10: say point man one

How well have you processed your natural language processing know-how? Test your introductory NLP knowledge here.

Getting Started with Natural Language Processing Quiz

### About this course
From your virtual assistant recommending a restaurant to that terrible autocorrect you sent your parents, natural language processing (NLP) is a rapidly growing presence in our lives. NLP is all about how computers work with human language.
*Don’t just use NLP tools — make them!*

### Skills you'll gain
* Use basic text pre-processing techniques
* Gain familiarity with various types of language models
* Use topic modeling to understand a text

Humans communicate with language, but computers communicate with data. Discover how to translate between the two in this course.
