Text preprocessing is the practice of cleaning and preparing text data for use in a specific context. Developers use it in almost all natural language processing (NLP) pipelines, including voice recognition software, search engine lookup, and machine learning model training. It is an essential step because text data varies widely: from its format (website, text message, voice recognition output) to the people who create it (language, dialect), there are plenty of factors that can introduce noise into your data.
The ultimate goal of cleaning and preparing text data is to reduce the text to only the words that you need for your NLP goals.
In this lesson, you will learn strategies for preparing text data. While this list is not exhaustive, we will cover a few common approaches for cleaning and processing text data. They include:
- Using Regex & NLTK libraries
- Removing unnecessary characters and formatting
- Tokenization – breaking multi-word strings into smaller components
- Normalization – a catch-all term for transforming text into a consistent form; this includes stemming and lemmatization
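As a preview of the first two steps, here is a minimal sketch of noise removal and tokenization using only Python's built-in `re` module; the sample string and patterns are illustrative, and NLTK offers more robust tokenizers than a plain `split()`.

```python
import re

# A raw string with HTML formatting and punctuation "noise" (example input).
raw = "<p>NLP   is fun!</p>"

# Noise removal: strip HTML tags, punctuation, and extra whitespace
# with regular expressions.
no_tags = re.sub(r"<[^>]+>", "", raw)            # remove HTML tags
no_punct = re.sub(r"[^\w\s]", "", no_tags)       # remove punctuation
cleaned = re.sub(r"\s+", " ", no_punct).strip()  # collapse whitespace

# Tokenization: break the cleaned string into individual words.
tokens = cleaned.split()
print(tokens)  # -> ['NLP', 'is', 'fun']
```

Each regular expression targets one kind of noise, which makes the cleaning steps easy to add, remove, or reorder for a different data source.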
In the GIF to the right, you can see an example of using noise removal, tokenization, and lemmatization to change the string “Who was partying?” into a list containing the words “who”, “be”, and “party”.
In this lesson, you will learn how to use built-in and NLTK functions to apply these same text preprocessing approaches to your own strings.