Tokenization and noise removal are staples of almost all text pre-processing pipelines. However, some data may require further processing through text normalization. Text normalization is a catch-all term for various text pre-processing tasks. In the next few exercises, we’ll cover a few of them:

  • Uppercasing or lowercasing
  • Stopword removal
  • Stemming – bluntly removing prefixes and suffixes from a word
  • Lemmatization – replacing a single-word token with its root
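
To make the list above concrete, here is a rough sketch of lowercasing, stopword removal, and a deliberately blunt form of stemming in plain Python. The `STOPWORDS` set, the `SUFFIXES` tuple, and the `naive_stem` and `normalize` helpers are all made up for this illustration; a real pipeline would normally rely on an NLP library's stopword list and stemmer. Lemmatization is left out because it needs a dictionary of word forms.

```python
# Toy illustration only: this stopword set and suffix list are invented for the sketch.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in"}
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, split on whitespace, drop stopwords, then stem each token."""
    tokens = text.lower().split()
    kept = [token for token in tokens if token not in STOPWORDS]
    return [naive_stem(token) for token in kept]


print(normalize("The cats are jumping in the garden"))
# ['cat', 'jump', 'garden']
```

Stripping suffixes this crudely will mangle many words, which is exactly the trade-off that separates stemming from lemmatization.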

The simplest of these approaches is to change the case of a string. We can use Python’s built-in string methods to make a string all uppercase or all lowercase:

my_string = 'tHiS HaS a MiX oF cAsEs'
print(my_string.upper())  # 'THIS HAS A MIX OF CASES'
print(my_string.lower())  # 'this has a mix of cases'
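
One reason case normalization matters is that a program treats 'Python', 'python', and 'PYTHON' as three different tokens unless told otherwise. The short sketch below uses an invented token list to compare word counts before and after lowercasing:

```python
from collections import Counter

# Hypothetical tokens, chosen only to show the effect of mixed casing.
tokens = ["Python", "python", "PYTHON", "Java"]

raw_counts = Counter(tokens)
lowered_counts = Counter(token.lower() for token in tokens)

print(raw_counts["python"])      # 1
print(lowered_counts["python"])  # 3
```

Lowercasing merges the three variants into a single entry, which is usually what we want when counting word frequencies.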



