Analyze Taylor Swift Lyrics

Tokenization

In natural language processing, tokenization is the text preprocessing task of breaking text into smaller units, known as tokens.

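from nltk.tokenize import word_tokenize
# word_tokenize relies on NLTK's Punkt tokenizer data, which must be downloaded once with nltk.download()
text = "This is a text to tokenize"
tokenized = word_tokenize(text)
print(tokenized)
# ['This', 'is', 'a', 'text', 'to', 'tokenize']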

Text Preprocessing

In natural language processing, text preprocessing is the practice of cleaning and preparing text data. NLTK and the built-in re module are common Python tools for many text preprocessing tasks.
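
As a rough illustration of a cleaning step, the sketch below uses only the re module to lowercase text, strip punctuation, and split on whitespace. The function name preprocess and the sample sentence are hypothetical, and a real pipeline would likely hand tokenization to NLTK instead.

```python
import re

def preprocess(text):
    """Lowercase the text, replace punctuation with spaces, and split on whitespace."""
    lowered = text.lower()
    # Keep letters, digits, apostrophes, and whitespace; everything else becomes a space.
    cleaned = re.sub(r"[^a-z0-9'\s]", " ", lowered)
    return cleaned.split()

tokens = preprocess("Hello, World! It's a text to CLEAN.")
print(tokens)
# ['hello', 'world', "it's", 'a', 'text', 'to', 'clean']
```

Keeping the apostrophe is a design choice here so that contractions such as "it's" stay in one piece; a different task might prefer to split or drop them.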

Pandas’ Groupby

In a pandas DataFrame, aggregate statistics can be computed across groups of rows with the groupby() method. In the example, the code groups the rows that share the same value in Name and computes the mean of Grade for each group. Any other aggregate function, such as median() or max(), can be used in place of mean(). Note that this pattern involves at least two columns: one to group by and one to aggregate.

import pandas as pd
df = pd.DataFrame([
["Amy","Assignment 1",75],
["Amy","Assignment 2",35],
["Bob","Assignment 1",99],
["Bob","Assignment 2",35]
], columns=["Name", "Assignment", "Grade"])
df.groupby('Name').Grade.mean()
# output of the groupby command
|Name | Grade|
| - | - |
|Amy | 55|
|Bob | 67|
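
To show how other aggregate functions slot into the same pattern, here is a minimal sketch that reuses the example data and computes several statistics at once with agg(). The variable name summary is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["Amy", "Assignment 1", 75],
        ["Amy", "Assignment 2", 35],
        ["Bob", "Assignment 1", 99],
        ["Bob", "Assignment 2", 35],
    ],
    columns=["Name", "Assignment", "Grade"],
)

# Group rows by Name, then compute three aggregates of Grade for each group.
summary = df.groupby("Name")["Grade"].agg(["mean", "median", "max"])
print(summary)
```

The result is a DataFrame indexed by Name with one column per aggregate, so a single value can be read with, for example, summary.loc["Amy", "max"].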

Sentiment Analysis

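Sentiment analysis is the process of programmatically labeling text as positive, negative, or neutral. NLTK includes a pre-built sentiment model (VADER), exposed as SentimentIntensityAnalyzer, whose polarity_scores() method returns negative (neg), neutral (neu), positive (pos), and compound scores. The compound score is a normalized combination of the word-level sentiment scores and ranges from -1 (most negative) to +1 (most positive).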

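from nltk.sentiment import SentimentIntensityAnalyzer
# Requires the VADER lexicon, downloaded once with nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I love Taylor Swift!"))
# prints a dict with the keys 'neg', 'neu', 'pos', and 'compound'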
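
One way to turn the returned scores into a single label is to threshold the compound value. The sketch below is an assumption-laden illustration: the helper label_sentiment is hypothetical, the 0.05 cutoff is a commonly used convention rather than something this lesson specifies, and example_scores is a hand-written dict, not real analyzer output.

```python
def label_sentiment(scores, threshold=0.05):
    """Map a polarity_scores()-style dict to 'positive', 'negative', or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hand-written illustrative values with the same keys polarity_scores() returns.
example_scores = {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.7}
label = label_sentiment(example_scores)
print(label)
# positive
```

With a real analyzer, the same helper could be called as label_sentiment(sia.polarity_scores(line)) for each lyric line.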
