In natural language processing, tokenization is the text preprocessing task of breaking up text into smaller components of text (known as tokens).
from nltk.tokenize import word_tokenize

text = "This is a text to tokenize"
tokenized = word_tokenize(text)
print(tokenized)
# ["This", "is", "a", "text", "to", "tokenize"]
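To make the idea of splitting text into tokens concrete without any NLTK data files, here is a minimal sketch that uses only the standard-library re module. The helper name simple_tokenize and the regular expression are illustrative assumptions, not NLTK's algorithm, and the results can differ from word_tokenize (for example on contractions).

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and standalone punctuation marks.

    Hypothetical helper for illustration; not how NLTK tokenizes.
    """
    # \w+ matches a run of letters, digits, or underscores (a "word");
    # [^\w\s] matches a single character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("This is a text to tokenize.")
print(tokens)
```

A pattern like this is often good enough for quick experiments, while a library tokenizer handles more language-specific edge cases.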
In natural language processing, text preprocessing is the practice of cleaning and preparing text data. NLTK and re are common Python libraries used to handle many text preprocessing tasks.
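As a rough illustration of what cleaning text can look like, the sketch below uses re to lowercase a string, drop punctuation, and collapse repeated whitespace. The function name clean_text and the particular cleaning steps are assumptions chosen for the example; a real pipeline would pick steps to suit its data.

```python
import re

def clean_text(text):
    """Apply a few common (assumed) cleaning steps to a string."""
    # Normalize case so "Hello" and "hello" are treated alike
    text = text.lower()
    # Remove every character that is not a lowercase letter, digit, or whitespace
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Collapse runs of whitespace into one space and trim both ends
    text = re.sub(r"\s+", " ", text).strip()
    return text

cleaned = clean_text("  Hello,   World! It's 2024. ")
print(cleaned)
```

Note that this simple character filter also strips apostrophes and any non-ASCII letters, which may or may not be desirable depending on the task.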
In a pandas DataFrame, aggregate statistic functions can be applied across groups of rows by using the groupby() method. In the example, the code groups the rows that share the same value in Name and replaces the values in Grade with the mean for each group. Instead of mean(), any aggregate statistics function, such as max(), can be used. Note that a grouped aggregation like this involves at least two columns: one to group by and one to aggregate.
import pandas as pd

df = pd.DataFrame(
    [
        ["Amy", "Assignment 1", 75],
        ["Amy", "Assignment 2", 35],
        ["Bob", "Assignment 1", 99],
        ["Bob", "Assignment 2", 35],
    ],
    columns=["Name", "Assignment", "Grade"],
)

# Average grade per student
df.groupby("Name").Grade.mean()

# output of the groupby command
| Name | Grade |
| - | - |
| Amy | 55 |
| Bob | 67 |
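To show how mean() can be swapped for another aggregate, the following sketch rebuilds the same DataFrame and computes the highest grade per student, then several statistics at once with agg(). The variable names highest and summary are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["Amy", "Assignment 1", 75],
        ["Amy", "Assignment 2", 35],
        ["Bob", "Assignment 1", 99],
        ["Bob", "Assignment 2", 35],
    ],
    columns=["Name", "Assignment", "Grade"],
)

# Highest grade per student: same grouping, different aggregate
highest = df.groupby("Name")["Grade"].max()

# Several aggregates in one call; the result has one column per statistic
summary = df.groupby("Name")["Grade"].agg(["mean", "max"])

print(highest)
print(summary)
```

Selecting the column with ["Grade"] is equivalent to the attribute form .Grade used above and also works for column names that are not valid Python identifiers.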
Sentiment analysis is the process of programmatically labelling text as positive, negative, or neutral. NLTK has a pre-built sentiment classification model (VADER) that returns positive, negative, neutral, and compound scores. The compound score is a normalized aggregate of the text's overall sentiment, ranging from -1 (most negative) to +1 (most positive).
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
sia.polarity_scores("I love Taylor Swift!")
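The scores come back as a dictionary with the keys neg, neu, pos, and compound. One way to turn that dictionary into a single label is to threshold the compound value, as sketched below. The helper label_sentiment is hypothetical, the 0.05 cutoff is a commonly used convention rather than something NLTK enforces, and the example dictionary contains made-up numbers, not real model output.

```python
def label_sentiment(scores, threshold=0.05):
    """Map a polarity_scores-style dict to a label (hypothetical helper)."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    # Anything close to zero is treated as neutral
    return "neutral"

# Made-up scores in the same shape that polarity_scores() returns
example_scores = {"neg": 0.0, "neu": 0.3, "pos": 0.7, "compound": 0.6}
label = label_sentiment(example_scores)
print(label)
```

Widening the threshold makes the neutral band larger, which can help when many texts are only weakly opinionated.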