In natural language processing, tokenization is the text preprocessing task of breaking text up into smaller components, known as tokens.
from nltk.tokenize import word_tokenize
# word_tokenize may require the Punkt models: nltk.download('punkt')

text = "This is a text to tokenize"
tokenized = word_tokenize(text)
print(tokenized)
# ['This', 'is', 'a', 'text', 'to', 'tokenize']
In natural language processing, text preprocessing is the practice of cleaning and preparing text data. NLTK and re are common Python libraries used to handle many text preprocessing tasks.
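As a minimal sketch of what one such preprocessing step might look like, the following uses re to lowercase text, strip punctuation, and normalize whitespace. The clean_text helper and its regex patterns are illustrative choices, not a fixed recipe.

import re

def clean_text(text):
    # lowercase so "MESSY" and "messy" are treated the same
    text = text.lower()
    # remove anything that is not a letter, digit, or whitespace
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # collapse runs of whitespace and trim the ends
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text("  This is... a MESSY    sentence!! "))
# this is a messy sentence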
In a pandas DataFrame, aggregate statistic functions can be applied across multiple rows by using a groupby function. In the example below, the code groups all rows that share the same value in Name and replaces the values in Grade with their mean. Instead of mean(), any aggregate statistics function, such as median() or max(), can be used. Note that to use the groupby() function, at least two columns must be supplied.
import pandas as pd

df = pd.DataFrame(
    [["Amy", "Assignment 1", 75],
     ["Amy", "Assignment 2", 35],
     ["Bob", "Assignment 1", 99],
     ["Bob", "Assignment 2", 35]],
    columns=["Name", "Assignment", "Grade"])

print(df.groupby('Name').Grade.mean())
# output of the groupby command
# Name
# Amy    55.0
# Bob    67.0
# Name: Grade, dtype: float64
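To swap in other aggregate functions, or apply several at once, pandas' agg() method works on the same grouped object. A short sketch, assuming the df defined above:

df.groupby('Name').Grade.agg(['mean', 'median', 'max'])
# returns one column per aggregate, indexed by Name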
Sentiment analysis is the process of programmatically labelling text as positive, negative, or neutral. NLTK has a pre-built sentiment classification model (VADER) that returns positive, negative, neutral, and compound scores. The compound score is a single normalized value ranging from -1 (most negative) to 1 (most positive).
from nltk.sentiment import SentimentIntensityAnalyzer
# the analyzer needs the VADER lexicon: nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I love Taylor Swift!"))
# returns a dict with 'neg', 'neu', 'pos', and 'compound' scores
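To turn the compound score into a single label, a common convention (recommended in the VADER documentation, but not enforced by NLTK) is to threshold at plus or minus 0.05. A sketch, assuming the sia analyzer from above:

scores = sia.polarity_scores("I love Taylor Swift!")
if scores['compound'] >= 0.05:
    label = 'positive'   # clearly positive text
elif scores['compound'] <= -0.05:
    label = 'negative'   # clearly negative text
else:
    label = 'neutral'    # compound score near zero
print(label)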