Now that we understand how term frequency and inverse document frequency are calculated, let’s put it all together to calculate tf-idf!
Tf-idf scores are calculated on a term-document basis. That means there is a tf-idf score for each word, for each document. The tf-idf score for some term t in a document d in some corpus is calculated as follows:
tf-idf(t, d) = tf(t, d) × idf(t, corpus)
tf(t, d) is the term frequency of term t in document d
idf(t, corpus) is the inverse document frequency of term t across the corpus
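To make the product concrete, here is a minimal sketch that computes it by hand. The toy corpus and the helper names tf, idf, and tfidf are made up for illustration, and the sketch assumes the common textbook form idf = ln(N / df); scikit-learn uses a smoothed variant, so its numbers will differ.

```python
import math

# Hypothetical toy corpus: each document is a list of already-tokenized terms.
corpus = [
    ["the", "bee", "hums"],
    ["the", "bird", "sings"],
    ["the", "bee", "and", "the", "bird"],
]

def tf(term, document):
    # Term frequency: raw count of the term in one document.
    return document.count(term)

def idf(term, corpus):
    # Inverse document frequency, assuming the textbook form ln(N / df).
    # The term must occur in at least one document, or this divides by zero.
    n_documents = len(corpus)
    n_containing = sum(1 for document in corpus if term in document)
    return math.log(n_documents / n_containing)

def tfidf(term, document, corpus):
    # tf-idf is the product of the two quantities above.
    return tf(term, document) * idf(term, corpus)

print(tfidf("the", corpus[2], corpus))   # 0.0, because "the" occurs in every document
print(tfidf("hums", corpus[0], corpus))  # ln(3), roughly 1.0986
```

Under this definition, a term that appears in every document scores zero no matter how often it occurs, while a term confined to a single document scores highest.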
We can easily calculate the tf-idf values for each term-document pair in our corpus using scikit-learn’s TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None)
tfidf_vectorizer = vectorizer.fit_transform(corpus)
A TfidfVectorizer object is initialized. The norm=None keyword argument turns off scikit-learn’s default normalization (norm='l2', which rescales each document’s scores to unit length), so each score stays the plain product of term frequency and inverse document frequency.
The TfidfVectorizer object is fit and transformed on the corpus of data, returning the tf-idf scores for each term-document pair.
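The sketch below shows what fit_transform returns and what norm=None changes. The toy corpus and the variable names are illustrative assumptions, not part of the lesson's files. It relies on scikit-learn's default settings, under which the inverse document frequency is smoothed as ln((1 + N) / (1 + df)) + 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; in the lesson, corpus is the list of preprocessed poems.
toy_corpus = [
    "the bee hums",
    "the bird sings and sings",
    "the bee and the bird",
]

# norm=None keeps each score as the plain product tf * idf.
vectorizer = TfidfVectorizer(norm=None)
raw_scores = vectorizer.fit_transform(toy_corpus)

# The result is a sparse matrix: one row per document, one column per unique term.
print(raw_scores.shape)  # (3, 6)

# vocabulary_ maps each term to its column index.
the_column = vectorizer.vocabulary_["the"]
print(raw_scores[2, the_column])  # 2.0: tf = 2, smoothed idf = ln(4/4) + 1 = 1

# With the default norm="l2", every document row is rescaled to unit length.
normalized_scores = TfidfVectorizer().fit_transform(toy_corpus).toarray()
row_lengths = np.linalg.norm(normalized_scores, axis=1)
print(row_lengths)  # each entry is 1 up to floating-point rounding
```

Because of the smoothing, a term that appears in every document still gets an idf of 1 rather than 0, so its score with norm=None equals its raw count.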
The same selection of 6 Emily Dickinson poems from the previous exercise is given in poems.py.
In script.py, the poems are preprocessed. Let’s calculate the tf-idf scores for each term-document pair.
Begin by creating a TfidfVectorizer object named vectorizer with keyword argument norm=None.
Fit and transform your
vectorizer on the corpus of preprocessed poems. Save the result to a variable named
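TfidfVectorizer objects have a .get_feature_names() method, which returns a list of all the unique terms in the corpus. (In scikit-learn 1.0 and later the equivalent method is .get_feature_names_out(); .get_feature_names() was removed in version 1.2.)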
Paste the line below into the “get vocabulary of terms” section of script.py to display the tf-idf matrix.
feature_names = vectorizer.get_feature_names()
Which term-document pairs have the highest tf-idf scores?