Term Frequency–Inverse Document Frequency
Putting It All Together: Tf-idf

Now that we understand how term frequency and inverse document frequency are calculated, let’s put it all together to calculate tf-idf!

Tf-idf scores are calculated on a term-document basis. That means there is a tf-idf score for each word, for each document. The tf-idf score for some term t in a document d in some corpus is calculated as follows:

tfidf(t,d) = tf(t,d) * idf(t,corpus)
  • tf(t,d) is the term frequency of term t in document d
  • idf(t,corpus) is the inverse document frequency of term t across the corpus
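To make the formula concrete, here is a minimal pure-Python sketch. The toy corpus and the helper names tf, idf, and tfidf are hypothetical. The idf here uses the smoothed form that scikit-learn applies by default, ln((1 + n) / (1 + df)) + 1, which may differ from the definition used in the previous exercise.

```python
import math

# Hypothetical toy corpus: each document is a list of preprocessed tokens.
corpus = [
    ["hope", "is", "the", "thing", "with", "feathers"],
    ["the", "soul", "selects", "her", "own", "society"],
    ["hope", "hope", "the", "bird"],
]

def tf(term, document):
    # Raw count of the term in a single document.
    return document.count(term)

def idf(term, corpus):
    # Assumption: smoothed idf as in scikit-learn's default (smooth_idf=True).
    n = len(corpus)
    df = sum(1 for document in corpus if term in document)
    return math.log((1 + n) / (1 + df)) + 1

def tfidf(term, document, corpus):
    # tf-idf is the product of the two quantities.
    return tf(term, document) * idf(term, corpus)

print(tfidf("hope", corpus[2], corpus))  # appears twice here, in 2 of 3 documents
print(tfidf("the", corpus[2], corpus))   # appears once here, in every document
```

Because "the" occurs in every document, its smoothed idf is exactly 1, so its score is just its count. "hope" is both more frequent in the third document and rarer across the corpus, so it scores higher.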

We can easily calculate the tf-idf values for each term-document pair in our corpus using scikit-learn’s TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(norm=None)
tfidf_vectorizer = vectorizer.fit_transform(corpus)
  • a TfidfVectorizer object is initialized. The norm=None keyword argument turns off scikit-learn’s default L2 normalization of each document’s scores, so every value stays the plain product of term frequency and inverse document frequency
  • the TfidfVectorizer object is fit and transformed on the corpus of data, returning a sparse matrix of tf-idf scores with one row per document and one column per term
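As a rough illustration of these two steps, the sketch below runs TfidfVectorizer on a hypothetical three-document corpus (not the poems from the exercise). It then checks one entry against the count times scikit-learn's default smoothed idf, ln((1 + n) / (1 + df)) + 1. The variable names are illustrative only.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; each string is one document.
corpus = [
    "hope is the thing with feathers",
    "the soul selects her own society",
    "hope hope the bird",
]

vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(corpus)

# One row per document, one column per unique term.
print(tfidf_scores.shape)

# vocabulary_ maps each term to its column index in the matrix.
hope_column = vectorizer.vocabulary_["hope"]

# "hope" occurs twice in the third document and in 2 of the 3 documents.
expected = 2 * (math.log((1 + 3) / (1 + 2)) + 1)
print(math.isclose(tfidf_scores[2, hope_column], expected))
```

With the default norm='l2' the same entry would be rescaled so that each document's row has unit length, and the comparison above would no longer hold.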



The same selection of 6 Emily Dickinson poems from the previous exercise is given in poems.py.

In script.py, the poems are preprocessed. Let’s calculate the tf-idf scores for each term-document pair.

Begin by creating a TfidfVectorizer object named vectorizer with keyword argument norm=None.


Fit and transform your vectorizer on the corpus of preprocessed poems. Save the result to a variable named tfidf_scores.


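Like CountVectorizer objects, TfidfVectorizer objects have a .get_feature_names() method, which returns a list of all the unique terms in the corpus. (In scikit-learn 1.0 and later this method is deprecated in favor of .get_feature_names_out(), and it was removed in version 1.2.)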

Paste the line below into the “get vocabulary of terms” section of script.py to display the tf-idf matrix.

feature_names = vectorizer.get_feature_names()

Which term-document pairs have the highest tf-idf scores?
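One possible way to answer this kind of question in code is to convert the scores to a dense array and locate the largest entry. The sketch below uses a hypothetical toy corpus rather than the poems. It reads the terms from the fitted vocabulary_ mapping, so it does not depend on which feature-name method your scikit-learn version provides.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; each string is one document.
corpus = [
    "hope is the thing with feathers",
    "the soul selects her own society",
    "hope hope the bird",
]

vectorizer = TfidfVectorizer(norm=None)
scores = vectorizer.fit_transform(corpus).toarray()

# Order the terms by their column index in the score matrix.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Row (document) and column (term) of the single largest score.
doc_index, term_index = np.unravel_index(np.argmax(scores), scores.shape)
print(doc_index, terms[term_index], scores[doc_index, term_index])
```

Converting to a dense array is fine for a handful of short documents. For a large corpus it would be safer to work with the sparse matrix directly.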
