Creating a Word Cloud With Python

Codecademy Team
Learn how to use various Python libraries to create, mask, and display a word cloud with contents from a text file.

Introduction

In data visualization, word clouds are used to display textual data in a specific shape. The more frequent a word is in a chunk of text, the larger it will appear in the word cloud. This can provide interesting insights into datasets, such as determining which metatags have the most prominence on a web page. Other uses for word clouds include displaying what topics are usually covered in speeches and excerpts.
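To make the frequency idea concrete, here is a minimal sketch that counts words using only the standard library. The sample sentence and the word_frequencies helper are made up for illustration; the wordcloud library does its own tokenizing and scaling, which this sketch does not try to reproduce.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase word appears in the text."""
    # Treat runs of letters and apostrophes as words; a deliberately simple rule.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Made-up sample text, not taken from the tutorial's lyrics file.
sample = "Star light, star bright, the first star I see tonight"
frequencies = word_frequencies(sample)

# The most common word is the one a word cloud would draw largest.
print(frequencies.most_common(1))
```

In this sample, "star" appears three times and every other word once, so it would dominate the resulting cloud.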

In this article, we will learn how to create, mask, and display the following word cloud in Python:

Final version of Bowie word cloud

It’s a word cloud of the late and great David Bowie filled with words from his songs with “star” in the lyrics!

We will use the following libraries:

  • The wordcloud module will generate our word cloud using a text file.
  • NumPy will be used to mask, or shape, an image to be applied to the word cloud.
  • Pillow, a fork of Python Imaging Library (PIL), will be used for making a copy of the original image to work with.
  • Modules from the os library, including the os.path module, will allow us to access the folders and files in our project’s directory.
  • Matplotlib will ultimately be used to display our new word cloud.

Note: With the exception of the os library, which ships with Python, these libraries need to be installed to complete this tutorial. With pip for Python 3, the following command can be run to install them:

pip3 install numpy pillow wordcloud matplotlib

This tutorial assumes familiarity with the Python programming language, including importing libraries and using some of Python’s built-in functions.

Let’s get started with the first step!

Step 1: Project setup

To begin, we will create the root directory for our project. After launching the terminal or command prompt, we’ll run the following commands to create a new my_wordcloud directory and change to it:

mkdir my_wordcloud
cd my_wordcloud

We’re now working from the newly created my_wordcloud directory. Next, let’s create a new file called my_wordcloud.py:

touch my_wordcloud.py
ls

Running the ls command confirms we successfully created the my_wordcloud.py file in the correct directory:

my_wordcloud.py

Note: The file that will ultimately import the wordcloud library must not be named wordcloud.py. Otherwise, when executing the file, errors will be thrown due to circular imports.

For this project, we’ll need to have access to the file path of our current working directory while creating the word cloud. For this step, let’s open the my_wordcloud.py file and write the following:

import os
current_directory = os.path.dirname(__file__)

In the snippet above, we passed the special __file__ variable, which holds the path of the running script, to the os.path.dirname() function to get the path of the directory that contains it, my_wordcloud.
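On older Python versions, __file__ can be a relative path when a script is run directly, in which case os.path.dirname() may return an empty string rather than a full path. A slightly more defensive variant, sketched below, resolves the path first. The script_directory helper is a hypothetical name, and the example path stands in for __file__.

```python
import os


def script_directory(script_path):
    """Return the absolute path of the directory containing script_path."""
    # abspath() resolves a relative path against the current working directory.
    return os.path.dirname(os.path.abspath(script_path))


# In my_wordcloud.py this would be called as script_directory(__file__).
example_directory = script_directory(os.path.join("my_wordcloud", "my_wordcloud.py"))
print(example_directory)
```

The result is always absolute, so later os.path.join() calls do not depend on where the script was launched from.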

Let’s save the file and proceed to the next step.

Step 2: Find an image to mask with

Finding a suitable image for the word cloud can be challenging at first. There are a few criteria to keep in mind when searching for one:

  • The image must have a white background (#ffffff); anything off-white or transparent will be populated by words for the word cloud.
  • The shape(s) within the image should be well-defined and composed of mostly non-white colors.
  • Although this tutorial uses .png images, as long as the image is small, a .jpg image file can be used instead.
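If the only available image has a transparent background, one possible workaround is to flatten it onto white before using it as a mask. The sketch below assumes Pillow is installed; flatten_onto_white is a hypothetical helper, and the tiny synthetic image stands in for a real download.

```python
from PIL import Image


def flatten_onto_white(image):
    """Return an RGB copy of the image with transparent areas turned white."""
    rgba = image.convert("RGBA")
    background = Image.new("RGB", rgba.size, (255, 255, 255))
    # Use the alpha channel as the paste mask so only opaque pixels are copied.
    background.paste(rgba, mask=rgba.getchannel("A"))
    return background


# Synthetic 2x2 image: fully transparent except one opaque black pixel.
demo = Image.new("RGBA", (2, 2), (0, 0, 0, 0))
demo.putpixel((0, 0), (0, 0, 0, 255))

flattened = flatten_onto_white(demo)
print(flattened.getpixel((0, 0)), flattened.getpixel((1, 1)))
```

The opaque pixel keeps its color, while the transparent ones become pure white and would therefore be left empty by the word cloud.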

Googling for “word cloud mask images” will yield some promising results. For this tutorial, we are going to use the following image:

Original Bowie image

After downloading the image, let’s rename it to bowie.png and save it in our my_wordcloud parent directory.

Next, let’s assign the full path of our bowie.png file to a variable called bowie_image_path. Then, we will import the Image module from the PIL library to create an object representation of our image with the file path:

import os
from PIL import Image
bowie_image_path = os.path.join(current_directory, "bowie.png")
bowie_image = Image.open(bowie_image_path)

Note: The Image.open() method is not to be confused with Python’s built-in open() function.

With our image now stored inside the variable bowie_image, let’s save the my_wordcloud.py file and move to the next step.

Step 3: Create mask with image path

For this step, we will import and use NumPy to create the “mask” for our image. By this, we mean we are going to shape out the part of the image we’d like to generate a word cloud from.

We will create an ndarray of the image’s pixel values. Pixels with a value of 255 (“white”) are left out of the mask, so no words are drawn there; pixels of any other color are fair game. Using the bowie_image we made in the last step, let’s add the following to the my_wordcloud.py file:

import os
from PIL import Image
import numpy as np
bowie_mask = np.array(bowie_image)
print(bowie_mask)

Here, we imported the numpy library and can access its functions with the np alias. Then, we called the np.array() function to create a new ndarray called bowie_mask.

Let’s run our program thus far with the python3 my_wordcloud.py command. The print() statement will check that our bowie_mask array was created correctly. We should see (mostly) values of 255, confirming that our image had a proper white background and the mask can be applied to our word cloud.
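Reading a large printed array by eye is error-prone. As a rough alternative, the sketch below computes the share of pure-white pixels. The white_fraction helper is hypothetical, and the small synthetic array stands in for bowie_mask.

```python
import numpy as np


def white_fraction(mask):
    """Return the fraction of pixels in the mask that are pure white (255)."""
    mask = np.asarray(mask)
    if mask.ndim == 3:
        # Color image: a pixel is white only if its R, G, and B values are all 255.
        white = np.all(mask[..., :3] == 255, axis=-1)
    else:
        # Grayscale image: a single value per pixel.
        white = mask == 255
    return float(white.mean())


# Synthetic 4x4 RGB "image": white everywhere except a 2x2 black square.
demo_mask = np.full((4, 4, 3), 255, dtype=np.uint8)
demo_mask[1:3, 1:3] = 0

print(white_fraction(demo_mask))
```

A value close to 0 would suggest the background is not pure white, while a value close to 1 would leave very little room for words.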

Let’s remove the print() statement, save the my_wordcloud.py file, and head to the next step.

Step 4: Find text to generate

In this step, we are going to find some text to use for generating our word cloud. All we need to generate a word cloud is a single string of text. The string can be locally defined in the same .py file or it can be saved to an external file and accessed from there.

For this tutorial, we will use a text (.txt) file filled with lyrics from Bowie songs featuring the word “star.” They can be accessed here.

Let’s create a new file called bowie_star_lyrics.txt, copy/paste the lyrics into it, and save the file.

Then, inside the my_wordcloud.py file, let’s add the following code:

with open(os.path.join(current_directory, "bowie_star_lyrics.txt")) as f:
    lyrics = f.read()

We did the following in the snippet above:

  • We used the os.path.join() method to return the full path string for our bowie_star_lyrics.txt file.
  • The built-in open() function created an object representation of our file with the path string.
  • Inside the with block, the contents of the file were accessed with the .read() method and assigned to a variable called lyrics.
  • Lastly, the file was closed after the program exited the with block.
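Lyrics often contain characters such as curly apostrophes, and the default text encoding used by open() varies between systems. The sketch below passes an explicit encoding, assuming the file was saved as UTF-8. The read_lyrics helper and the throwaway demo file are made up for illustration.

```python
import os
import tempfile


def read_lyrics(path):
    """Read a whole text file as UTF-8 and return it as one string."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# Demonstrate with a throwaway file instead of bowie_star_lyrics.txt.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo_lyrics.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("The star’s light is bright\n")
    lyrics_demo = read_lyrics(demo_path)

print(lyrics_demo)
```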

Let’s save our my_wordcloud.py file once more and go to the next step.

Step 5: Create and generate word cloud

It is now time to begin building the word cloud! To start, we will import the following from the wordcloud library like so:

import os
from PIL import Image
import numpy as np
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

Note: STOPWORDS and WordCloud are both case-sensitive.

The WordCloud class contains all of the methods we need for generating the word cloud. STOPWORDS is a set of superfluous filler words, like “an”, “and”, “the”, and “they,” that will be filtered out of the word cloud. Because it is a regular Python set, more words can be added with the .add() method if desired. The ImageColorGenerator will be utilized later on in this step.
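As an alternative to calling .add() on the shared set, a custom copy can be built with a set union. This is only a sketch: the extra words are hypothetical, and the fallback set exists solely so the example still runs where wordcloud is not installed.

```python
try:
    from wordcloud import STOPWORDS
except ImportError:
    # Stand-in so the sketch still runs without wordcloud installed;
    # the library's real set is much larger.
    STOPWORDS = {"an", "and", "the", "they"}

# Hypothetical extra filler words for song lyrics.
extra_words = {"oh", "yeah", "la"}

# The union builds a new set, leaving the library's STOPWORDS untouched.
custom_stopwords = set(STOPWORDS) | extra_words

print(sorted(extra_words & custom_stopwords))
```

The resulting custom_stopwords set could then be passed to the stopwords parameter of WordCloud in place of STOPWORDS.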

Let’s now create an instance of the WordCloud class and assign it to a variable with the following code:

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
wordcloud = WordCloud(background_color="white", mask=bowie_mask, collocations=False, stopwords=STOPWORDS, contour_color="white", contour_width=1)

We did the following in the snippet above:

  • The bowie_mask we previously made was assigned to the mask parameter to “shape” our word cloud.
  • We also assigned the STOPWORDS set to the (lowercase) stopwords parameter.
  • The default background_color of a word cloud is black, so we changed it to "white".
  • The collocations parameter was set to False to break up word combinations and better distribute the text throughout the word cloud.
  • We outlined our word cloud with the contour_color and contour_width parameters.

Next, we will invoke the .generate() method on our wordcloud variable, passing in the lyrics string we defined earlier in this tutorial:

wordcloud = WordCloud(background_color="white", mask=bowie_mask, collocations=False, stopwords=STOPWORDS, contour_color="white", contour_width=1)
wordcloud.generate(lyrics)
print(wordcloud)

This will populate our word cloud with the words read from the bowie_star_lyrics.txt file, without the STOPWORDS.

If we save the file and run python3 my_wordcloud.py, the print() statement should output a WordCloud object like the one below, confirming that the word cloud was created and generated correctly:

<wordcloud.wordcloud.WordCloud object at 0x7fb404433be0>

After confirming our wordcloud was created, let’s remove the print() statement.

The last matter we’ll cover in this step is attaching a color generator to our word cloud. Let’s add the following to the my_wordcloud.py file:

wordcloud.generate(lyrics)
image_colors = ImageColorGenerator(bowie_mask)

Note: An RGB image must be passed to the ImageColorGenerator constructor.

The image_colors object maps the color(s) of the generated words to the original image as closely as possible.
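Roughly speaking, coloring a word from an image means looking at the region of the image the word covers and taking a representative color from it. The sketch below illustrates that idea with a plain average over a rectangular patch. It is a simplified illustration under that assumption, not the library's actual code; mean_patch_color is a hypothetical helper, and the synthetic array stands in for bowie_mask.

```python
import numpy as np


def mean_patch_color(image, top, left, height, width):
    """Average the RGB values inside a rectangular patch of the image."""
    patch = image[top:top + height, left:left + width, :3]
    # Collapse the patch to a list of pixels, then average each channel.
    mean = patch.reshape(-1, 3).mean(axis=0)
    return tuple(int(round(value)) for value in mean)


# Synthetic 4x4 RGB image: left half red, right half blue.
demo_image = np.zeros((4, 4, 3), dtype=np.uint8)
demo_image[:, :2] = (255, 0, 0)
demo_image[:, 2:] = (0, 0, 255)

print(mean_patch_color(demo_image, 0, 0, 4, 2))  # patch over the red half
print(mean_patch_color(demo_image, 0, 2, 4, 2))  # patch over the blue half
```

A word placed over the red half would come out red and one over the blue half would come out blue, which is why the mask image needs color channels for this step.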

Let’s save the my_wordcloud.py file and head to the final step.

Step 6: Show word cloud on pyplot figure

The last library we will use for this tutorial will be Matplotlib. The methods from the library’s pyplot interface will display our new word cloud. Let’s first import matplotlib.pyplot at the top of our my_wordcloud.py file:

import os
from PIL import Image
import numpy as np
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt

As with NumPy, we use the alias plt as shorthand for the module we imported. Let’s now add the following to our file:

plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In the snippet above, we used the following methods:

  • The .imshow() method creates and draws the figure on which our wordcloud is placed; the "bilinear" setting helps smooth out the image.
  • We turned off the labels of the x- and y-axes with .axis("off").
  • Lastly, the figure with our word cloud was displayed with the .show() method.

Let’s save our my_wordcloud.py file and run it with python3 my_wordcloud.py. The following figure window should appear on our screen:

Unfinished Bowie word cloud

The word cloud figure should render onto the screen within a few seconds after running the file. The generated lyrics should be shaped like the Bowie image.

The last task in this tutorial is to apply coloring to the word cloud with the image_colors object. Let’s close the figure window and add the following change to the plt.imshow() method:

plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear")
plt.axis("off")
plt.show()

The .recolor() method recolors the words using image_colors, so our wordcloud is displayed with the same colors as the original image. If we save our my_wordcloud.py file and re-run it, we should see the following figure:

Final version of Bowie word cloud

Conclusion

We have successfully rendered our first word cloud! All in all, we were able to generate a word cloud with fewer than 40 lines of code.

In this tutorial, we learned how to utilize various Python libraries and modules to create, mask, generate, and display a word cloud.

Source code for this tutorial