Creating a Word Cloud With Python
In data visualization, word clouds are used to display textual data, often in a specific shape. The more frequent a word is in a body of text, the larger it appears in the word cloud. This can provide interesting insights into datasets, such as determining which meta tags have the most prominence on a web page. Other uses for word clouds include displaying which topics are usually covered in speeches and excerpts.
In this article, we will learn how to create, mask, and display the following word cloud in Python:
It’s a word cloud of the late and great David Bowie filled with words from his songs with “star” in the lyrics!
We will use the following libraries:
- The wordcloud module will generate our word cloud using a text file.
- NumPy will be used to mask, or shape, an image to be applied to the word cloud.
- Pillow, a fork of Python Imaging Library (PIL), will be used for making a copy of the original image to work with.
- Modules from the os library, including the os.path module, will allow us to access the folders and files in our project’s directory.
- Matplotlib will ultimately be used to display our new word cloud.
Note: With the exception of the os library, all of these will need to be installed to complete this tutorial. Using pip for Python 3, the following command can be run to install them:
pip3 install numpy pillow wordcloud matplotlib
This tutorial assumes familiarity with the Python programming language, including importing libraries and using some of Python’s built-in functions.
Let’s get started with the first step!
Step 1: Project setup
To begin, we will create the root directory for our project. After launching the terminal or command prompt, we’ll run the following commands to create a new my_wordcloud directory and change to it:
mkdir my_wordcloud
cd my_wordcloud
We’re now working from the newly created my_wordcloud directory. Next, let’s create a new file called my_wordcloud.py. Running the ls command should then confirm that we successfully created the my_wordcloud.py file in the correct directory.
Note: The file that will ultimately import the wordcloud library must not be named wordcloud.py. Otherwise, the script shadows the installed library, and executing the file will throw import errors (typically reported as circular imports).
For this project, we’ll need to have access to the file path of our current working directory while creating the word cloud. For this step, let’s open the my_wordcloud.py file and write the following:
import os
current_directory = os.path.dirname(__file__)
In the snippet above, we passed the __file__ attribute to the os.path.dirname() function to return the path string of the directory that contains our script.
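Depending on the Python version and how the script is launched, __file__ may be a relative path, in which case os.path.dirname() can return a relative or even empty string. A minimal sketch of a more defensive variant follows; script_directory is a hypothetical helper name, not part of the tutorial's script.

```python
import os


def script_directory(file_path):
    """Return the absolute path of the directory containing file_path."""
    # abspath() resolves a relative path against the current working directory
    return os.path.dirname(os.path.abspath(file_path))


# In my_wordcloud.py this would be called as script_directory(__file__).
example = script_directory(os.path.join("my_wordcloud", "my_wordcloud.py"))
print(example)
```

The rest of the tutorial works the same way with either approach, since both produce a directory path that os.path.join() can build on.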
Let’s save the file and proceed to the next step.
Step 2: Find an image to mask with
Finding a suitable image for the word cloud can be challenging at first. There are a few criteria to keep in mind when searching for one:
- The image must have a white background (#ffffff); anything off-white or transparent will be populated by words for the word cloud.
- The shape(s) within the image should be well-defined and composed of mostly non-white colors.
- Although this tutorial uses .png images, as long as the image is small, a .jpg image file can be used instead.
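If the only suitable image has a transparent background, one option is to flatten it onto white before using it as a mask. The sketch below assumes Pillow is installed; flatten_onto_white is a hypothetical helper name, and the tiny in-memory image stands in for a downloaded file.

```python
from PIL import Image


def flatten_onto_white(image):
    """Composite an image onto a solid white background and return it as RGB."""
    rgba = image.convert("RGBA")
    background = Image.new("RGBA", rgba.size, (255, 255, 255, 255))
    # alpha_composite() requires two RGBA images of the same size
    return Image.alpha_composite(background, rgba).convert("RGB")


# Stand-in for a downloaded image: a fully transparent 4x4 canvas
# with a single opaque black pixel.
sample = Image.new("RGBA", (4, 4), (0, 0, 0, 0))
sample.putpixel((1, 1), (0, 0, 0, 255))
flattened = flatten_onto_white(sample)
print(flattened.mode, flattened.getpixel((0, 0)), flattened.getpixel((1, 1)))
```

Returning an RGB image also keeps the result compatible with tools that expect three color channels.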
Googling for “word cloud mask images” will yield some promising results. For this tutorial, we are going to use the following image:
After downloading the image, let’s rename it to bowie.png and save it in our my_wordcloud directory.
Next, let’s assign the full path of our bowie.png file to a variable called bowie_image_path. Then, we will import the Image module from the PIL library to create an object representation of our image with the file path:
import os
from PIL import Image
…
bowie_image_path = os.path.join(current_directory, "bowie.png")
bowie_image = Image.open(bowie_image_path)
Note: The Image.open() function is not to be confused with Python’s built-in open() function.
With our image now stored inside the variable bowie_image, let’s save the my_wordcloud.py file and move to the next step.
Step 3: Create mask with image path
For this step, we will import and use NumPy to create the “mask” for our image. By this, we mean we are going to shape out the part of the image we’d like to generate a word cloud from.
We will create an ndarray of pixel values that indicates which pixels should be left out of the mask (255 represents “white”) and which should be included; all other colors are fair game. Using the bowie_image we made in the last step, let’s add the following to the my_wordcloud.py file:
import os
from PIL import Image
import numpy as np
…
bowie_mask = np.array(bowie_image)
print(bowie_mask)
Here, we imported the numpy library and can access its functions with the np alias. Then, we invoked the np.array() function to create a new ndarray from bowie_image.
Let’s run our program thus far with the python3 my_wordcloud.py command. The print() statement will check that our bowie_mask array was created correctly. We should see (mostly) values of 255, confirming that our image had a proper white background and the mask can be applied to our word cloud.
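Since a printed array only shows a small part of a large image, a numeric summary can be a more reliable check. The following is a minimal sketch using NumPy only; white_fraction is a hypothetical helper name, and the small demo_mask array stands in for bowie_mask.

```python
import numpy as np


def white_fraction(mask):
    """Return the share of pixels whose channels are all exactly 255."""
    mask = np.asarray(mask)
    if mask.ndim == 2:
        # Grayscale image: a single value per pixel
        white = mask == 255
    else:
        # Color image: compare only the first three (RGB) channels
        white = np.all(mask[..., :3] == 255, axis=-1)
    return float(white.mean())


# Stand-in for bowie_mask: a white 10x10 RGB image with a black 4x4 square
demo_mask = np.full((10, 10, 3), 255, dtype=np.uint8)
demo_mask[3:7, 3:7] = 0
print(demo_mask.shape, demo_mask.dtype, white_fraction(demo_mask))
```

A fraction close to 0 would suggest that the background is off-white or transparent rather than pure white.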
Let’s remove the print() statement, save the my_wordcloud.py file, and head to the next step.
Step 4: Find text to generate
In this step, we are going to find some text to use for generating our word cloud. All we need to generate a word cloud is a single string of text. The string can be locally defined in the same .py file or it can be saved to an external file and accessed from there.
For this tutorial, we will use a text (.txt) file filled with lyrics from Bowie songs featuring the word “star.” They can be accessed here.
Let’s create a new file called bowie_star_lyrics.txt, copy/paste the lyrics into it, and save the file.
Then, inside the my_wordcloud.py file, let’s add the following code:
with open(os.path.join(current_directory, "bowie_star_lyrics.txt")) as f:
    lyrics = f.read()
We did the following in the snippet above:
- We used the os.path.join() function to return the full path string for our bowie_star_lyrics.txt file.
- The built-in open() function created an object representation of our file with the path string.
- Inside the with block, the contents of the file were accessed with the .read() method and assigned to a variable called lyrics.
- Lastly, the file was closed after the program exited the with block.
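Lyrics often contain characters such as curly apostrophes, and open() falls back to a platform-dependent encoding when none is given. The sketch below passes encoding="utf-8" explicitly, on the assumption that the text file was saved as UTF-8; read_text is a hypothetical helper name, and the temporary file stands in for bowie_star_lyrics.txt.

```python
import os
import tempfile


def read_text(path):
    """Read a whole text file into one string, decoding it as UTF-8."""
    # An explicit encoding avoids depending on the platform default
    with open(path, encoding="utf-8") as f:
        return f.read()


# Demonstration with a temporary stand-in for bowie_star_lyrics.txt
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample_lyrics.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("It’s a bright star\n")
    lyrics = read_text(sample_path)

print(len(lyrics.split()))
```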
Let’s save our my_wordcloud.py file once more and go to the next step.
Step 5: Create and generate word cloud
It is now time to begin building the word cloud! To start, we will import the following from the wordcloud library like so:
import os
from PIL import Image
import numpy as np
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
Note: wordcloud and WordCloud are both case-sensitive.
The WordCloud class contains all of the methods we need for generating the word cloud. The STOPWORDS set lists superfluous filler words like “an”, “and”, “the”, and “they” so that they can be filtered out. More words can be added with the .add() method if desired. The ImageColorGenerator class will be utilized later on in this step.
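To make the idea of stop-word filtering concrete, here is a rough, library-free sketch of counting word frequencies while skipping filler words. It only illustrates the concept and is not the wordcloud library's actual implementation; DEMO_STOPWORDS and word_frequencies are hypothetical names.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only;
# the STOPWORDS set shipped with wordcloud is much larger.
DEMO_STOPWORDS = {"a", "an", "and", "the", "they", "in"}


def word_frequencies(text, stopwords):
    """Count words in text, ignoring case and skipping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stopwords)


sample_text = "The star and the sky. A star in the sky, a star!"
frequencies = word_frequencies(sample_text, DEMO_STOPWORDS)
print(frequencies.most_common(2))
```

The words with the highest counts are the ones a word cloud would draw largest.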
Let’s now create an instance of the WordCloud class and assign it to a variable with the following code:
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
…
wordcloud = WordCloud(background_color="white", mask=bowie_mask, collocations=False, stopwords=STOPWORDS, contour_color="white", contour_width=1)
We did the following in the snippet above:
- The bowie_mask we previously made was assigned to the mask parameter to “shape” our word cloud.
- We also assigned the STOPWORDS set to the (lowercase) stopwords parameter.
- The default background_color of a word cloud is black, so we changed it to "white".
- The collocations parameter was set to False to break up word combinations and better distribute the text throughout the word cloud.
- We outlined our word cloud with the contour_color and contour_width parameters.
Next, we will invoke the .generate() method with our wordcloud variable, passing in the lyrics string we defined earlier in this tutorial:
wordcloud = WordCloud(background_color="white", mask=bowie_mask, collocations=False, stopwords=STOPWORDS, contour_color="white", contour_width=1)
wordcloud.generate(lyrics)
print(wordcloud)
This will populate our word cloud with the words read from the bowie_star_lyrics.txt file, without the stop words.
If we save the file and run python3 my_wordcloud.py, the output should show that our wordcloud variable is stored in memory, confirming that the word cloud was created and generated correctly:
<wordcloud.wordcloud.WordCloud object at 0x7fb404433be0>
After confirming our wordcloud was created, let’s remove the print() statement.
The last matter we’ll cover in this step is attaching a color generator to our word cloud. Let’s add the following to the my_wordcloud.py file:
wordcloud.generate(lyrics)
image_colors = ImageColorGenerator(bowie_mask)
Note: An RGB image must be passed to the ImageColorGenerator class.
The image_colors object maps the color(s) of the generated words to the original image as closely as possible.
Let’s save the my_wordcloud.py file and head to the final step.
Step 6: Show word cloud on pyplot figure
The last library we will use for this tutorial is Matplotlib. The functions from the library’s pyplot interface will display our new word cloud. Let’s first import matplotlib.pyplot at the top of our file:
import os
from PIL import Image
import numpy as np
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
Like with NumPy, we use the alias plt as a shorthand for the tools we imported. Let’s now add three pyplot calls to the end of our file:
- plt.imshow(wordcloud, interpolation="bilinear") creates and draws the figure on which our wordcloud is placed; the "bilinear" setting helps smooth out the image.
- plt.axis("off") turns off the x- and y-axes and their labels.
- Lastly, plt.show() displays the figure with our word cloud.
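The drawing calls described above can be tried out on their own with a placeholder image. This is a minimal sketch, assuming Matplotlib and NumPy are installed; draw_wordcloud is a hypothetical helper name, and a blank array stands in for the wordcloud object.

```python
import numpy as np
import matplotlib.pyplot as plt


def draw_wordcloud(image):
    """Draw an image-like object on the current axes with the axes hidden."""
    plt.imshow(image, interpolation="bilinear")
    plt.axis("off")
    return plt.gca()


# Stand-in for the wordcloud object: any array-like image works with imshow()
placeholder = np.full((50, 80, 3), 255, dtype=np.uint8)
axes = draw_wordcloud(placeholder)
# In my_wordcloud.py, plt.show() would follow here to open the figure window.
```

The helper stops short of calling plt.show() so that it can also be used in scripts that save the figure instead of opening a window.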
Let’s save our my_wordcloud.py file and run it with python3 my_wordcloud.py. The following figure window should appear on our screen:
The word cloud figure should render onto the screen within a few seconds after running the file. The generated lyrics should be shaped like the Bowie image.
The last task in this tutorial is to apply coloring to the word cloud with the image_colors object. Let’s close the figure window and change the my_wordcloud.py file to use the .recolor() method, which will display our wordcloud with the same colors as the original image. If we save our my_wordcloud.py file and re-run it, we should see the word cloud drawn in the colors of the original image.
We have successfully rendered our first word cloud! All in all, we were able to generate a word cloud with fewer than 40 lines of code.
In this tutorial, we learned how to utilize various Python libraries and modules to create, mask, generate, and display a word cloud.