Just like we can use a scatter plot to examine the relationship between two numeric variables, we can use distribution plots to examine a numeric variable’s distribution of values.
Histograms
The most basic way to plot a distribution is to create a histogram. A histogram looks like a bar chart, but instead of having a bar for each category of a variable, it has a bar for each range of numeric values, called a bin. The height of each bar shows how many data points of the variable fall within that bin’s range of values.
We can create a histogram of sales_totals from our restaurant dataset df using the seaborn function sns.histplot().
sns.histplot(data=df, x='sales_totals')
This code will display a histogram with vertical bars. Using y instead of x will create a histogram with horizontal bars.
Seaborn sets the bins parameter to auto by default, but we can change the binning of values in a number of ways:
- Number of bins: an integer for the number of bins to fit the data to
- Bin breaks: a list of values for where bins should start and end
- Reference rule: the name of a method to compute the optimal bin width, such as auto (which uses whichever of the sturges and fd reference rules gives more bins)
Note that poorly chosen bin sizes can distort histograms, making it difficult to understand the histogram’s underlying data.
KDE plots
Another option for displaying a distribution is a kernel density estimation (KDE) plot. A KDE plot displays a continuous probability density curve for the distribution. This estimation looks a lot like a smoothed version of a histogram.
We can create a KDE plot of sales_totals using sns.kdeplot(). We can also set the optional parameter fill to True so that the plot will be shaded below the KDE curve.
sns.kdeplot(data=df, x='sales_totals', fill=True)
As with histograms, using y instead of x will orient the plot horizontally.
Box plots
Finally, let’s look at a plot that displays distributions for each category of a second variable. The box plot communicates specific information about each category’s distribution through a pattern of lines and a box, as shown in the following diagram:
Note: seaborn will create a horizontal box plot when the numeric variable is passed as x, and a vertical box plot like the one in the previous diagram when it is passed as y instead.
If we want to see the distribution of sales_totals for each day of the week, we can use sns.boxplot() as shown in the following code.
sns.boxplot(data=df, x='sales_totals', y='day')
Swapping the x and y parameters will change the orientation of the plot.
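The following sketch draws the grouped box plot in both orientations and also computes each group's quartiles with pandas, as a rough numeric counterpart to what the boxes summarize. The day labels, the generated sales values, and the seed are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the restaurant dataset (assumed days and values)
rng = np.random.default_rng(seed=3)
days = ["Thu", "Fri", "Sat", "Sun"]
df = pd.DataFrame({
    "day": rng.choice(days, size=200),
    "sales_totals": rng.gamma(shape=2.0, scale=50.0, size=200),
})

fig, (ax_horizontal, ax_vertical) = plt.subplots(1, 2, figsize=(12, 4))

# Numeric variable on x, categories on y: horizontal boxes
sns.boxplot(data=df, x="sales_totals", y="day", order=days, ax=ax_horizontal)

# Swapping x and y: vertical boxes
sns.boxplot(data=df, x="day", y="sales_totals", order=days, ax=ax_vertical)

# Quartiles per day, one row per day and one column per quantile
quartiles = df.groupby("day")["sales_totals"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```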
Instructions
1. Run all initial code cells. Then create a histogram of the municipal solid waste (msw) of countries in the waste dataset. Do not specify the bins parameter; seaborn will calculate the number of bins automatically.
2. Let’s see how the shape of the previous histogram changes when we decrease the number of bins. Repeat the plot from question 1 but set the bins parameter to 5.
3. Now let’s see what the same distribution looks like when we use a KDE plot. Plot msw in a KDE plot with shading below the curve.
4. Let’s visualize more detailed information like the median, quartiles, and outliers. Create a box plot of msw.
5. Finally, let’s add a little more complexity to our box plot by displaying the msw distributions of countries from different income levels. Repeat the plot from question 4 but add income as the y parameter.