Learn

Just like we can use a scatter plot to examine the relationship between two numeric variables, we can use distribution plots to examine a numeric variable’s distribution of values.

Histograms

The most basic way to plot our data is to create a histogram. A histogram looks like a bar chart, but instead of having a bar for each category of a variable, it has a bar for sets of numeric values called bins. The height of the bar shows how many data points of the variable fall within that bin’s range of values.

We can create a histogram of the sales_totals column from our restaurant dataset df using the seaborn function sns.histplot().

sns.histplot(data=df, x='sales_totals')

This code will display a histogram with vertical bars. Using y instead of x will create a histogram with horizontal bars.

Seaborn sets the bins parameter to auto by default, but we can change the binning of values in a number of ways.

  • Number of bins: an integer for the number of bins to fit the data to
  • Bin breaks: a list of values for where bins should start and end
  • Reference rule: the name of a method for choosing the bin width, such as sturges, fd, or auto (which uses whichever of the sturges and fd rules gives the narrower bins)

Note that poorly chosen bin sizes can distort histograms, making it difficult to understand the histogram’s underlying data.

KDE plots

Another option for displaying a distribution is a kernel density estimation (KDE) plot. A KDE plot displays a continuous probability density curve for the distribution. This estimation looks a lot like a smoothed version of a histogram.

We can create a KDE plot of sales_totals using sns.kdeplot(). We can also set the optional parameter fill to True so that the area below the KDE curve is shaded.

sns.kdeplot(data=df, x='sales_totals', fill=True)

As with histograms, passing the variable as y instead of x will give the plot a horizontal orientation.

Box plots

Finally, let’s look at a plot that displays distributions for each category of a second variable. The box plot communicates specific information about each category’s distribution through a pattern of lines and a box, as shown in the following diagram:

Diagram of a box plot showing median, outliers, quartiles, minimum, and maximum.

Note: seaborn draws a horizontal box plot when the numeric variable is passed as x. Passing it as y instead produces a vertical box plot like the one in the previous diagram.

If we want to see the distribution of sales_totals for each day of the week, we can use sns.boxplot() as shown in the following code.

sns.boxplot(data=df, x='sales_totals', y='day')

Swapping the x and y parameters will change the orientation of the plot.
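
To connect the picture to the numbers, the sketch below computes each day's quartiles with pandas and then draws the matching box plot. The data is synthetic, and the three day labels are assumptions made for illustration. Note also that with seaborn's default settings the whiskers reach the most extreme points within 1.5 times the interquartile range of the box, and points beyond that are drawn individually as outliers.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the restaurant data (assumed for illustration only)
rng = np.random.default_rng(seed=2)
days = ["Mon", "Tue", "Wed"]
df = pd.DataFrame({
    "day": np.repeat(days, 50),
    "sales_totals": rng.gamma(shape=2.0, scale=50.0, size=150),
})

# First quartile, median, and third quartile of sales_totals for each day:
# these are the left edge, middle line, and right edge of each box
quartiles = (
    df.groupby("day")["sales_totals"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)

# One horizontal box per day
fig, ax = plt.subplots(figsize=(6, 3))
sns.boxplot(data=df, x="sales_totals", y="day", ax=ax)
```

Each row of the printed table should line up with the box drawn for that day.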

Instructions

1.

Run all initial code cells. Then create a histogram of the municipal solid waste (msw) of countries in the waste dataset. Do not specify the bins parameter; seaborn will calculate the number of bins automatically.

2.

Let’s see how the shape of the previous histogram changes when we decrease the number of bins. Repeat the plot from question 1 but set the bins parameter to 5.

3.

Now let’s see what the same distribution looks like when we use a KDE plot. Plot msw in a KDE plot with shading below the curve.

4.

Let’s visualize more detailed information like the median, quartiles, and outliers. Create a box plot of msw.

5.

Finally, let’s add a little more complexity to our box plot by displaying the msw distributions of countries from different income levels. Repeat the plot from question 4 but add income as the y parameter.
