Learn

Histograms let us visualize the distribution of a continuous variable, in contrast to bar plots, which show counts and other values for discrete and categorical variables. Histograms divide the values of a variable into bins, which are ranges of values that get counted together. For example, if a variable had values 1 through 100 and we specified that we want 5 bins, each bin would cover a range of 100 / 5 = 20. The first bin would count the frequency of values 1 to 20, the second bin would count the frequency of values 21 to 40, and so on.
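To make the binning arithmetic concrete, here is a minimal sketch in base R. It assumes a made-up variable holding the integers 1 through 100 (the names values, bin_edges, and bin_counts are ours, chosen only for illustration) and uses cut() to assign each value to one of 5 bins of width 20.

```r
# Hypothetical example data: the integers 1 through 100
values <- 1:100

# Edges for 5 bins of width 100 / 5 = 20.
# cut() makes each interval include its right endpoint,
# so the bins are (0,20], (20,40], ..., (80,100].
bin_edges <- seq(0, 100, by = 20)
bins <- cut(values, breaks = bin_edges)

# Count how many values fall into each bin
bin_counts <- table(bins)
print(bin_counts)
```

Because the values are spread evenly, every bin ends up with a count of 20. A histogram of this variable would therefore show five bars of equal height.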

We can construct a histogram using geom_histogram(). The code below creates a histogram using R’s built-in airquality dataset, which contains daily atmospheric measurements from New York City. The histogram shows how often each range of Ozone values occurs, where Ozone measures the ozone level recorded on a given day.

airquality_histogram <- ggplot(airquality, aes(x = Ozone)) +
  labs(title = "Air Quality: Ozone Distribution") +
  geom_histogram()

This produces the following plot. We see that ozone levels are clustered towards the lower end of the range (a good thing!), though there were days with much higher ozone levels as well.

[Figure: histogram of Ozone values using the default bins]

By default, ggplot2 automatically calculates 30 equally sized bins. Frequently we’ll want to specify a range per bin that better fits our data; for example, if we wanted to examine the distribution of weight in pounds for a population of house cats, it would make sense for each bin to represent one pound rather than some arbitrary decimal amount. We can set the width of bins using the binwidth argument. The code below creates the same plot as before, now with a binwidth of 10.

airquality_histogram_binwidth <- ggplot(airquality, aes(x = Ozone)) +
  labs(title = "Air Quality: Ozone Distribution") +
  geom_histogram(binwidth = 10)

Take a look at our new plot with binwidth set to 10. Notice how the shape of the histogram is now smoother, with fewer local peaks, because the values are grouped into fewer, wider bins.

[Figure: histogram of Ozone values with binwidth = 10]
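If we want to check what binwidth actually did, one option is to look at the bins ggplot2 computed. The sketch below is only an illustration: it assumes ggplot2 is installed, and the names ozone_df, default_hist, wide_hist, default_bins, and wide_bins are ours. It uses ggplot2's layer_data() helper, which returns the data a layer draws, with one row per bin.

```r
library(ggplot2)

# Drop missing Ozone readings up front so ggplot2 does not warn about removed rows
ozone_df <- airquality[!is.na(airquality$Ozone), ]

# Same two histograms as in the lesson, built on the filtered data
default_hist <- ggplot(ozone_df, aes(x = Ozone)) +
  geom_histogram(bins = 30)        # the default number of bins, stated explicitly
wide_hist <- ggplot(ozone_df, aes(x = Ozone)) +
  geom_histogram(binwidth = 10)    # each bin spans 10 units of Ozone

# layer_data() returns one row per bin, including its count
# and its lower (xmin) and upper (xmax) edges
default_bins <- layer_data(default_hist)
wide_bins <- layer_data(wide_hist)

nrow(default_bins)   # number of bins with the default setting
nrow(wide_bins)      # number of bins when binwidth = 10
head(wide_bins[, c("xmin", "xmax", "count")])
```

Each row of wide_bins should span exactly 10 units between xmin and xmax, and there are fewer rows than in default_bins, which is why the second plot looks smoother.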

Instructions

1.

Our workspace contains a dataset called rideshare_df describing rideshare trips in the city of Chicago. Examine this dataset by calling the head() function with rideshare_df as an argument. Click through the arrows in the table header to see all of the columns in this data frame.

2.

Let's visualize the distribution of total trip cost across the rideshare_df dataset. Construct a histogram called rideshare_histogram with the Trip.Total variable on the x axis. Note that we only supply the x variable in our aes() mapping because the y axis will automatically show frequency counts.

Print your plot after creating it to see what it looks like!

3.

Create a similar plot called rideshare_histogram_binwidth, this time setting the binwidth to 5 to count trip totals in intervals of $5.

Print your plot again to see how it looks with our custom bin width.
