Histograms let us visualize the distribution of a continuous variable, in contrast to bar plots, which show counts and other values for discrete and categorical variables. Histograms divide the values of a variable into bins, which are ranges of values that get counted together. For example, if a variable had values ranging from 0 to 100 and we specified that we wanted 5 bins, each bin would have a range of 100 / 5 = 20. The first bin would count the frequency of values from 0 to 20, the second bin would count the frequency of values from 20 to 40, and so on.
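To make the binning arithmetic concrete, here is a minimal base R sketch. The values vector is made up for illustration, and cut() is used only to mimic the counting; it is not necessarily how ggplot2 computes its bins internally.

```r
# Hypothetical values between 0 and 100 (invented for illustration)
values <- c(3, 18, 25, 37, 41, 59, 62, 78, 85, 99)

# 5 bins across a range of 100 gives a bin width of 100 / 5 = 20
n_bins <- 5
bin_width <- (100 - 0) / n_bins
breaks <- seq(0, 100, by = bin_width)

# Assign each value to a bin, then count how many values fall in each one
bins <- cut(values, breaks = breaks, include.lowest = TRUE)
bin_counts <- table(bins)
print(bin_counts)
```

A value sitting exactly on a boundary, such as 20, has to land in one bin or the other. With the settings in this sketch, cut() places it in the lower bin.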
We can construct a histogram using geom_histogram(). The code below creates a histogram using R’s built-in airquality dataset containing atmospheric measurements from New York City. This histogram shows frequencies of Ozone values, measuring the amount of air pollution recorded within a given time period.
library(ggplot2)  # assumes ggplot2 is installed
airquality_histogram <- ggplot(airquality, aes(x = Ozone)) +
  labs(title = "Air Quality: Ozone Distribution") +
  geom_histogram()
airquality_histogram  # print the plot
This produces the following plot. We see that ozone levels are clustered towards the lower end of the range (a good thing!), though there were days with much higher ozone levels as well.
By default, ggplot2 uses 30 equally sized bins. Frequently we’ll want to specify a range per bin that better fits our data; for example, if we wanted to examine the distribution of weight in pounds for a population of house cats, it would make sense for each bin to represent one pound rather than some arbitrary decimal amount. We can set the width of bins using the binwidth argument. The code below creates the same plot as before, now with a binwidth of 10.
airquality_histogram_binwidth <- ggplot(airquality, aes(x = Ozone)) +
  labs(title = "Air Quality: Ozone Distribution") +
  geom_histogram(binwidth = 10)
Take a look at our new plot with binwidth set to 10. Notice how the shape of the histogram is now smoother, with fewer local peaks.
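As a rough check on how bin width relates to the number of bins, the sketch below divides the span of the Ozone values by the bin width. It also shows the bins argument of geom_histogram(), which sets the number of bins directly instead of their width. The variable names and the choice of 15 bins are illustrative, and the exact bin edges ggplot2 draws may differ slightly from this back-of-the-envelope count.

```r
library(ggplot2)

# Ozone contains missing values, so drop them before computing the range
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]
ozone_span <- diff(range(ozone))

# Approximate number of bins needed to cover the data at a width of 10
approx_bins <- ceiling(ozone_span / 10)
print(approx_bins)

# Alternative: request a number of bins rather than a bin width
airquality_histogram_bins <- ggplot(airquality, aes(x = Ozone)) +
  labs(title = "Air Quality: Ozone Distribution") +
  geom_histogram(bins = 15)
```

Choosing binwidth tends to be the more readable option when the units matter (for example, 10 units of ozone per bar), while bins is convenient when we only care about how coarse or fine the picture is.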
Our workspace contains a dataset called rideshare_df describing rideshare trips in the city of Chicago. Examine this dataset by calling the head() function with rideshare_df as an argument. Click through the arrows in the table header to see all of the columns in this data frame.
Let’s visualize the distribution of total trip cost across the rideshare_df dataset. Construct a histogram called rideshare_histogram with the Trip.Total variable on the x axis. Note that we only supply the x variable in our aes() mapping because the y axis will automatically show frequency counts.
Print your plot after creating it to see what it looks like!
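The automatic count on the y axis can also be written out explicitly. The sketch below uses a small made-up data frame (toy_trips is hypothetical and stands in for rideshare_df, whose contents are not shown here) and spells out the default mapping with after_stat(count), which should give the same binned counts as supplying x alone.

```r
library(ggplot2)

# Hypothetical stand-in for rideshare_df; the values are invented
toy_trips <- data.frame(
  Trip.Total = c(4.5, 7.25, 8, 12.5, 12.75, 15, 22, 31.5)
)

# Only x is mapped; geom_histogram() computes the counts for y
implicit_counts <- ggplot(toy_trips, aes(x = Trip.Total)) +
  geom_histogram(binwidth = 5)

# The same mapping with the computed count spelled out
explicit_counts <- ggplot(toy_trips, aes(x = Trip.Total, y = after_stat(count))) +
  geom_histogram(binwidth = 5)

# layer_data() returns the binned data that each plot would draw
implicit_data <- layer_data(implicit_counts)
explicit_data <- layer_data(explicit_counts)
print(identical(implicit_data$count, explicit_data$count))
```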
Create a similar plot called rideshare_histogram_binwidth, this time setting the binwidth to 5 to count trip totals in intervals of $5.
Print your plot again to see how it looks with our custom bin width.