Histograms let us visualize the distribution of a continuous variable, in contrast to bar plots, which show counts and other values for discrete and categorical variables. Histograms divide the values of a variable into bins, which are ranges of values that get counted together. For example, if a variable had values 1 through 100 and we specified that we want 5 bins, each bin would have a range of 100 / 5 = 20. The first bin would count the frequency of values 1 to 20, the second bin would count the frequency of values 21 to 40, and so on.
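To make the binning arithmetic concrete, here is a minimal sketch in base R that sorts the values 1 through 100 into five bins of width 20 and counts each bin. It uses cut() and table() rather than ggplot2, and the variable names are only illustrative.

```r
# Illustrative values: the integers 1 through 100
values <- 1:100

# Five bins of width 20; cut() uses right-closed intervals by default,
# so the first bin (0,20] holds the values 1 to 20
bin_edges <- seq(0, 100, by = 20)
bins <- cut(values, breaks = bin_edges)

# Count how many values fall in each bin
bin_counts <- table(bins)
print(bin_counts)
```

Each of the five bins holds 20 values, which is the flat shape a histogram of this variable would show.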
We can construct a histogram using geom_histogram(). The code below creates a histogram using R’s built-in airquality dataset, which contains daily atmospheric measurements from New York City. This histogram shows frequencies of Ozone values, a measure of the air pollution recorded on each day.
# Load ggplot2 so that ggplot(), aes(), and geom_histogram() are available
library(ggplot2)
airquality_histogram <- ggplot(airquality, aes(x = Ozone)) +
  labs(title = "Air Quality: Ozone Distribution") +
  geom_histogram()
This produces the following plot. We see that ozone levels are clustered towards the lower end of the range (a good thing!), though there were days with much higher ozone levels as well.
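One way to check this reading of the plot is to look at a few summary numbers. The sketch below uses only base R and assumes the built-in airquality data is unmodified; the variable names are illustrative. Note that Ozone contains missing values, which geom_histogram() drops with a warning.

```r
# Ozone readings from the built-in airquality data
ozone <- airquality$Ozone

# Days without an ozone reading; these are left out of the histogram
n_missing <- sum(is.na(ozone))

# Median and mean of the recorded readings
ozone_median <- median(ozone, na.rm = TRUE)
ozone_mean <- mean(ozone, na.rm = TRUE)

# Share of recorded days at or below the midpoint of the observed range
ozone_range <- range(ozone, na.rm = TRUE)
midpoint <- mean(ozone_range)
share_low <- mean(ozone <= midpoint, na.rm = TRUE)

print(c(missing = n_missing, median = ozone_median,
        mean = ozone_mean, share_low = share_low))
```

A median below the mean and a large share of days in the lower half of the range are both consistent with a distribution clustered at low values with a tail of high-ozone days.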
By default, ggplot2 automatically calculates 30 equally sized bins. Frequently we’ll want to specify a range per bin that better fits our data; for example, if we wanted to examine the distribution of weight in pounds for a population of house cats, it would make sense for each bin to represent one pound rather than some arbitrary decimal amount. We can set the width of bins using the binwidth argument. The code below creates the same plot as before, now with a binwidth of 10.
# Same plot as before, with each bin spanning 10 units of Ozone
airquality_histogram_binwidth <- ggplot(airquality, aes(x = Ozone)) +
  labs(title = "Air Quality: Ozone Distribution") +
  geom_histogram(binwidth = 10)
Take a look at our new plot with binwidth set to 10. Notice how the shape of the histogram is now smoother, with fewer local peaks.
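To see the effect of binwidth in numbers rather than by eye, the sketch below counts the bars each version of the plot draws. It assumes ggplot2 is installed and uses its layer_data() helper to pull out the computed bins; rows with missing Ozone values are filtered out first so both plots are built from the same observations. The names ozone_days, default_bins, and wide_bins are illustrative.

```r
library(ggplot2)

# Keep only days with a recorded ozone value
ozone_days <- airquality[!is.na(airquality$Ozone), ]

base_plot <- ggplot(ozone_days, aes(x = Ozone))

# layer_data() returns one row per bin, including a count column
default_bins <- layer_data(base_plot + geom_histogram())
wide_bins <- layer_data(base_plot + geom_histogram(binwidth = 10))

# Number of bars in each histogram
print(c(default = nrow(default_bins), binwidth_10 = nrow(wide_bins)))

# Both versions count the same observations, just grouped differently
print(c(default = sum(default_bins$count), binwidth_10 = sum(wide_bins$count)))
```

Fewer, wider bars average out day-to-day variation, which is why the second plot looks smoother.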
Instructions
Our workspace contains a dataset called rideshare_df describing rideshare trips in the city of Chicago. Examine this dataset by calling the head() function with rideshare_df as an argument. Click through the arrows in the table header to see all of the columns in this data frame.
Let’s visualize the distribution of total trip cost across the rideshare_df dataset. Construct a histogram called rideshare_histogram with the Trip.Total variable on the x axis. Note that we only supply the x variable in our aes() mapping because the y axis will automatically show frequency counts.
Print your plot after creating it to see what it looks like!
Create a similar plot called rideshare_histogram_binwidth, this time setting the binwidth to 5 to count trip totals in intervals of $5.
Print your plot again to see how it looks with our custom bin width.