Intermediate Data Visualization With ggplot2

Histograms In R

In R, the geom_histogram() function from the ggplot2 library will create a histogram. The binwidth argument sets the width of the bins in the histogram.

If the binwidth argument is not used, geom_histogram() creates 30 bins of equal width by default. That default rarely suits the data, so it is recommended to set binwidth explicitly so that the bin size matches the scale of the variable.

Histograms are used to visualize the distribution of a continuous variable.

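# Load ggplot2, which provides ggplot() and the geom_*() functions used in these examples.
library(ggplot2)

# Creates a histogram of the Ozone feature from the dataset airquality. In this case, each bin will have a width of 10.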
airquality_histogram_binwidth <-
ggplot(airquality, aes(x = Ozone)) +
geom_histogram(binwidth = 10)
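
As a rough check of how binwidth affects the bins, the sketch below builds the same histogram with and without binwidth and inspects the computed bins with layer_data(). The names ozone, default_bins, and wide_bins are illustrative and not part of the original example, and dropping rows with a missing Ozone value is an assumption made here to keep the output free of warnings.

```r
library(ggplot2)

# Assumption: remove rows where Ozone is NA before plotting.
ozone <- airquality[!is.na(airquality$Ozone), ]

# Same data, once with the default bins and once with binwidth = 10.
default_hist <- ggplot(ozone, aes(x = Ozone)) + geom_histogram()
wide_hist <- ggplot(ozone, aes(x = Ozone)) + geom_histogram(binwidth = 10)

# layer_data() returns the values ggplot2 computed for a layer, one row per bin.
default_bins <- layer_data(default_hist)
wide_bins <- layer_data(wide_hist)

nrow(default_bins)                   # number of bins with the default settings
nrow(wide_bins)                      # number of bins with binwidth = 10
range(wide_bins$xmax - wide_bins$xmin)  # width of the bins in the second plot
```

Comparing the two bin counts is a quick way to judge whether the chosen binwidth is too coarse or too fine for the data.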

Boxplots In R

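In R, the geom_boxplot() function from the ggplot2 library will create a boxplot. The aes() mapping should define both x and y: x is the grouping variable and y is the continuous variable being summarized. If the grouping variable is stored as a number, convert it with factor() so that one box is drawn per group.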

# Creates a boxplot using the airquality data frame where the Month feature is on the x axis and the Temp feature is on the y axis.
# Month is stored as a number, so it is wrapped in factor() to draw one box per month.
airquality_boxplot <-
ggplot(airquality, aes(x = factor(Month), y = Temp)) +
geom_boxplot()

The Fill Argument

When creating a stacked bar plot in R, the fill argument of aes() determines which feature is depicted as color-coded segments within the bars of the bar plot.

In the code example, each bar is broken into the different possible values found in the status feature.

# Using data from the data frame named df, a bar plot is created where each bar is broken into different colors based on the values found in the "status" column.
msleep_stackedbar <-
ggplot(df, aes(x = name, y = hours, fill = status)) +
geom_bar(stat = "identity")

The stat Parameter

When creating a bar chart in R, the geom_bar() function has a stat parameter that describes the values on the y axis of the bar chart. If stat = "identity", then the bar chart will display the values in the data frame as is. By default (stat = "count"), the bar chart will display the count of the rows for each x value in the data frame.

Instead of using geom_bar(stat = "identity"), you could use geom_col() to achieve the same results.

# The following two plots will produce the same results
ggplot(msleep_stacked_df, aes(x = name, y = hours, fill = status)) +
geom_bar(stat = "identity")

ggplot(msleep_stacked_df, aes(x = name, y = hours, fill = status)) +
geom_col()
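
To make the difference between the default count behavior and stat = "identity" concrete, here is a minimal sketch using two small made-up data frames. The names diets and totals and their values are hypothetical and exist only for illustration.

```r
library(ggplot2)

# Hypothetical raw data: one row per animal.
diets <- data.frame(diet = c("herbi", "herbi", "omni", "carni", "herbi", "omni"))

# Default stat ("count"): geom_bar() counts the rows for each diet.
count_plot <- ggplot(diets, aes(x = diet)) +
  geom_bar()

# Hypothetical pre-summarized data: one row per diet with the total already computed.
totals <- data.frame(diet = c("carni", "herbi", "omni"), n = c(1, 3, 2))

# geom_col() (equivalent to geom_bar(stat = "identity")) plots n as is.
identity_plot <- ggplot(totals, aes(x = diet, y = n)) +
  geom_col()

# Bar heights computed by each plot.
counted_heights <- layer_data(count_plot)$count
identity_heights <- layer_data(identity_plot)$y
```

Both plots draw bars of the same heights; the only difference is whether ggplot2 or the data frame supplies the totals.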

The Position Argument

When creating a bar plot and using the fill argument, you can specify how to arrange the segments using the position argument.

Setting position = "stack" will create a stacked bar plot where each bar is broken into multiple colors.

Setting position = "dodge" will create a clustered bar plot where bar segments are placed side by side rather than on top of each other.

# Creates a clustered bar plot. Each bar is broken into segments based on the status column. Those segments are placed side by side.
msleep_clusteredbar <-
ggplot(msleep_clustered_df, aes(x = name, y = hours, fill = status)) +
geom_bar(position = "dodge", stat = "identity")

Error Bars In R

Error bars can be added to a bar plot in R by using the geom_errorbar() function from the ggplot2 library.

This function should take an aes() with ymin and ymax values, which determine the lower and upper ends of each error bar.

# This makes a bar chart with error bars. The variables se.min and se.max are columns in the dataframe msleep_error_df that we previously calculated to store the minimum and maximum error values.
msleep_sebar <-
ggplot(msleep_error_df, aes(x = diet, y = mean.hours)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = se.min, ymax = se.max), width = 0.2)
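
The example above relies on columns that were calculated beforehand. One possible way to build such a data frame is sketched below. It assumes that the error is the standard error of the mean, that diet corresponds to the vore column of the msleep dataset shipped with ggplot2, and that hours corresponds to sleep_total; none of this is confirmed by the original example.

```r
library(ggplot2)

# Assumption: diet = msleep$vore and hours = msleep$sleep_total.
sleep <- msleep[!is.na(msleep$vore), c("vore", "sleep_total")]

# Mean and standard error of sleep_total for each diet group.
group_mean <- tapply(sleep$sleep_total, sleep$vore, mean)
group_se <- tapply(sleep$sleep_total, sleep$vore,
                   function(x) sd(x) / sqrt(length(x)))

# One row per diet, with the lower and upper ends of the error bar.
msleep_error_df <- data.frame(
  diet = names(group_mean),
  mean.hours = as.numeric(group_mean),
  se.min = as.numeric(group_mean - group_se),
  se.max = as.numeric(group_mean + group_se)
)

msleep_sebar <-
  ggplot(msleep_error_df, aes(x = diet, y = mean.hours)) +
  geom_col() +
  geom_errorbar(aes(ymin = se.min, ymax = se.max), width = 0.2)
```

Other error measures, such as the standard deviation or a confidence interval, could be stored in se.min and se.max in the same way.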

Customizing Discrete Axes In R

When creating a graph in R with discrete values, we can customize the axes using scale_x_discrete() and scale_y_discrete().

These functions have the argument limits, which takes a vector of strings. These strings will be the values shown on the axis, in the order that they appear in the vector.

# The labels on the x axis will be omni, carni, and herbi in that order.
msleep_discrete <-
msleep_start +
scale_x_discrete(limits = c("omni", "carni", "herbi"))
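
Since msleep_start is not defined in the snippet above, the following self-contained sketch shows the same idea with a small made-up data frame. The name totals and its values are hypothetical.

```r
library(ggplot2)

# Hypothetical summary data; by default the x axis would be ordered alphabetically.
totals <- data.frame(diet = c("carni", "herbi", "omni"), n = c(1, 3, 2))

# limits sets both which categories appear and the order they appear in.
ordered_plot <- ggplot(totals, aes(x = diet, y = n)) +
  geom_col() +
  scale_x_discrete(limits = c("omni", "carni", "herbi"))

# Positions assigned to the bars along the x axis.
bar_positions <- as.numeric(layer_data(ordered_plot)$x)
```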

Customizing Continuous Axes In R

When creating a graph in R with continuous axes, the scale_x_continuous() and scale_y_continuous() functions can customize those axes.

The breaks parameter takes a vector of values. Those values will be the tick marks shown on the axis.

The coord_cartesian() function can change the scale of axes. This function has two relevant parameters named xlim and ylim. Those parameters take vectors of two numbers that will be the endpoints of the axes.

# coord_cartesian() will set the visible range of the y axis of the msleep graph to be between 8 and 12. You can use this to effectively "zoom in" on a section of the graph.
msleep_final <-
msleep +
coord_cartesian(ylim = c(8, 12))
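
The breaks parameter described above has no example of its own, so here is a small sketch that combines it with coord_cartesian(). It uses the built-in airquality data frame; the plot name temp_plot and the chosen break values are illustrative assumptions.

```r
library(ggplot2)

# Scatter plot of temperature by day of the month.
temp_plot <- ggplot(airquality, aes(x = Day, y = Temp)) +
  geom_point() +
  # Tick marks on the y axis at 50, 60, ..., 100.
  scale_y_continuous(breaks = seq(50, 100, by = 10)) +
  # Show the y axis from 50 to 100 without discarding any data.
  coord_cartesian(ylim = c(50, 100))

# The layer still contains every row of the data frame.
plotted_points <- layer_data(temp_plot)
```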

Facets In R

When creating graphs in R, a graph can be split into different sections based on discrete variables using the facet_grid() function.

# Adding the call to facet_grid() to a visualization will split the visualization into different sections. In this case, different columns will be created based on the possible values in the "order" column.
final <- original +
facet_grid(cols = vars(order))
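
Because original is not defined in the snippet above, the following self-contained sketch applies facet_grid() to the built-in airquality data frame instead. The plot name monthly_temps is an illustrative assumption.

```r
library(ggplot2)

# One column of panels per month, each showing temperature by day.
monthly_temps <- ggplot(airquality, aes(x = Day, y = Temp)) +
  geom_line() +
  facet_grid(cols = vars(Month))

# Each plotted row records the panel it belongs to.
panel_count <- length(unique(layer_data(monthly_temps)$PANEL))
```

Passing rows = vars(Month) instead of cols would stack the panels vertically.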

Histograms And Bar Plots

Histograms are intended to visualize the distribution of a continuous variable. The height of each bar represents the number of observations that fall in that bin. Bar plots often represent counts of observations as well, but for the categories of a discrete variable instead.