This lesson covers intermediate data visualization concepts in R using the ggplot2 package.

In this lesson, we’ll explore a variety of different visualizations in R’s [ggplot2](https://ggplot2.tidyverse.org/) package. We’ll also go over different ways we can customize our plots to better communicate data insights. 

Let’s start with a familiar visualization: the scatter plot, which we can create using a `geom_point()` layer in `ggplot2`. Scatter plots are useful for visualizing the relationship between two variables. The scatter plot below shows the correlation between a movie’s IMDB rating and the number of awards it has won. Unsurprisingly, movies with higher ratings tend to win more awards! 

![Scatter Plot](http://content.codecademy.com/programs/analyze-data-with-r/data-visualization-in-r/intro-scatterplot.png)

Another common plot is the bar plot. Bar plots are useful for visualizing values for variables that can be counted, such as discrete variables (e.g. integers `1, 2, 3`) or categorical variables (e.g. `"cat", "dog", "mouse"`). We can create bar plots using a `geom_bar()` layer in `ggplot2`. The bar plot below shows how many car models of each class are present in the `mpg` dataset. In this lesson, we'll look at statistical transformations that allow us to display summary values in a bar plot, such as means, and different ways of positioning bars within our bar plots. 

![Bar Plot](https://content.codecademy.com/programs/analyze-data-with-r/data-visualization-in-r/intro-barplot.png)

While each geom can visualize many kinds of data, some geoms communicate certain insights more clearly than others. For example, if we wanted to see the distribution of height among the U.S. population, a bar plot wouldn’t be the best choice because height is a continuous variable with many possible values.

Sometimes we might also need to visualize data across more than two variables. In this lesson, we’ll cover the concept of “facets,” which let us show additional discrete variables (beyond those on the `x` and `y` axes) by dividing a plot into different sections.

Lastly, we’ll often want to adjust the axes of our plots, add error bars to show variance, and more. By the end of this lesson, you'll know how to create and customize many kinds of data visualizations using `ggplot2`. Let's get started!


Hours of Day by Sleep State, Diet, and Order

Introduction

Histograms let us visualize the distribution of a continuous variable, in contrast to bar plots, which show counts and other values for discrete and categorical variables. Histograms divide the values of a variable into bins, which are ranges of values that get counted together. For example, if a variable had values `1` through `100` and we specified that we wanted 5 bins, each bin would have a range of `100 / 5 = 20`. The first bin would count the frequency of values `1` to `20`, the second bin would count the frequency of values `21` to `40`, and so on.
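To make the binning arithmetic concrete, here is a minimal sketch in base R that counts the values `1` through `100` in five bins of width `20`. The names `values`, `bin_edges`, and `bin_counts` are illustrative only, and `cut()` is used here just to demonstrate the idea; it is not how `ggplot2` computes bins internally.

```r
# Values 1 through 100, as in the example above
values <- 1:100

# Five bins of width 20: (0, 20], (20, 40], ..., (80, 100]
bin_edges <- seq(0, 100, by = 20)
bins <- cut(values, breaks = bin_edges)

# Count how many values fall into each bin
bin_counts <- table(bins)
print(bin_counts)
```

Because the values are evenly spread, every bin ends up with the same count. With real data, the differing counts per bin are what give a histogram its shape.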

We can construct a histogram using `geom_histogram()`. The code below creates a histogram using R's built-in `airquality` dataset containing atmospheric measurements from New York City. This histogram shows frequencies of `Ozone` values, measuring the amount of air pollution recorded within a given time period.  

```r
# Load ggplot2 so that ggplot() and the geom layers are available
library(ggplot2)

airquality_histogram <- 
  ggplot(airquality, aes(x = Ozone)) +
  labs(title = "Air Quality: Ozone Distribution") +
  geom_histogram()
```
This produces the following plot. We see that ozone levels are clustered towards the lower end of the range (a good thing!), though there were days with much higher ozone levels as well.

![Histogram](http://content.codecademy.com/programs/analyze-data-with-r/data-visualization-in-r/hist_example-1.png)

By default, `ggplot2` automatically calculates 30 equally sized bins. Frequently we’ll want to specify a range per bin that better fits our data; for example, if we wanted to examine the distribution of weight in pounds for a population of house cats, it would make sense for each bin to represent one pound rather than some arbitrary decimal amount. We can set the width of bins using the `binwidth` argument. The code below creates the same plot as before, now with a `binwidth` of `10`.  

```r
airquality_histogram_binwidth <- 
  ggplot(airquality, aes(x = Ozone)) + 
  labs(title = "Air Quality: Ozone Distribution") +
  geom_histogram(binwidth = 10)
```
Take a look at our new plot with `binwidth` set to `10`. Notice how the shape of the histogram is now smoother, with fewer local peaks.

![Histogram](http://content.codecademy.com/programs/analyze-data-with-r/data-visualization-in-r/hist_example-2.png)

Histograms

Heatmaps let us visualize frequencies along two variables. A heatmap looks like a scatterplot, but uses color-coded squares rather than individual points to indicate how many cases occurred at the intersection of `x` and `y` value ranges. Like histograms, we can specify bin widths to control which ranges of values get counted together. 

Here’s an example heatmap using our `airquality` dataset, mapping `Ozone` values (ozone pollution) on the `x` axis and `Solar.R` values (solar radiation) on the `y` axis. Notice how each region in the heatmap is color-coded to the number of cases with values in the relevant bin ranges. In this dataset, there are many occurrences where both solar radiation levels and ozone levels are low. 

![Air Quality: Ozone and Solar Radiation](http://content.codecademy.com/programs/analyze-data-with-r/data-visualization-in-r/heatmap_example-1.png)

We can create heatmaps using the `geom_bin2d()` layer. This geom takes many of the same arguments as `geom_histogram()`, with slight differences given that a heatmap represents two variables. Like histograms, heatmaps automatically calculate 30 equally sized bins, which may not make sense for many datasets. To specify bin widths for each variable, we can pass a vector of widths instead of a single value, e.g. `geom_bin2d(binwidth = c(1, 5))` to set `x` axis bin widths to `1` and `y` axis bin widths to `5`.

The code below constructs the heatmap shown at the start of this exercise. We set the `binwidth` of both axes using a vector, specifying that each axis should use bins of width `25`. 

```r
airquality_heatmap <- 
  ggplot(airquality, 
         aes(x = Ozone, y = Solar.R)) + 
  labs(title = "Air Quality: Ozone and Solar Radiation") +
  geom_bin2d(binwidth = c(25, 25))
```

Heatmaps

Box plots, also known as box-and-whisker plots, show the distribution of data by quartiles. Box plots are useful in showing how much a variable varies across values of another variable -- are most cases similar in value, or is there a wide range between the highest and lowest values? 

In the box plot below, we see the distribution of temperatures for different months within a subset of the `airquality` dataset. As we would expect for New York City, the summer months have the highest temperatures. The line inside the box marks the median temperature. The upper and lower edges of the box show the 75th and 25th percentiles, respectively. Each whisker extends from the box to the most extreme value that lies within 1.5 times the interquartile range (the distance between the 75th and 25th percentiles) of the box edge. Values beyond the whiskers are outliers, shown as individual points.

![Box Plot: Temperature by Month](http://content.codecademy.com/programs/analyze-data-with-r/data-visualization-in-r/anotated_box_plot.svg)

We can create a box plot using the `geom_boxplot()` layer. The code below creates the box plot shown above, visualizing temperature by month in the `airquality` data.
```r
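# Month is stored as a number in airquality, so we convert it to a factor
# to draw one box per month
airquality_boxplot <- 
  ggplot(airquality, 
         aes(x = factor(Month), y = Temp)) + 
  labs(title = "Air Quality: Temperature by Month", x = "Month") +
  geom_boxplot()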
```
Note that box plots show medians, not means. We'll cover how to display mean values using bar plots later in this lesson.
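To see where the parts of a box come from, the sketch below computes the quartiles and whisker limits by hand for a single month. It uses base R's `quantile()` with its default settings, which is an assumption made for illustration; the variable names are hypothetical, and the plotted values may differ slightly depending on how quantiles are calculated.

```r
# Temperatures recorded in July in the built-in airquality dataset
july_temps <- airquality$Temp[airquality$Month == 7]

# Quartiles: the box spans q1 to q3, with a line at the median
q1 <- quantile(july_temps, 0.25, names = FALSE)
med <- median(july_temps)
q3 <- quantile(july_temps, 0.75, names = FALSE)
iqr <- q3 - q1

# Whiskers may reach at most 1.5 * IQR beyond the edges of the box ...
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# ... but they stop at the most extreme observation inside that range
lower_whisker <- min(july_temps[july_temps >= lower_fence])
upper_whisker <- max(july_temps[july_temps <= upper_fence])

# Any observation beyond the fences would be drawn as an outlier point
outliers <- july_temps[july_temps < lower_fence | july_temps > upper_fence]
```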

Box Plots

Many times we are interested in seeing percentages within our data or how different values add up. We can do this using a stacked bar plot. 

Let’s turn to the `msleep` dataset included in `ggplot2` describing the number of hours spent asleep vs awake for various animals. We have pre-processed this data to include just the variables we care about. Take a look at the output panel to see how this dataset looks. 
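The exact pre-processing steps are not shown in this lesson. As a rough sketch, one way to get a data frame with the `name`, `order`, `diet`, `status`, and `hours` columns used in the following examples is to reshape the original `msleep` columns `sleep_total` and `awake` into a long format. The use of `tidyr::pivot_longer()` and the name `msleep_long` are assumptions made for illustration.

```r
library(ggplot2)  # provides the msleep dataset
library(dplyr)
library(tidyr)

# Keep the columns we need, renaming vore to diet and sleep_total to asleep
msleep_long <- ggplot2::msleep %>%
  select(name, order, diet = vore, asleep = sleep_total, awake) %>%
  # Stack the asleep and awake columns into status/hours pairs,
  # giving two rows per animal
  pivot_longer(cols = c(asleep, awake), 
               names_to = "status", 
               values_to = "hours")

head(msleep_long)
```

Each animal now appears twice, once with `status` equal to `"asleep"` and once with `"awake"`, which is the shape a stacked bar plot with a `fill` variable needs.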

Now, suppose we want to show the number of hours spent awake versus asleep each day for members of the order Proboscidea, i.e. elephants. We want to show this in a stacked bar plot because the hours awake versus asleep always add up to 24, and we are interested in depicting the proportion of each day spent in each state. The plot below tells us that for both African and Asian elephants, the vast majority of their day is spent awake.

![Stacked Bar Plot: Hours of Day by Sleep State](http://content.codecademy.com/programs/analyze-data-with-r/data-visualization-in-r/stackedbar_example-1.png)

The code below creates the plot we just saw. We specify a `fill` variable in our `aes()` mapping to tell `ggplot2` which variable should be depicted as color-coded segments within our stacked bars. Adding `stat = "identity"` to `geom_bar()` displays the values in our data frame as is, rather than displaying counts. 

```r
# Filter our data to include only elephants
# (filter() and %>% come from the dplyr package)
msleep_stacked_df <- msleep %>%
  filter(order == "Proboscidea")

# Construct a stacked bar plot
msleep_stackedbar <- 
  ggplot(msleep_stacked_df, 
         aes(x = name, 
             y = hours, 
             fill = status)) + 
  geom_bar(stat = "identity") +
  labs(title = "Hours of Day by Sleep State")
```
We can explicitly add `position = "stack"` in our geom telling it to stack different values of the `fill` variable on top of each other; this is also assumed by default if we don’t specify any positioning. 
```r
# This creates the same plot!
msleep_stackedbar <- 
  ggplot(msleep_stacked_df, 
         aes(x = name, 
             y = hours, 
             fill = status)) + 
  geom_bar(position = "stack", 
           stat = "identity") +
  labs(title = "Hours of Day by Sleep State")
```
We can also create this same plot using the `geom_col()` geom, which works just like `geom_bar()` except it assumes `stat = "identity"` by default.
```r
# This also creates the same plot!
msleep_stackedbar <- 
  ggplot(msleep_stacked_df, 
         aes(x = name, 
             y = hours, 
             fill = status)) + 
  geom_col() +
  labs(title = "Hours of Day by Sleep State")
```

Stacked Bar Plots

Instead of stacking our `fill` variable in our bar graphs, we can also represent values of the variable side by side. This is known as a clustered bar plot. 

The plot below visualizes our `msleep` data for elephants as a clustered bar plot instead of a stacked bar plot. Notice how the values shown are the same as in our stacked bar plot, but the positioning is now different. 

![Clustered Bar Plot: Hours of Day by Sleep State](http://content.codecademy.com/programs/analyze-data-with-r/data-visualization-in-r/clusteredbar_example-1.png)

We can easily create a clustered bar plot by setting `position = "dodge"` instead of `position = "stack"` in our geom. The `dodge` positioning tells `ggplot2` to display each value of the `fill` variable next to each other (as opposed to on top of each other in `"stack"`). Once again, we can either use `geom_bar(stat = "identity")` or `geom_col()`.

```r
# Filter our data to include only elephants
msleep_clustered_df <- msleep %>%
  filter(order == "Proboscidea")

# Construct a clustered bar plot
msleep_clusteredbar <- 
  ggplot(msleep_clustered_df, 
         aes(x = name, 
             y = hours, 
             fill = status)) + 
  labs(title = "Hours of Day by Sleep State") +
  geom_bar(position = "dodge", 
           stat = "identity")

# This creates the same plot!
msleep_clusteredbar <- 
  ggplot(msleep_clustered_df, 
         aes(x = name, 
             y = hours, 
             fill = status)) + 
  labs(title = "Hours of Day by Sleep State") +
  geom_col(position = "dodge")
```

Clustered Bar Plots

By default, bar plots using `geom_bar()` show the count of observations for each value. We can also show other types of data, such as calculating and showing the mean instead.

Let’s say we want to see how much an animal sleeps on average by the kind of food it eats, based on the `msleep` dataset. The code below does just that! Recall that passing `stat = "identity"` to a `geom_bar()` layer tells `ggplot2` to display values as is, rather than count the number of occurrences. We can similarly use `stat = "summary"`, which tells `ggplot2` to summarize values according to a provided function. We can specify `fun = "mean"` to summarize our `y` axis variable by calculating mean values for each value in our `x` axis variable. 

```r
# Filter our data to include only hours spent asleep, omitting NA values
msleep_means_df <- msleep %>%
  filter(status == "asleep") %>%
  na.omit()

# Construct a bar plot calculating and displaying means
msleep_meanbar <- 
  ggplot(msleep_means_df, 
         aes(x = diet, 
             y = hours)) + 
  labs(title="Mean Hours Asleep by Diet") +
  geom_bar(stat = "summary", 
           fun = "mean")
```
Here's how this plot looks. In the `msleep` dataset, insectivores (animals that eat insects) sleep for fifteen hours a day on average, which is far more than animals with other diets!

![Bar Plot: Mean Hours Asleep by Diet](http://content.codecademy.com/programs/analyze-data-with-r/data-visualization-in-r/meanbar_example-1.png)

Statistical Summaries

Often, we’ll want to show not only the mean of a value but also its standard error. The standard error tells us how precisely each mean is estimated -- it is small when values cluster tightly around the average or when a group has many observations, and large when values are widely spread or observations are few.

To add error bars, we need to first calculate our standard errors. Continuing with our `msleep` dataset, the code below calculates means and standard errors for hours asleep by diet. We compute new variables `mean.hours` representing the mean of `hours`, `mean.se` representing the standard error of our means, and `se.min` and `se.max` representing the lower and upper bounds of our error range. 

```r
# std.error() is assumed to come from the plotrix package
msleep_error_df <- msleep %>%
  filter(status == "asleep") %>%
  na.omit() %>%
  group_by(diet) %>%
  summarize(mean.hours = mean(hours), 
            mean.se = std.error(hours)) %>%
  mutate(se.min = mean.hours - mean.se, 
         se.max = mean.hours + mean.se)
```
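For reference, the standard error of a mean is the sample standard deviation divided by the square root of the number of observations. The sketch below computes it in base R; `standard_error()` and the sample `hours` values are hypothetical and serve only to show the calculation that a helper such as `std.error()` is assumed to perform.

```r
# Standard error of the mean: sd / sqrt(n), ignoring missing values
standard_error <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# Made-up hours of sleep for four animals in one diet group
hours <- c(10, 12, 14, 16)

mean_hours <- mean(hours)
se_hours <- standard_error(hours)

# Lower and upper bounds of the error bar for this group
se_min <- mean_hours - se_hours
se_max <- mean_hours + se_hours
```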
We can then create our bar plot showing means and standard errors as follows. We add a `geom_errorbar()` layer specifying `ymin` and `ymax` variables for our error bar and set its width to `0.2` of the bar’s width. Because we’ve already calculated mean values in our summary data frame, we specify `stat = "identity"` to display the `mean.hours` values as is. 
```r
msleep_sebar <- 
  ggplot(msleep_error_df, 
         aes(x = diet, 
             y = mean.hours)) + 
  geom_bar(stat = "identity") + 
  geom_errorbar(aes(ymin = se.min, 
                    ymax = se.max), 
                width = 0.2) + 
  labs(title = "Mean Hours Asleep by Diet")
```
This produces the following plot, which should look familiar from the last exercise! Now that we have error bars, we can see how much variation there is around the mean hours asleep for each diet group.

![Error Bars: Mean Hours Asleep by Diet](http://content.codecademy.com/programs/analyze-data-with-r/data-visualization-in-r/errorbar_example-1.png)

Error Bars

Frequently, we’ll want to customize our axes to represent our data more clearly. For discrete variables, such as categories on the `x` axis of a bar plot, we may want to specify a particular order these values appear in. Or, we might want to rename the axes labels so they better describe each value.

We can customize discrete variables on the `x` axis with the `scale_x_discrete()` layer. There is also a `scale_y_discrete()` layer that works the same way for variables on the `y` axis. 

Let’s take a look at the hours of sleep by diet plot we created in the last exercise. We can pass a vector to the `limits` argument in `scale_x_discrete()` to specify that we only want to show bars for `c("omni", "carni", "herbi")`, in that order rather than the default alphabetical ordering. 
```r
msleep_start <- 
  ggplot(msleep_error_df, 
         aes(x = diet, 
             y = mean.hours)) + 
  geom_bar(stat = "identity") + 
  geom_errorbar(aes(ymin = se.min, 
                    ymax = se.max), 
                width = 0.2) + 
  labs(title = "Mean Hours Asleep by Diet")

msleep_discrete <- 
  msleep_start +
  scale_x_discrete(
    limits = c("omni", "carni", "herbi"))
```
This produces the following plot. `"insecti"` is now omitted on the `x` axis. `"omni"` appears on the left, whereas it would appear on the right in the default alphabetical ordering.

![Discrete Axes Limits: Mean Hours Asleep by Diet](http://content.codecademy.com/programs/analyze-data-with-r/data-visualization-in-r/discrete_example-1.png)

We might also want the labels to be more descriptive -- for example, `"carnivore"` is a more familiar word compared to `"carni"`. We can pass a vector of value-to-label mappings to the `labels` argument in `scale_x_discrete()` to specify what we want to show for each existing value. 
```r
msleep_discrete <- 
  msleep_start + 
  scale_x_discrete(
    limits = c("omni", "carni", "herbi"), 
    labels = c("carni" = "Carnivore", 
               "herbi" = "Herbivore", 
               "omni" = "Omnivore"))
```
This produces the following plot. Now the labels are much more clear! The ordering we specify in `limits` is still preserved. 

![Discrete Axes Labels: Mean Hours Asleep by Diet](http://content.codecademy.com/programs/analyze-data-with-r/data-visualization-in-r/discrete_example-2.png)

Customizing Discrete Axes

Similarly to discrete variables, we can add a `scale_x_continuous()` layer to customize continuous variables on our `x` axis, or a `scale_y_continuous()` layer to customize continuous variables on our `y` axis. We can also add a `coord_cartesian()` layer to specify the range of values shown on a given axis, allowing us to zoom in and out of the plot region.

Continuing with our plot of sleep hours by diet, let’s set the `y` axis to be between `8` and `12`. Most animals sleep at least a few hours a day, so we don't need our `y` axis to start exactly at `0`. We can adjust our `y` axis range by adding a `coord_cartesian()` layer and passing a vector `c(8, 12)` to its `ylim` argument, specifying the min and max values of the `y` axis.

```r
msleep_start <- 
  ggplot(msleep_error_df, 
         aes(x = diet, 
             y = mean.hours)) + 
  geom_bar(stat = "identity") + 
  geom_errorbar(aes(ymin = se.min, 
                    ymax = se.max), 
                width = 0.2) + 
  scale_x_discrete(
    limits = c("omni", "carni", "herbi"), 
    labels = c("carni" = "Carnivore", 
               "herbi" = "Herbivore", 
               "omni" = "Omnivore")) +
  labs(title = "Mean Hours Asleep by Diet")

msleep_continuous <- 
  msleep_start + 
  coord_cartesian(ylim = c(8, 12))
```

Now our plot looks like this. Our `y` axis begins at `8` and ends at `12` as we specified. The differences between each bar’s height are much more obvious -- omnivores sleep the most on average, while herbivores sleep the least!

![Axes Ranges: Mean Hours Asleep by Diet](http://content.codecademy.com/programs/analyze-data-with-r/data-visualization-in-r/continuous_example-1.png)
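It can help to see why we zoom with `coord_cartesian()` rather than setting limits on the `y` scale itself. The sketch below uses a small made-up data frame (`toy_df` and its values are illustrative and do not come from `msleep`) to contrast the two approaches.

```r
library(ggplot2)

# Made-up group means, for illustration only
toy_df <- data.frame(diet = c("a", "b", "c"), 
                     mean.hours = c(9, 10, 11))

base_plot <- 
  ggplot(toy_df, aes(x = diet, y = mean.hours)) + 
  geom_col()

# Zooming: all of the data is kept, and only the visible window changes
zoomed <- base_plot + 
  coord_cartesian(ylim = c(8, 12))

# Scale limits: values outside 8 to 12 are treated as missing.
# Every bar starts at 0, which is outside the limits, so the bars are
# dropped with a warning when this plot is drawn.
clipped <- base_plot + 
  scale_y_continuous(limits = c(8, 12))
```

Printing `zoomed` shows the tops of all three bars, while printing `clipped` shows an empty panel. This is why `coord_cartesian()` is the safer choice for bar plots.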

We can also customize the labels shown for the `y` axis tick marks using the `breaks` argument of the `scale_y_continuous()` layer. In the code below, we pass a vector `c(8, 10, 12)` to specify that we only want tick marks on the `y` axis to appear at `8`, `10`, and `12`. 

```r
msleep_continuous <- 
  msleep_start + 
  coord_cartesian(ylim = c(8, 12)) +
  scale_y_continuous(breaks = c(8, 10, 12))
```

Here is our new plot: 

![Axes Ranges: Mean Hours Asleep by Diet](http://content.codecademy.com/programs/analyze-data-with-r/data-visualization-in-r/continuous_example-2.png)

Finally, we can apply custom transformations to our tick mark labels. Let’s say we want to add a unit of measurement "hrs" to each number on the `y` axis. We can pass a custom function to the `labels` argument of `scale_y_continuous()`. Here, we are telling `ggplot2` to take the automatic labels and add the characters `" hrs"` after each label. 
```r
show_as_hours <- function(x) {
  output <- paste0(x, " hrs")
  return(output)
}

msleep_continuous <- 
  msleep_start + 
  coord_cartesian(ylim = c(8, 12)) +
  scale_y_continuous(labels = show_as_hours, 
                     breaks = c(8, 10, 12))
```
Here’s what our plot with its newly labeled `y` axis looks like: 

![Axes Ranges: Mean Hours Asleep by Diet](http://content.codecademy.com/programs/analyze-data-with-r/data-visualization-in-r/continuous_example-3.png)

Customizing Continuous Axes

Facets allow us to visualize multiple discrete variables in one plot, showing each value of the facet variable in a different section.  

The plot below shows our familiar hours slept by diet plot, this time with the addition of a third variable describing the taxonomic order of animals in our data. We see that rodents of all three diet types sleep much more than primates! 

![Facets: Mean Hours Asleep by Diet and Order](http://content.codecademy.com/programs/analyze-data-with-r/data-visualization-in-r/facet_example-1.png)

We can add facets to our plot by adding a `facet_grid()` layer and specifying the variables it maps to. To show values of a facet variable as rows, we specify `facet_grid(rows = vars(row.var))`. To show values of a facet variable as columns, we specify `facet_grid(cols = vars(col.var))`. We can also show two facet variables in a grid, e.g. `facet_grid(rows = vars(row.var), cols = vars(col.var))`.

The code below produces the plot shown at the beginning of this exercise. First, we process our data frame to include the records we want and calculate mean hours asleep. Next, we create our sleep by diet plot, this time mapping the `order` variable to a `facet_grid()` layer split into columns. 
```r
msleep_facets_df <- msleep %>%
  filter(status == "asleep") %>%
  na.omit() %>%
  group_by(diet, order) %>%
  summarize(mean.hours = mean(hours)) %>%
  filter(order %in% c("Primates", "Rodentia"))

msleep_facet <- 
  ggplot(msleep_facets_df, 
         aes(x = diet, y = mean.hours)) + 
  geom_bar(stat = "identity") + 
  labs(title = "Mean Hours Asleep by Diet") +
  scale_x_discrete(
    limits = c("omni", "carni", "herbi"),
    labels = c("carni" = "Carnivore",
               "herbi" = "Herbivore", 
               "omni" = "Omnivore")) +
  facet_grid(cols = vars(order))
```
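The example above facets by a single variable. As a sketch of a two-variable grid, the code below uses the `mpg` dataset that ships with `ggplot2` instead of our pre-processed sleep data; the plot name `mpg_grid` and the choice of variables are illustrative only.

```r
library(ggplot2)

# One panel per combination of drive type (rows) and cylinder count (columns)
mpg_grid <- 
  ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  labs(title = "Highway Mileage by Engine Size") +
  facet_grid(rows = vars(drv), cols = vars(cyl))
```

Panels for combinations that have no observations are still drawn, but they stay empty.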

Facets

You've completed the Data Visualization in R lesson! You now know how to choose and implement different kinds of geoms in `ggplot2`, how to customize your plot axes, and how to visualize additional variables through facets. 

Below is a summary of the key concepts you learned -- great job!

**How to create different geoms and when to use which type:**
  * Histograms can be created using `geom_histogram()` to show the distribution of a continuous variable.
  * Heatmaps can be created using `geom_bin2d()` to show the distribution of the intersections of two continuous variables.
  * Box-and-whisker plots can be created using `geom_boxplot()` to show the distribution of a continuous variable by quartiles, i.e. the 25th, 50th, and 75th percentiles.
  * Bar plots can be created using `geom_bar()`, which shows the count of observations for different values of a discrete variable by default. `geom_col()` will create bar plots showing the value of the variable on the `y` axis rather than counts.
  * Using the `position` argument and a `fill` aesthetic mapping, we can create stacked bar plots (`position = "stack"`), stacked bar plots showing ratios (`position = "fill"`), and clustered bar plots (`position = "dodge"`).
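Of the three positions listed above, `position = "fill"` is the only one not demonstrated earlier in this lesson. Here is a minimal sketch using the `mpg` dataset included in `ggplot2`; the plot name `mpg_fillbar` and the choice of variables are illustrative only.

```r
library(ggplot2)

# Each bar is scaled to a total height of 1, so segments show proportions
mpg_fillbar <- 
  ggplot(mpg, aes(x = class, fill = drv)) + 
  geom_bar(position = "fill") + 
  # Show the y axis as percentages instead of proportions
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Share of Drive Types by Car Class", 
       y = "Percentage of Total")
```

Because every bar has the same height, this position makes it easy to compare proportions across groups, though it hides how many observations each group contains.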

**How to show different statistics in our data:**
  * The `stat` argument allows us to display different kinds of values.
  * `stat = "identity"` will show the `y` axis variable values on a bar plot as is, rather than displaying the `x` axis value counts.
  * `stat = "summary"` combined with a function supplied in `fun` will display bar heights based on the summary function. For example, `stat = "summary", fun = "mean"` will calculate and display means.

**How to add error bars to bar plots to show variance around a mean:**
  * `geom_errorbar()` creates error bars on bar plots when provided `ymin` and `ymax` variables representing the lower and upper bounds of error ranges.

**How to customize discrete and continuous axes:**
  * We can customize discrete axes using `scale_x_discrete()` and `scale_y_discrete()`. 
  * We can customize continuous axes using `scale_x_continuous()` and `scale_y_continuous()`.
  * We can zoom in on a region of our data using `coord_cartesian()`.

**How to show additional variables in panels of a grid using facets:**
  * By adding `facet_grid()`, we can map up to two additional variables along facet columns and rows.

Review

Intermediate Data Visualization in R

Dive deeper into ggplot2 and learn how to make a variety of different types of visualizations

Intermediate Data Visualization With ggplot2

It is a good practice to specify a `binwidth` that fits your data. Otherwise `ggplot2` will automatically calculate 30 bins by default, which may not be ideal.  

By default, `ggplot2` will automatically calculate the best bin width for your data, creating an arbitrary number of bins that differs each time. 


Histograms show the mean values of observations within each bin. 


Histograms are better-suited for discrete variables than continuous variables.

We want to see the median, 75th, and 25th percentiles of height by gender.

We want to see the count of individuals by their height in inches across the entire U.S. population.


We want to see the mean height in the U.S. by gender.


We want to see how many women in the U.S. are above and below 6 feet tall. 


The color of bars. When setting `fill` to a variable, the interior colors of bar segments will be color-coded according to values of that variable. 

The opacity of the color filled in within bars. When setting `fill` to a variable, higher values will result in more opacity. 


The outline color for bars. When setting `fill` to a variable, the outline colors of bar segments will be color-coded according to values of that variable. 


The pattern for bars. When setting `fill` to a variable, bars will be filled with different patterns according to values of that variable. 


If no `position` argument is supplied but a `fill` variable is provided in the `aes()` mapping, `position = "dodge"` is implied in `geom_bar()`, creating a clustered bar plot.


Adding `position = "dodge"` to a `geom_bar()` layer will create a clustered bar plot with bar segments side by side.


Adding `position = "stack"` to a `geom_bar()` layer will create a stacked bar plot with bar segments on top of each other.


If no `position` argument is supplied but a `fill` variable is provided in the `aes()` mapping, `position = "stack"` is implied.


If `stat` is not supplied in a `geom_bar()` geom, the plot will calculate and display means by default. 


`stat = "identity"` tells `ggplot2` to display values of a variable as is in a bar plot.


`stat = "summary"` combined with a `fun` argument will run the specified function to display summary statistics for a variable, such as means.


`geom_col()` creates the same visualization as `geom_bar(stat = "identity")`.


We need to tell the `geom_errorbar()` layer which variables represent the upper and lower bounds of our error bars. 


When creating a bar plot that shows means using `stat = "summary"`, we can automatically generate error bars by adding `se = TRUE`.


`geom_errorbar()` layers do NOT take any `aes()` mappings.


By default, error bars created by `geom_errorbar()` will have variable widths based on what is most visually appealing for the given dataset. 


We can set the `pct` argument in `scale_y_discrete()` to `TRUE` such that labels appear with a percentage sign after the number. 


We can set the `labels` argument in `scale_y_discrete()` to a custom function such that axis labels are transformed according to that function.


We can use a `coord_cartesian()` layer to adjust the axis minimum and maximum by zooming in on a specified range of data. 


We can use the `breaks` argument in `scale_y_continuous()` to set the axis breaks, i.e. the tick marks and associated labels shown.


This quiz will evaluate learners' understanding of the content covered in the Data Visualization in R lesson.

Create a series of data visualizations to explore the distribution of museums by geographies, subject matter, and annual revenue.

Let’s start by loading our dataset. We’ve provided a file named `museums.csv`. Load this file into a data frame named `museums_df`.


Take a look at the head of this data frame. Make sure to click through using the arrows in the header to see all the available columns. 

The `Museum.Name` column represents the name of each individual institution, while the `Legal.Name` column represents the name of each institution's parent entity. For example, if `"Codecademy University"` has two museums on campus, each of those museums would have their own names under `Museum.Name` and both would share the same `Legal.Name` i.e. `"Codecademy University"`.

In this section, we'll explore the distribution of institutions in our dataset by type. Our data frame contains a column called `Museum.Type` describing what kind of museum each location is -- a history museum, a zoo, an aquarium, etc. Create and print a bar plot called `museum_type` that maps `Museum.Type` to the `x` axis and counts the frequency of each type on the `y` axis. Which category is most common?


The plot we just created is hard to read because our categories are long. Add a `scale_x_discrete()` layer to customize our `x` axis, using the function `scales::wrap_format(8)` to reformat our labels. 

`wrap_format()` is a function from the `scales` package, which is installed alongside `ggplot2`. By setting the value of `wrap_format()` to `8`, we are telling it to wrap each label at a width of about `8` characters per line.
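To get a feel for what the wrapping function does before applying it to a plot, we can call it directly on a string. In the sketch below, the label text and the names `wrap_labels` and `wrapped` are made up for illustration.

```r
# Build a labelling function that wraps text at a width of 8
wrap_labels <- scales::wrap_format(8)

# A made-up category name, used only for illustration
long_label <- "Natural History Museum"
wrapped <- wrap_labels(long_label)

# cat() prints the text with line breaks inserted between words
cat(wrapped)
```

The result is the same text with line breaks in place of some spaces. Passing a function like this to the `labels` argument of a scale applies the same transformation to every axis label.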

Now we should be able to see which category is most common. Great job! Give yourself a pat on the back.

We've included a boolean (`TRUE` or `FALSE`) column in our data frame called `Is.Museum`. The `TRUE` category includes typical museums like art, history, and science museums. The `FALSE` category includes zoos, aquariums, nature preserves, and historic sites, which are included in this data but aren't what most people think of when they hear the word "museum."

Create a new bar plot called `museum_class`, mapping `Is.Museum` to the `x` axis. Since "TRUE" and "FALSE" aren't very descriptive, use `scale_x_discrete()` to rename the `x` axis labels to more easily understood terms -- for example, "Museum" vs "Non-Museum". 

Instead of looking at the distribution across the entire United States, maybe we're just interested in a few states. Filter `museums_df` to include a few states you might be interested in, using the `State..Administrative.Location.` column; for example, we can choose `IL`, `CA`, and `NY`. Call this filtered data frame `museums_states`. 

After creating `museums_states`, recreate our bar plot showing the distribution of museums vs non-museums and use `facet_grid()` to display each state's distribution in a separate panel. Call this plot `museum_facet`. How does the distribution of museum vs non-museum vary across the states you chose?

Our data also contains information on each museum's region, representing groups of states. Create a stacked bar plot using `museums_df` showing the count of museums by region (`Region.Code..AAM.`), mapping `Is.Museum` to the `fill` aesthetic. Convert `Region.Code..AAM.` to a factor (e.g. `factor(Region.Code..AAM.)`) so `ggplot2` plots its levels as discrete rather than continuous values. Call this plot `museum_stacked`.

Our plot is hard to read -- right now, we don't know what the region numbers correspond to. Use `scale_x_discrete()` to rename the numeric labels to text according to the following table.

| Code | Region          |
|------|-----------------|
| 1    | New England     |
| 2    | Mid-Atlantic    |
| 3    | Southeastern    |
| 4    | Midwest         |
| 5    | Mountain Plains |
| 6    | Western         |

Similarly, add a `scale_fill_discrete()` layer to relabel the "TRUE" and "FALSE" labels in our legend to "Museum" and "Non-Museum".
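Here is a minimal sketch of a stacked bar plot with both the `x` axis and the legend relabeled. The data frame `toy_regions` is made up and covers only three of the six region codes.

```r
library(ggplot2)

# Invented data: a numeric region code and a logical museum flag
toy_regions <- data.frame(
  Region = c(1, 1, 2, 2, 2, 3),
  Is.Museum = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

toy_stacked <- ggplot(toy_regions, aes(x = factor(Region), fill = Is.Museum)) +
  geom_bar() +  # bars are stacked by default when fill is mapped
  scale_x_discrete(labels = c("1" = "New England",
                              "2" = "Mid-Atlantic",
                              "3" = "Southeastern")) +
  scale_fill_discrete(labels = c("TRUE" = "Museum", "FALSE" = "Non-Museum"))

toy_stacked
```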

Based on the plot we created, which region has the most museums?

Rather than seeing counts, perhaps we're more interested in the percentage of museums vs non-museums by region. Transform the plot we just created to a stacked bar plot showing values out of 100% by passing `position = "fill"` to our `geom_bar()` layer. Apply the `scales::percent_format()` function to transform our `y` axis labels into percentage values. 
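The sketch below shows the same idea on invented data: `position = "fill"` rescales every bar to a total height of 1, and `scales::percent_format()` turns those proportions into percentage labels. `toy_regions` and `toy_pct` are hypothetical names.

```r
library(ggplot2)

# percent_format() returns a labelling function; here it is applied directly
pct_labels <- scales::percent_format()(c(0.25, 0.5))
pct_labels

# Invented data for illustration
toy_regions <- data.frame(
  Region = c(1, 1, 2, 2, 2, 3),
  Is.Museum = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

toy_pct <- ggplot(toy_regions, aes(x = factor(Region), fill = Is.Museum)) +
  geom_bar(position = "fill") +  # each bar now sums to 1
  scale_y_continuous(labels = scales::percent_format())

toy_pct
```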

How does the distribution of museum types vary by region?

Our graph looks pretty good! However, our axis titles are a little nondescript. Using the `labs()` layer, let's title this plot "Museum Types by Region", relabel the `x` axis title as "Region", relabel the `y` axis title as "Percentage of Total", and relabel the `fill` legend title as "Type". 

Now, someone can take a look at this plot and immediately understand what is being described. There were a lot of steps here, but now our plot is clear and professional. Give yourself a pat on the back, and feel free to take a 10 minute coffee break before the next section! 

For the next few tasks, we'll switch to looking at how much money each institution brought in and how that varies across geographies. Because we only have revenue data at the parent organization level, we'll want to first filter our dataset to omit any duplicates. Next, we'll create a few data frames from our starting data to look at different groups of museums by how much money they bring in.

Create a new data frame called `museums_revenue_df` that retains only unique values of `Legal.Name` in `museums_df`. Additionally, filter this data frame to include only entities with `Annual.Revenue` greater than 0. 

Create a second data frame from `museums_revenue_df` (the first data frame we created in this task) called `museums_small_df` that retains only museums with `Annual.Revenue` less than $1,000,000.

Create a third data frame from `museums_revenue_df` (the first data frame we created in this task) called `museums_large_df` that retains only museums with `Annual.Revenue` greater than $1,000,000,000.
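One possible way to deduplicate and then split a data frame by revenue is sketched below with `dplyr`. The data frame `toy_orgs` and its values are invented, and using `distinct()` with `.keep_all = TRUE` is an assumption about how to keep one row per `Legal.Name`; other approaches would also work.

```r
library(dplyr)

# Invented organizations; "A Museum" appears twice on purpose
toy_orgs <- data.frame(
  Legal.Name = c("A Museum", "A Museum", "B Zoo", "C Gallery"),
  Annual.Revenue = c(500000, 500000, 0, 2e9)
)

toy_revenue <- toy_orgs %>%
  distinct(Legal.Name, .keep_all = TRUE) %>%  # keep the first row per Legal.Name
  filter(Annual.Revenue > 0)                  # drop entities with no revenue

# Two subsets built from the same deduplicated data frame
toy_small <- toy_revenue %>% filter(Annual.Revenue < 1e6)
toy_large <- toy_revenue %>% filter(Annual.Revenue > 1e9)

toy_revenue
```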

Let's start by visualizing the distribution of annual revenue for our small museums dataset. Create a histogram called `revenue_histogram` using `museums_small_df` with `Annual.Revenue` mapped to the `x` axis. Experiment with different `binwidth` values to see what works best for our data, considering that our `x` axis variable ranges from 0 to \$1,000,000. 

Our `x` axis is a little hard to read. Let's make it more clear! Add a `scale_x_continuous()` layer applying the function `scales::dollar_format()` to our `x` axis labels. `dollar_format()` is a function from the `scales` library included in `ggplot2` that adds dollar signs and commas to monetary data.
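As an illustration, the sketch below builds a histogram from randomly generated revenue values and formats the `x` axis as dollars. The data and the `binwidth` of 50,000 are arbitrary choices, not recommendations for the real dataset.

```r
library(ggplot2)

# dollar_format() returns a labelling function; here it is applied directly
dollar_label <- scales::dollar_format()(1000000)
dollar_label

# Randomly generated revenues between 0 and 1,000,000 (not real data)
set.seed(1)
toy_small <- data.frame(Annual.Revenue = runif(200, min = 0, max = 1e6))

toy_hist <- ggplot(toy_small, aes(x = Annual.Revenue)) +
  geom_histogram(binwidth = 50000) +  # each bar covers a $50,000 range
  scale_x_continuous(labels = scales::dollar_format())

toy_hist
```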

Now, let's look at the variation in revenue for large museums by region. Create a boxplot called `revenue_boxplot` using `museums_large_df`, mapping `Region.Code..AAM.` to the `x` axis and `Annual.Revenue` to the `y` axis. Remember to convert `Region.Code..AAM.` to a factor (e.g. `factor(Region.Code..AAM.)`) so `ggplot2` plots its levels as discrete rather than continuous values. Use `scale_x_discrete()` to rename the numeric region codes to their text equivalents.


The plot we just created is a little hard to read, since there's one outlier so far above all the other data points. This one museum made a lot of money in 2013! Let's zoom in so we can see the rest of our boxes more clearly. Add a `coord_cartesian()` layer setting `ylim` to `c(1e9, 3e10)`. This tells our plot to zoom in on the `y` axis range between \$1,000,000,000 and \$30,000,000,000. How do the median, 25th percentile, and 75th percentile vary by region?
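Zooming with `coord_cartesian()` can be sketched on invented data like this. One value in `toy_large` is deliberately extreme so that it falls outside the zoomed range.

```r
library(ggplot2)

# Invented revenues; 9e10 plays the role of the extreme outlier
toy_large <- data.frame(
  Region = rep(c(1, 2), each = 5),
  Annual.Revenue = c(1.2e9, 2e9, 3e9, 4e9, 9e10,
                     1.5e9, 2.5e9, 3.5e9, 5e9, 6e9)
)

toy_box <- ggplot(toy_large, aes(x = factor(Region), y = Annual.Revenue)) +
  geom_boxplot() +
  # Only the visible window changes; no rows are removed from the data
  coord_cartesian(ylim = c(1e9, 3e10))

toy_box
```

Because `coord_cartesian()` only changes the visible window, the boxes are still computed from all of the data, including the outlier that is no longer shown.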


Though we can see the distribution by region more clearly now, our `y` axis label is hard to understand. Let's reformat our `y` axis as billions of dollars. The function defined below will convert values like \$1,000,000,000 to \$1B, which is much easier to read. Add a `scale_y_continuous()` layer to map our `y` axis labels using this function. 

```r
function(x) paste0("$", x/1e9, "B")
```
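One way to use this function is to store it under a name and pass it to the `labels` argument of `scale_y_continuous()`. In the sketch below, `billions_label`, `toy_large`, and `toy_box` are hypothetical names and the revenue values are invented.

```r
library(ggplot2)

# Same formatting logic as above, stored under a name so it can be reused
billions_label <- function(x) paste0("$", x / 1e9, "B")

# Calling it directly shows the labels it produces
example_labels <- billions_label(c(1e9, 2.5e9))
example_labels

# Invented revenue values, used only to show where the function plugs in
toy_large <- data.frame(
  Region = factor(c(1, 1, 2, 2)),
  Annual.Revenue = c(1e9, 3e9, 2e9, 5e9)
)

toy_box <- ggplot(toy_large, aes(x = Region, y = Annual.Revenue)) +
  geom_boxplot() +
  scale_y_continuous(labels = billions_label)

toy_box
```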

We've made our box plot much clearer to read. Great work! If you're feeling tired, you can always take a quick break -- you've earned it!

Now, let's take a look at revenue across all museums in our dataset. Using `museums_revenue_df`, create a bar plot called `revenue_barplot` mapping `Region.Code..AAM.` to the `x` axis and `Annual.Revenue` to the `y` axis. Remember to transform `Region.Code..AAM.` to a factor (`factor(Region.Code..AAM.)`) so it shows up as discrete values. 

Use `stat = "summary"` and `fun = "mean"` to calculate and display the mean revenue by region. Apply the appropriate `x` and `y` axis label transformations to make our labels more clear. Which region has the highest and lowest mean revenues?
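To illustrate the summary stat on a dataset that ships with `ggplot2`, the sketch below plots the mean highway mileage (`hwy`) for each car `class` in `mpg`. The `fun` argument assumes ggplot2 version 3.3.0 or later.

```r
library(ggplot2)

# The mean of hwy is computed for each class while the plot is built
mean_bar <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_bar(stat = "summary", fun = "mean")

mean_bar

# The same means computed by hand, for comparison with the bar heights
class_means <- tapply(mpg$hwy, mpg$class, mean)
class_means
```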


Once again, use the `labs` layer to make our plot more clear. Title the plot "Mean Annual Revenue by Region", relabel the `y` axis title to "Mean Annual Revenue", and relabel the `x` axis title to "Region".


Finally, let's add some error bars to our means. We'll need to calculate our standard errors before creating our plot. You can do so using the following code:

```r
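library(dplyr)    # provides %>%, group_by(), summarize(), and mutate()
library(plotrix)  # assumed source of std.error(), the standard error of the mean

museums_error_df <- museums_revenue_df %>%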
  group_by(Region.Code..AAM.) %>%
  summarize(
    Mean.Revenue = mean(Annual.Revenue), 
    Mean.SE = std.error(Annual.Revenue)) %>%
  mutate(
    SE.Min = Mean.Revenue - Mean.SE, 
    SE.Max = Mean.Revenue + Mean.SE)
```

Add error bars to our mean revenue by geography bar plot using the `geom_errorbar()` layer. Call this new plot `revenue_errorbar`. Our new `y` variable is our calculated `Mean.Revenue` column. Make sure to change the `stat` being used, since we're now displaying our calculated means as is rather than calculating them as we create the plot. Which regions have more or less variability around their mean revenues?
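The general shape of a bar plot with error bars drawn from pre-computed values might look like the sketch below. The summary table `toy_summary` is invented and already holds one row per region, which is why the bars use `stat = "identity"`.

```r
library(ggplot2)

# Invented summary table: one row per region with a mean and error bounds
toy_summary <- data.frame(
  Region = factor(c(1, 2, 3)),
  Mean.Revenue = c(4e6, 6e6, 5e6),
  SE.Min = c(3.5e6, 5e6, 4.8e6),
  SE.Max = c(4.5e6, 7e6, 5.2e6)
)

toy_errorbar <- ggplot(toy_summary, aes(x = Region, y = Mean.Revenue)) +
  geom_bar(stat = "identity") +  # plot the values as they are
  geom_errorbar(aes(ymin = SE.Min, ymax = SE.Max), width = 0.2)

toy_errorbar
```

The `width` argument only controls how wide the horizontal caps of the error bars are drawn.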

Congratulations -- you've completed this project! Awesome work today.

Museums and Nature Centers

Create graphs to explore the increase in carbon dioxide levels in Earth's atmosphere over the last few centuries compared to Earth's carbon dioxide levels over millennia. 

Let's first explore the dataset containing the data from the World Data Center for Paleoclimatology, Boulder and NOAA Paleoclimatology Program. Import the `"carbon_dioxide_levels.csv"` and save it to a new variable named `noaa_data`. 

Inspect the head of the data frame. What are the names of the two columns? What types of values are in each?

Let's visualize this data. First, create a new variable named  `noaa_viz` that is equal to a new `ggplot()` object and assign `noaa_data` as its `data` argument. Be sure to state the name of the variable after you define it so that it is rendered to the R notebook.

Define your scales by creating an aesthetic mapping that maps `Age_yrBP` on the x-axis and `CO2_ppmv` on the y-axis as part of the canvas.

In climate science, it's common to create line graphs to best portray the fluctuations in the levels of carbon dioxide. Add a `geom_line()` layer to the `noaa_viz` plot.

Let's add context to the plot and improve its legibility. Title the plot `"Carbon Dioxide Levels From 8,000 to 136 Years BP"` and add a subtitle that cites the data `"From World Data Center for Paleoclimatology and NOAA Paleoclimatology Program"`.

Tweak the axis labels so they are more descriptive than the column headers. The x-axis should read `"Years Before Today (0=1950)"` and the y-axis should read `"Carbon Dioxide Level (Parts Per Million)"`

Currently, the order of the years is counterintuitive. The most recent date is the one closest to 0, which corresponds to 1950 under the Before Present (BP) convention, so we want the years on the x-axis arranged in descending order. Add this to `noaa_viz`:

```R
noaa_viz + scale_x_reverse(limits = c(800000, 0))
```
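To see how the label and axis steps fit together, here is a sketch on a tiny invented data frame. `toy_co2` only borrows the column names from the project, its values are made up, and the label text is placeholder wording.

```R
library(ggplot2)

# Invented values; only the column names mirror the project data
toy_co2 <- data.frame(
  Age_yrBP = c(0, 1000, 2000, 3000),
  CO2_ppmv = c(280, 275, 270, 272)
)

toy_viz <- ggplot(toy_co2, aes(x = Age_yrBP, y = CO2_ppmv)) +
  geom_line() +
  labs(title = "Toy Carbon Dioxide Levels",
       subtitle = "Invented data for illustration",
       x = "Years Before Today (0=1950)",
       y = "Carbon Dioxide Level (Parts Per Million)") +
  scale_x_reverse()  # larger (older) ages now appear on the left

toy_viz
```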

In the second code block, let's explore the second dataset containing the data for the last 2014 years. Import the `"yearly_co2.csv"` file and save it to a new variable named `iac_data`. 

Inspect the head. What are the names of the four columns? 
What types of values are in each? Note that the `data_mean_global` is an equivalent metric to `CO2_ppmv`. We will not be using the other two columns in this project. What's different about the year column in this dataset?

Again, let’s create a new `ggplot()` object, save it to a new variable named `iac_viz`, and associate `iac_data` as its data argument. Be sure to state the name of the variable after you define it so that it is rendered to the R notebook.

Define your scales by creating an aesthetic mapping that maps `year` on the x-axis and `data_mean_global` on the y-axis as part of the canvas. 

Note: The dataset column headers are different from the ones in the previous data frame. Years are chronological, starting from 0 up to 2014, not in terms of BP. The `data_mean_global` column references the same metric as `CO2_ppmv`: the average carbon dioxide level, in parts per million, in the earth's atmosphere.

A line graph also makes sense for these data. Let’s explore how much carbon dioxide was stored in the atmosphere over the past two millennia by adding a `geom_line()` layer to the `iac_viz` plot.

This plot still needs labels to add context to the plot. Title the plot `"Carbon Dioxide Levels over Time"` and add a subtitle that cites the data `"From Institute for Atmospheric and Climate Science (IAC)."`

Tweak the axis labels so they are more descriptive than the column headers. The x-axis should read `"Year"` and the y-axis should read `"Carbon Dioxide Level (Parts Per Million)"`

Let's highlight the rise in carbon dioxide levels by adding a horizontal line that represents the maximum level in the first chart spanning over 8,000 years of carbon dioxide data. On a new line of code in the block, create a new variable named `millennia_max` and retrieve the maximum value of the `CO2_ppmv` column in the `noaa_data`. Print the value so you can see what it is.

Now that we have the maximum number in the `noaa_data`, let's map it on our `iac_data` plot. There's a geom in ggplot2 called `geom_hline()` that plots a horizontal line. Add a `geom_hline()` layer to `iac_viz` that has a `yintercept` value in its aesthetic mapping of `millennia_max`.

Add one more argument to the horizontal line's aesthetic mapping so that the legend can display information about what the line represents. Assign the value of the `linetype` argument as `"Historical CO2 Peak before 1950"`
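The pattern of computing a maximum in one data frame and drawing it as a reference line on another plot can be sketched with invented data. All names and values below (`toy_old`, `toy_recent`, `toy_max`, and the legend text) are hypothetical.

```R
library(ggplot2)

# Two invented data frames standing in for an older and a more recent record
toy_old <- data.frame(age = c(3000, 2000, 1000), level = c(270, 285, 280))
toy_recent <- data.frame(year = c(1800, 1900, 2000), level = c(283, 296, 369))

# Highest value in the older record
toy_max <- max(toy_old$level)
toy_max

toy_line <- ggplot(toy_recent, aes(x = year, y = level)) +
  geom_line() +
  # Putting linetype inside aes() is what creates a legend entry for the line
  geom_hline(aes(yintercept = toy_max, linetype = "Peak in older record"))

toy_line
```

Because the text is supplied inside `aes()`, ggplot2 treats it like a data value and uses it as the legend label instead of interpreting it as a line style.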

+ What do you notice has happened in the last 100 years relative to the last 8 millennia?



Visualizing Carbon Dioxide Levels

The data, the aesthetics, and the geometries.

If you want to inspect the correlation between two variables.

If you want to show the distribution of the dataset.

If you want to show the variability of a dataset.

The `ggplot()` function maps data on the visualization.

The arguments specified inside the `ggplot()` function are inherited by all of the subsequent layers. 

The `ggplot()` function binds the data to the ggplot object, or the "canvas" of the graph.

The `ggplot()` function is the first function call required to create any ggplot visualization. 

Aesthetic mappings are data-driven and wrapped in an `aes()` function while manual aesthetics provide a single visual value.

Aesthetic mappings provide visual instructions while manual aesthetics do not.

Manual aesthetics are data-driven and wrapped in an `aes()` function while aesthetic mappings provide a single visual value.

`labs(title="Monthly Rent vs Apartment Size in Brooklyn, NY", subtitle="Data by StreetEasy (2017)", x="Monthly Rent ($)", y="Apartment Size (sq ft.)")`


`labs(header="Monthly Rent vs Apartment Size in Brooklyn, NY", caption="Data by StreetEasy (2017)", x="Monthly Rent ($)", y="Apartment Size (sq ft.)")`


`labs(title="Monthly Rent vs Apartment Size in Brooklyn, NY", subtitle="Data by StreetEasy (2017)", xlab="Monthly Rent ($)", ylab="Apartment Size (sq ft.)")`


Test your introductory knowledge of ggplot2

Introduction to ggplot2

Learn the general grammar needed to make visualizations with the popular R visualization package, ggplot2.

Hello there! Today, we will be learning how to make visualizations with R's popular [ggplot2 package](https://ggplot2.tidyverse.org/). You can create simple visualizations with base R, but we are choosing to teach you ggplot2 because it is relatively easy to implement, well-documented, and supported by a community of data enthusiasts. 

The ggplot2 package is part of the tidyverse, so it works alongside other data cleaning and processing tools. The package is designed with the grammar of graphics in mind. This grammar of graphics describes a general pattern to follow when you're creating a visualization. In fact, that's where the "gg" in ggplot comes from: **g**rammar of **g**raphics. ggplot2 has fostered a large programming community, so you'll find that as you venture into making your own plots outside of the Codecademy platform, there'll be lots of resources and examples to welcome you.

This lesson will teach you the basic grammar required to create a plot. After you learn the underlying structure or "philosophy" of ggplot2, you can extend the logic to create many types of plots. Once you get the basic structure down, our upcoming lessons will explore how to customize your plot and calculate statistics in your visualization. Let's get started. 


Animated gif showing layers being added to a canvas one by one.

When you learn grammar in school you learn about the basic units to construct a sentence. The basic units in the "grammar of graphics" consist of:

+ The _data_ or the actual information you wish to visualize.
+ The _geometries_, shortened to "geoms", describe the shapes that represent our data, whether that's dots on a scatter plot, bars on a bar chart, or a line that plots the data. The list goes on! Geoms are the shapes that "map" our data.
+ The _aesthetics_, or the visual attributes of the plot, including the scales on the axes, the color, the fill, and other attributes concerning appearance.

Another key component to understand is that in ggplot2, geoms are "added" as _layers_ to the original canvas which is just an empty plot with data associated to it. 

Once you learn these three basic grammatical units, you can create the equivalent of a basic sentence, or a basic plot. There are more units in the "grammar of graphics," but in this lesson we'll mostly be learning about these three. 

Layers and Geoms

The first thing you'll need to do to create a ggplot object is invoke the `ggplot()` function. Conceptualize this step as initializing the "canvas" of the visualization. In this step, it's also standard to associate the data frame the rest of the visualization will use with the canvas. What do we mean by "the rest" of the visualization? We mean all the layers you'll add as you build out your plot.

As we mentioned, at its heart, a ggplot visualization is a combination of layers that each display information or add style to the final graph. You "add" these layers to a starting canvas, or ggplot object, with a `+` sign. We'll add geometries and aesthetics in the next exercises. For now, let's stop to understand that any arguments inside the `ggplot()` function call are inherited by the rest of the layers on the plot. 

Here we invoke `ggplot()` to create a ggplot object and assign the dataframe `df`, saving it inside a variable named `viz`:
```R
viz <- ggplot(data=df)

viz
```
Note: The code above assigns the value of the canvas to `viz` and then states the variable name `viz` after so that the visualization is rendered in the notebook.

Any layers we add to `viz` would have access to the dataframe. We mentioned the idea of aesthetics before. It's important to understand that any aesthetics that you assign as the `ggplot()` arguments will also be inherited by other layers. We'll explore what this means in depth too, but for now, it's sufficient to conceptualize that arguments defined inside `ggplot()` are inherited by other layers.


Diagram visualizing the inheritance between the canvas, the aesthetics defined with the ggplot function, and the subsequent layers.

The ggplot() function

Before we go any further, let's stop to understand when the data gets bound to the visualization:

+ Data is bound to a ggplot2 visualization by passing a data frame as the first argument in the `ggplot()` function call. You can include the named argument like `ggplot(data=df_variable)` or simply pass in the data frame like `ggplot(df_variable)`.
+ Because the data is bound at this step, this means that the rest of our layers, which are function calls we add with a `+` plus sign, all have access to the data frame and can use the column names as variables. 

For example, assume we have a data frame `sales` with the columns `cost` and `profit`. In this example, we assign the data frame `sales` to the `ggplot()` object that is initialized: 

```R
viz <- ggplot(data=sales) + 
       geom_point(aes(x=cost, y=profit))
viz # renders plot
``` 

In the example above:

+ The ggplot object or canvas was initialized with the data frame `sales` assigned to it
+ The subsequent `geom_point` layer used the `cost` and `profit` columns to define the scales of the axes for that particular geom. Notice that it simply referred to those columns with their column names.
+ We state the variable name of the visualization ggplot object so we can see the plot.

Note: There are other ways to bind data to layers if you want each layer to have a different dataset, but the most readable and popular way to bind the dataframe happens at the `ggplot()` step and your layers use data from that dataframe.
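As a sketch of the less common case mentioned in the note, a single layer can be given its own data frame through that layer's `data` argument. The `sales` and `highlight` data frames below are invented for illustration.

```R
library(ggplot2)

# Invented data frames with matching column names
sales <- data.frame(cost = c(10, 20, 30), profit = c(3, 8, 12))
highlight <- data.frame(cost = 20, profit = 8)

viz <- ggplot(data = sales, aes(x = cost, y = profit)) +
  geom_point() +  # uses the data frame bound in ggplot()
  # This layer uses its own data frame but still inherits the x and y mappings
  geom_point(data = highlight, color = "darkred", size = 4)

viz # renders plot
```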

Associating the Data

In the context of ggplot, aesthetics are the instructions that determine the visual properties of the plot and its geometries.

Aesthetics can include things like the scales for the x and y axes, the color of the data on the plot based on a property or simply on a color preference, or the size or shape of different geometries. 

There are two ways to set aesthetics, by manually specifying individual attributes or by providing _aesthetic mappings_. We'll explore aesthetic mappings first and come back to manual aesthetics later in the lesson. Aesthetic mappings "map" variables from the data frame to visual properties in the plot. You can provide aesthetic mappings in two ways using the `aes()` mapping function:

+ At the canvas level: All subsequent layers on the canvas will inherit the aesthetic mappings you define when you create a ggplot object with `ggplot()`.
+ At the geom level: Only that layer will use the aesthetic mappings you provide. 


Let's discuss _inherited aesthetics_ first, or aesthetics you define at the canvas level. Here's an example of code that assigns `aes()` mappings for the `x` and `y` scales at the canvas level:

```R
viz <- ggplot(data=airquality, aes(x=Ozone, y=Temp)) +
       geom_point() + 
       geom_smooth()
```

![Aesthetics Inheritance Example](https://content.codecademy.com/courses/Learn-R/aes-inherit-example.png)


In the example above:
+ The aesthetic mapping is wrapped in the `aes()` aesthetic mapping function as an additional argument to `ggplot()`.
+ Both of the subsequent geom layers, `geom_point()` and `geom_smooth()` use the scales defined inside the aesthetic mapping assigned at the canvas level.

You should set aesthetics for subsequent layers at the canvas level if all layers will share those aesthetics.
 


What are aesthetics?

Before we teach you how to add aesthetics specific to a geom layer, let's create our first geom! As mentioned before, geometries or geoms are the shapes that represent our data.

In ggplot, there are many types of geoms for representing different relationships in data. You can read all about each one in the layers section of the [ggplot2 documentation](https://ggplot2.tidyverse.org/reference/). Once you learn the basic grammar of graphics, all you'll have to do is read the documentation of a particular geom and you'll be prepared to make a plot with it following the general pattern.  For simplicity's sake, let's start with the scatterplot geom, or `geom_point()`, which simply represents each datum as a point on the grid. Scatterplots are great for graphing paired numerical data or to detect a correlation between two variables. 

The following code adds a scatterplot layer to the visualization:
```R
viz <- ggplot(data=df, aes(x=col1,y=col2)) +
       geom_point()

```
In the code above:
+ Notice the layer is being added by using a `+` sign, which comes after the ggplot object is created. The `+` must sit at the end of the line it continues, not at the start of the next line.
+ The `geom_point()` function call is what adds the points layer to the plot. This call can take arguments but we are keeping it simple for now.

The code above would render the following plot:
![Scatterplot one layer](https://content.codecademy.com/courses/Learn-R/one-layer.png)

Another popular layer is `geom_smooth()`, which helps you spot patterns in the data by drawing a smoothed trend line (a kind of line of best fit). By default, this layer comes with a gray band showing the confidence interval around the line. You could add a smooth layer to the plot by typing the following:

```R
viz <- ggplot(data=df, aes(x=col1,y=col2)) +
       geom_point() + 
       geom_smooth()
```
+ Notice that you can add layers one on top of the other. We added the smooth line after adding the `geom_point()` layer. We could have just included the point layer, or just the line-of-best-fit layer. But the combination of the two enhances our visual understanding of the data, so they make a great pairing. 
+ It is nice to put each layer on its own line although it is not necessary, since it improves readability in the long run if you're collaborating with other people.

The code above would render the following plot: 
![Plot two layers with smooth line](https://content.codecademy.com/courses/Learn-R/two-layers.png)

Adding Geoms

In the previous exercises, we added geoms to the plot and explored the idea of layers inheriting the original aesthetic mappings of the canvas. Sometimes, you'll want individual layers to have their own mappings. For example, what if we wanted the scatterplot layer to classify the points based on a data-driven property? We achieve this by providing an aesthetic mapping for that layer only. 


Let's explore the aesthetic mappings for the [`geom_point()` layer](https://ggplot2.tidyverse.org/reference/geom_point.html). What if we wanted to color-code the points on the scatterplot based on a property? It's possible to customize the color by passing in an `aes()` aesthetic mapping with the color based on a data-driven property. Observe this example:

```R
viz <- ggplot(data=airquality, aes(x=Ozone, y=Temp)) +
       geom_point(aes(color=Month)) + 
       geom_smooth()

```
The code above would *only* change the color of the point layer, it would not affect the color of the smooth layer since the `aes()` aesthetic mapping is passed at the point layer. 
 
![Aesthetics Inheritance Example](https://content.codecademy.com/courses/Learn-R/aes-example.png)


Note: You can read about the individual aesthetics available for each geom in its documentation. Some aesthetics are shared across geoms and others are specific to particular ones.

Geom Aesthetics

We've reviewed how to assign data-driven aesthetic mappings at the canvas level and at the geom level. However, sometimes you'll want to change an aesthetic based on visual preference and not data. You might think of this as "manually" changing an aesthetic.

If you have a pre-determined value in mind, you provide a named aesthetic parameter and the value for that property *without* wrapping it in an `aes()`. For example, if you wanted to make all the points on the scatter plot layer dark red because that's in line with the branding of the visualization you are preparing, you could simply pass in a `color` parameter with a manual value `darkred` or any color value like so:

```R
viz <- ggplot(data=airquality, aes(x=Ozone, y=Temp)) +
       geom_point(color="darkred")  
```
+ Note that we did not wrap the `color` argument inside `aes()` because we are manually setting that aesthetic. Here are more aesthetics for the `geom_point()` layer: `x`, `y`, `alpha`, `color`, `fill`, `group`, `shape`, `size`, `stroke`. The `alpha` aesthetic describes the opacity of the points, and the `shape` aesthetic lets you draw each point as something other than a dot, such as a triangle or a square. Read more about the values each of these aesthetics take in the [`geom_point()` layer documentation](https://ggplot2.tidyverse.org/reference/geom_point.html).  

The code above would render the following plot:
![Dark scatterplot layer](https://content.codecademy.com/courses/Learn-R/dark-red-scatter.png)
We advise that your aesthetic choices have intention behind them. Too much styling can overcomplicate the appearance of a plot, making it difficult to read. 
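The sketch below sets several manual aesthetics at once and, for contrast, shows what happens when a color name is placed inside `aes()` by mistake. The specific values for `alpha`, `size`, and `shape` are arbitrary choices.

```R
library(ggplot2)

# Manual aesthetics: fixed values passed outside aes()
manual_viz <- ggplot(data = airquality, aes(x = Ozone, y = Temp)) +
  geom_point(color = "darkred", alpha = 0.5, size = 3, shape = 17)  # 17 is a filled triangle

manual_viz

# Common slip: inside aes(), "darkred" is treated as a data value, so the
# points get a default palette color and a legend labeled "darkred"
mapped_viz <- ggplot(data = airquality, aes(x = Ozone, y = Temp)) +
  geom_point(aes(color = "darkred"))

mapped_viz
```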

Manual Aesthetics

So far, we've reviewed how to add geometries to represent our data. We've also learned how to modify aesthetic values in our plot, whether those aesthetics are data-driven or assigned manually. Another big part of creating a plot is making sure it has reader-friendly labels. By default, ggplot2 labels each component of the plot with the name of the variable mapped to it. Unfortunately, variable names from code are not always legible to outside readers with no context. 

If you wish to customize your labels, you can add a `labs()` function call to your ggplot object. Inside the function call to `labs()` you can provide new labels for the `x` and `y` axes as well as a `title`, `subtitle`, or `caption`. You can check out the list of available label arguments in the [`labs()` documentation](https://ggplot2.tidyverse.org/reference/labs.html) here. 

The following `labs()` function call and these specified arguments would render the following plot:

```R
viz <- ggplot(df, aes(x=rent, y=size_sqft)) + 
       geom_point() +
       labs(title="Monthly Rent vs Apartment Size in Brooklyn, NY", subtitle="Data by StreetEasy (2017)", x="Monthly Rent ($)", y="Apartment Size (sq ft.)")
viz
```

![Labels are tweaked for readability](https://content.codecademy.com/courses/Learn-R/labs-layer.png)


Labels

We've gone over each of the basic units in the grammar of graphics: data, geometries, and aesthetics. Let's extend this new knowledge to create a new type of plot: [the bar chart](https://ggplot2.tidyverse.org/reference/geom_bar.html). Bar charts are great for showing the distribution of categorical data. Typically, one of the axes on a bar chart will have numerical values and the other will have the names of the different categories you wish to understand.

Let's build a bar chart by using some of the R built-in datasets. These are data frames that you can readily access in your code to explore and create visualizations. They are handy because these built-in datasets usually include nicely distributed categorical data.

The `geom_bar()` layer adds a bar chart to the canvas. Typically when creating a bar chart, you assign an `aes()` aesthetic mapping with a single categorical variable on the `x` axis, and `geom_bar()` will compute the count for each category and display the count values on the `y` axis. 

Since we're extending the grammar of graphics, let's also learn about how to save our visuals as local image files.

The following code maps the count of each category in the Language column in a dataset of 100 popular books to a bar length and then saves the visualization as a .png file named `"bar-example.png"`:

```R
bar <- ggplot(books, aes(x=Language)) + geom_bar()
bar
ggsave("bar-example.png")
```
Note: The [`ggsave()` function](https://www.rdocumentation.org/packages/ggplot2/versions/3.1.1/topics/ggsave) allows you to save  visualizations as a local file with the name of your choice. It's a useful function when developing visualizations locally.

The code above outputs the following plot:
![Bar chart displaying different languages spoken in popular books](https://content.codecademy.com/courses/Learn-R/bar-example.png)
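To make the counting step concrete, the sketch below uses the `mpg` dataset that ships with ggplot2 (not the books data above) and compares the bar chart with a manual tally from `table()`.

```R
library(ggplot2)

# geom_bar() counts the rows in each class and draws one bar per class
class_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

class_bar

# The same counts computed by hand; each value should match a bar's height
class_counts <- table(mpg$class)
class_counts
```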


Extending The Grammar

You've completed the introduction to ggplot lesson! You're ready to follow the general pattern for creating a visualization:

1. Determine what relationship you wish to explore in your data
2. Find the right geom(s) in the [ggplot2 documentation](https://ggplot2.tidyverse.org/reference/) to display that relationship and read about the arguments and aesthetics specific to that geom
3. Extend the grammar of graphics to follow the pattern learned in this lesson to add layers and create a visualization. Improve graph legibility by polishing labels and styles.

Some of the key concepts you learned in this lesson include:

+ The basic units of grammar include data, geoms, and aesthetics.
+ Passing a data frame to the `ggplot()` function creates a ggplot object, known as the canvas, with that data associated to it.
+ The geometries or geoms are the shapes that display the data. Geometries become layers as you add them to your ggplot object.
+ The aesthetics are visual instructions you provide the plot. Aesthetics can be inherited or specified at the geom level.
+ Aesthetic mappings are data-driven visual instructions for the plot.
+ You can add context to your plot by customizing its labels with the `labs()` function


Intro to Visualization with R

### Why Learn ggplot2? 

This course is a great introduction to both fundamental data visualization concepts and the R programming language. R is used by professionals in the Data Analysis and Data Science fields as part of their daily work. 



### Take-Away Skills 
In this course, you will learn how to use one of R’s most popular packages — ggplot2. This package is all about creating beautiful images based on your data. We’ll walk you through how to create some of the most common graphs and charts like histograms, scatter plots, and bar charts. We will also show you how to customize your images. R gives you total control over the colors you want to use, the labels you display, and even the tick marks on your axes. Let’s dive in and start making some great images!


Learn the basics of how to create visualizations using the popular R package ggplot2.

Learn R: Fundamentals of Data Visualization with ggplot2

Learn how to create visualizations using the popular R package ggplot2

Learn ggplot2
