Learn

Our next step is to check for outliers. Linear regression models also assume that the data set contains no extreme values that misrepresent the actual relationship between the predictor and outcome variables. A box-and-whisker plot is a common way to quickly check whether a data set contains any outliers, that is, data points that differ significantly from the other observations. An outlier may be caused by variability in measurement, or it may be a sign of an error in data collection.
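To make "differs significantly" concrete, here is a minimal sketch of the rule a box plot typically uses to flag outliers. By default, `geom_boxplot()` draws its whiskers out to the most extreme points within 1.5 times the interquartile range (IQR) of the box edges, and anything beyond that is drawn as an individual point. The `values` vector below is hypothetical toy data, not part of the lesson's dataset.

```r
# Hypothetical toy data: nine typical values plus one extreme value
values <- c(12, 15, 14, 13, 16, 15, 14, 13, 17, 95)

# First and third quartiles (the edges of the box)
q1 <- quantile(values, 0.25, names = FALSE)
q3 <- quantile(values, 0.75, names = FALSE)
iqr <- q3 - q1

# Points beyond these fences fall outside the whiskers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- values[values < lower_fence | values > upper_fence]
print(outliers)
```

The 1.5 multiplier is a convention rather than a law, so a flagged point is a candidate for closer inspection, not automatically an error.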

`ggplot`’s `geom_boxplot()` function makes it easy to create box-and-whisker plots. To plot the distribution of a single variable, like `advertising$sales` (the total number of sales for a product in a month), we pass in the same variable as both x and y in our call to `aes()`:

```
# Box plot of sales: the same variable is mapped to both x and y
plot <- advertising %>%
  ggplot(aes(sales, sales)) +
  geom_boxplot()
plot  # print the plot object to display it
```

In this case, it looks like there are a handful of negative `sales` values in the dataset. This is not what we would expect given our understanding of the data; how could a sales figure be negative? This looks like an error introduced when the data was collected into a spreadsheet. We will remove these negative data points from our dataset using the `filter()` function. `filter()` takes a logical condition and keeps only the rows where that condition is `TRUE`; rows where it evaluates to `FALSE` or `NA` are dropped.

```
# Keep only rows with positive sales (column name matches the lowercase `sales` used above)
advertising <- advertising %>% filter(sales > 0)
```
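To see exactly which rows a condition like this removes, here is a small sketch on a hypothetical data frame, `toy_sales`, that stands in for `advertising`. It assumes the `dplyr` package is installed.

```r
library(dplyr)

# Hypothetical stand-in for the advertising data
toy_sales <- data.frame(sales = c(120, -5, 80, NA, 0, 45))

# Rows are kept only where the condition is TRUE;
# the negative value, the NA, and the zero are all dropped
kept <- toy_sales %>% filter(sales > 0)
print(kept$sales)
```

Note that a strict `> 0` also removes rows where `sales` is exactly zero. If zero is a legitimate value in your data, `sales >= 0` would keep it while still removing the negatives.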

### Instructions

1. Let’s check our variables for any outliers. Use a combination of the base `ggplot()` function and `geom_boxplot()` to plot a boxplot of the `clicks` variable; assign the result to a variable called `clicks_bx_plot`. Don’t forget to pass `clicks` in as the `y` variable, and afterwards call `clicks_bx_plot` in order to view the plot!

2. Do any data points look like outliers? Set `threshold` equal to the value of `clicks` above which all data points fall outside of the whiskers of our box plot.

3. Use the `filter()` function to remove all outlying values from `clicks`. Save the resulting data frame to a variable called `convert_clean`.

4. Let’s check our work! Create another box plot of `clicks`, but this time, use the `convert_clean` dataset and save the plot to `clean_bx_plot`. Call `clean_bx_plot` and note any differences from `clicks_bx_plot`. Are the outliers gone?
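The general shape of this plot, filter, and re-plot workflow can be sketched on made-up data. Everything below is illustrative: the `toy` data frame, its values, and the cutoff of 20 are assumptions, not the lesson's dataset or its answer. The sketch assumes `dplyr` and `ggplot2` are installed.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data, not the lesson's dataset
toy <- data.frame(clicks = c(12, 15, 14, 13, 16, 15, 14, 13, 17, 95))

# Box plot of a single variable, mapped to y only
toy_bx_plot <- toy %>%
  ggplot(aes(y = clicks)) +
  geom_boxplot()
toy_bx_plot

# Cutoff chosen by eye from the plot (an assumed value for this toy data)
threshold <- 20

# Keep only the rows below the cutoff
toy_clean <- toy %>% filter(clicks < threshold)

# Re-plot the cleaned data to compare against the first plot
toy_clean_bx_plot <- toy_clean %>%
  ggplot(aes(y = clicks)) +
  geom_boxplot()
toy_clean_bx_plot
```

Comparing the two plots side by side is the quickest check: the y-axis range should shrink noticeably once the extreme value is gone.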