Our next step is to check for outlier data points. Linear regression models also assume that the dataset contains no extreme values that are unrepresentative of the actual relationship between the predictor and outcome variables. A box-and-whisker plot is a common way to quickly check whether a dataset contains any outliers, that is, data points that differ significantly from the other observations. An outlier may be caused by variability in measurement, or it might be a sign of an error in the collection of data.
Regardless, ggplot’s geom_boxplot() function makes it easy to create box-and-whisker plots. To plot the distribution of a single variable, like advertising$sales (the total number of sales for a product in a month), we pass in the same variable as both x and y in our call to aes():
plot <- advertising %>% ggplot(aes(sales, sales)) + geom_boxplot()
In this case, it looks like there are a handful of negative sales values in the dataset. This is not what we would expect given our understanding of the data; how could an entire market have negative average sales over an entire year? This seems like an error stemming from the collection of this data into a spreadsheet format. We will remove these negative data points from our dataset using the filter() function. filter() takes a logical condition and keeps only the rows for which that condition evaluates to TRUE; rows for which it is FALSE are dropped.
advertising <- advertising %>% filter(sales > 0)
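To make the behavior of filter() concrete, here is a minimal sketch on a small made-up data frame. The name toy_advertising and its values are invented for illustration and are not the lesson's actual advertising data; the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for the advertising dataset
toy_advertising <- data.frame(
  sales = c(12.5, -3.0, 8.1, 15.2, -0.7, 9.9)
)

# filter() keeps only the rows where the condition is TRUE,
# so the two negative sales values are dropped
toy_clean <- toy_advertising %>% filter(sales > 0)

nrow(toy_advertising)  # 6 rows before filtering
nrow(toy_clean)        # 4 rows after filtering
```

Note that a condition of sales > 0 would also drop any rows where sales is exactly zero, so it is worth confirming that this matches the intent for your own data.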
Instructions
Let’s check our variables for any outliers. Use a combination of the base ggplot() function and geom_boxplot() to plot a boxplot of the clicks variable; assign the result to a variable called clicks_bx_plot. Don’t forget to pass clicks in as the y variable, and afterwards call clicks_bx_plot in order to view the plot!
Do any data points look like outliers? Set threshold equal to the value of clicks above which all data points fall outside of the whiskers of our box plot.
Use the filter() function to remove all outlying values from clicks. Save the resulting data frame to a variable called convert_clean.
Let’s check our work! Create another box plot of clicks, but this time, use the convert_clean dataset and save the plot to clean_bx_plot. Call clean_bx_plot and note any differences from clicks_bx_plot. Are the outliers gone?
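The steps above ask you to read a threshold off the plot by eye. As an alternative, here is a rough sketch of how such a threshold could be computed. By default, geom_boxplot() draws each whisker out to the most extreme point within 1.5 times the interquartile range (IQR) of the box edge, and plots anything beyond that as an individual point. The data frame toy_convert and its values are made up for illustration, and the toy_ prefixed names are hypothetical; your own clicks data will give a different threshold.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; the real clicks values will differ
toy_convert <- data.frame(
  clicks = c(10, 12, 11, 13, 12, 14, 11, 95, 120)
)

# Upper fence: third quartile plus 1.5 times the interquartile range
q3 <- quantile(toy_convert$clicks, 0.75, names = FALSE)
threshold <- q3 + 1.5 * IQR(toy_convert$clicks)  # 18.5 for this toy data

# Keep only the rows at or below the upper fence
toy_convert_clean <- toy_convert %>% filter(clicks <= threshold)

# Box plot of the cleaned data, with clicks mapped to y
toy_clean_bx_plot <- toy_convert_clean %>%
  ggplot(aes(y = clicks)) +
  geom_boxplot()
toy_clean_bx_plot
```

This sketch only handles unusually high values. Removing points changes the quartiles, so a box plot of the cleaned data can occasionally flag new points as outliers; that is expected and does not necessarily mean the filtering went wrong.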