Our next step is to check for outlying data points. Linear regression models also assume that the dataset contains no extreme values that are unrepresentative of the actual relationship between the predictor and outcome variables. A box-and-whisker plot is a common way to quickly check whether a dataset contains outliers: data points that differ significantly from the other observations. An outlier may be caused by variability in measurement, or it may be a sign of an error in data collection.
The geom_boxplot() method allows for the easy creation of box-and-whisker plots. To plot the distribution of a single variable, like advertising$sales (the total number of sales for a product in a month), we pass in the same variable as both x and y in our call to aes():
plot <- advertising %>% ggplot(aes(sales, sales)) + geom_boxplot()
In this case, it looks like there are a handful of negative sales values in the dataset. This is not what we would expect given our understanding of the data: how could a sales figure be negative? This looks like an error introduced when the data was collected into a spreadsheet format. We will filter these negative data points out of our dataset using the filter() method. We can pass a boolean expression into filter() to exclude the rows for which it resolves to FALSE:
advertising <- advertising %>% filter(sales > 0)
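To see what filter() does with a boolean condition, here is a minimal sketch on a small made-up data frame. The name toy_advertising and its values are hypothetical stand-ins, not the lesson's real data; the only assumption about the real data is that it has a numeric sales column.

```r
library(dplyr)

# Hypothetical toy data standing in for the advertising dataset.
toy_advertising <- data.frame(
  market = c("A", "B", "C", "D"),
  sales  = c(12.5, -3.0, 8.1, 0.0)
)

# filter() keeps only the rows where the condition evaluates to TRUE,
# so the negative row (and the zero row) are dropped here.
toy_clean <- toy_advertising %>% filter(sales > 0)

print(toy_clean)
```

Note that sales > 0 also removes rows where sales is exactly zero; if zero is a legitimate value in your data, sales >= 0 would keep those rows.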
Let’s check our variables for any outliers. Use a combination of the base ggplot() function and geom_boxplot() to plot a boxplot of the clicks variable; assign the result to a variable called clicks_bx_plot. Don’t forget to pass clicks in as the y variable, and afterwards call clicks_bx_plot in order to view the plot!
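The general pattern of mapping a single variable to y can be sketched on made-up data as follows. The data frame toy_convert, its values, and the plot name toy_bx_plot are hypothetical and chosen only for illustration.

```r
library(ggplot2)

# Hypothetical click counts; the last two values are deliberately extreme.
toy_convert <- data.frame(
  clicks = c(12, 15, 14, 18, 20, 17, 16, 13, 19, 95, 120)
)

# Map the variable to y only; geom_boxplot() then draws a single box.
toy_bx_plot <- ggplot(toy_convert, aes(y = clicks)) + geom_boxplot()

# Calling the object by name prints (draws) the plot.
toy_bx_plot
```

Storing the plot in a variable before printing it makes it easy to compare against a second plot later.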
Do any data points look like outliers? Set threshold equal to the value of clicks above which all data points fall outside of the whiskers of our box plot.
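If the cutoff is hard to read off the plot, it can also be computed. By default, geom_boxplot() draws as outliers the points that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to a hypothetical vector, toy_clicks; the values are invented and the variable names are not part of the lesson.

```r
# Hypothetical click counts; the last two values are deliberately extreme.
toy_clicks <- c(12, 15, 14, 18, 20, 17, 16, 13, 19, 95, 120)

# First and third quartiles, using R's default quantile method.
q1 <- quantile(toy_clicks, 0.25, names = FALSE)
q3 <- quantile(toy_clicks, 0.75, names = FALSE)

# Upper fence: values above this are drawn beyond the upper whisker.
upper_fence <- q3 + 1.5 * (q3 - q1)

# The points that a default box plot would flag as high outliers.
toy_outliers <- toy_clicks[toy_clicks > upper_fence]

print(upper_fence)
print(toy_outliers)
```

The fence is a convention rather than a statistical test, so a threshold chosen by eye from the plot is also a reasonable answer for this step.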
Use the filter() method to remove all outlying values from clicks. Save the resulting data frame to a variable called convert_clean.
Let’s check our work! Create another box plot of clicks, but this time, use the convert_clean dataset and save the plot to clean_bx_plot. View clean_bx_plot and note any differences from clicks_bx_plot. Are the outliers gone?
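The same check can be done numerically. The sketch below counts the points beyond the 1.5 × IQR fences before and after filtering. Everything in it is hypothetical: the toy_convert data, the cutoff of 27 (picked for this toy data only), and the helper count_outliers(), which is not a dplyr or ggplot2 function.

```r
library(dplyr)

# Hypothetical data; the last two click counts are deliberately extreme.
toy_convert <- data.frame(
  clicks = c(12, 15, 14, 18, 20, 17, 16, 13, 19, 95, 120)
)

# Assumed cutoff for this toy data only.
toy_threshold <- 27

# Keep the rows at or below the cutoff.
toy_convert_clean <- toy_convert %>% filter(clicks <= toy_threshold)

# Hypothetical helper: count points beyond the 1.5 * IQR fences.
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}

outliers_before <- count_outliers(toy_convert$clicks)
outliers_after  <- count_outliers(toy_convert_clean$clicks)

print(c(before = outliers_before, after = outliers_after))
```

Because the quartiles are recomputed on the filtered data, a second box plot can occasionally flag new points as outliers even after the original ones are removed, so it is worth looking at the new plot rather than assuming it is clean.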