Great! We can build a model! But… how do we know if it’s any good? And if another data scientist builds a different model using a different independent variable, how can we tell which model is “best”? Even within statistics, “best” can be a subjective qualifier. However, scientists who use regression models generally agree that the best model is the one that minimizes the distance between the data points and the estimate line drawn by the model. The vertical distance between a data point and the line estimated by a regression model is called a residual; residuals and their aggregations are the fundamental building blocks of measures of regression model fit and accuracy.
Because residuals are based on Cartesian distances, it often helps to visualize their values. For instance, consider the plot of a simple linear regression alongside its training data below. Note that one point is 4 units above the regression estimate line; in this example, the residual for that point is 4. Meanwhile, another point is 2 units below the regression estimate line; the residual for that point is -2. A data point is best fit by the model that produces the smallest residual, in absolute value, for that point.
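To make the arithmetic concrete, here is a minimal sketch in R. The dataset and the names toy and toy_model are invented for illustration and are not part of the lesson's data. It shows that each residual is the observed value minus the model's estimate, so points above the line get positive residuals and points below get negative ones.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 9, 5, 8, 11)
)

# Fit a simple linear regression of y on x
toy_model <- lm(y ~ x, data = toy)

# Estimates are the points on the regression line at each observed x
estimate <- predict(toy_model)

# A residual is the observed value minus the estimate:
# positive when the point sits above the line, negative when below
manual_residuals <- toy$y - estimate

# residuals() reports the same values as the manual subtraction
all.equal(unname(manual_residuals), unname(residuals(toy_model)))
```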
When scientists make quantitative arguments for a best fit model, they rely on an aggregation, often the sum or average, of residual values across an entire dataset. While it is easy to be overwhelmed by the variety of measures used to argue that one model is better than another, it is crucial to realize that all of these measures are grounded in the simple difference between the regression estimate and the observed data point. Let’s produce a visualization of our own model, which estimates total_convert from clicks, to better understand our model residuals.
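As a rough illustration of how residuals get aggregated, the sketch below computes a few common summaries from a made-up residual vector. The values and the variable names (res, rss, mse, rmse, mae) are assumptions chosen for this example. Because positive and negative residuals cancel each other, these measures square the residuals or take their absolute values before summing or averaging.

```r
# Hypothetical residuals from some fitted model (illustrative values only)
res <- c(4, -2, 1.5, -3, -0.5)

# Positive and negative residuals cancel when added directly
raw_sum <- sum(res)

# Squaring or taking absolute values removes the sign before aggregating
rss  <- sum(res^2)       # residual sum of squares
mse  <- mean(res^2)      # mean squared error
rmse <- sqrt(mse)        # root mean squared error, in the units of the outcome
mae  <- mean(abs(res))   # mean absolute error

c(raw_sum = raw_sum, rss = rss, mse = mse, rmse = rmse, mae = mae)
```

When comparing two models fit to the same data, the one with the smaller value on the chosen measure has the better fit by that measure.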
Instructions
First, let’s pull the points that make up the estimate line, and their respective residual values, out of our model. We can save them as columns called estimate and residuals in our main train training dataset.
- Call predict() on model and save the result to train$estimate.
- Call residuals() on model and save the result to train$residuals.
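The two steps above might look like the following sketch. The train data frame here is a small hypothetical stand-in with invented values, and the model formula total_convert ~ clicks is an assumption based on the X and Y variables used in the plotting steps; the lesson's own train and model objects would be used in their place.

```r
# Hypothetical stand-in for the lesson's train dataset (values invented)
train <- data.frame(
  clicks        = c(10, 25, 40, 60, 85, 120),
  total_convert = c(1, 2, 2, 5, 4, 9)
)

# Assumed model specification: total_convert estimated from clicks
model <- lm(total_convert ~ clicks, data = train)

# Save the points on the estimate line and the residuals as new columns
train$estimate  <- predict(model)
train$residuals <- residuals(model)

head(train)
```

Calling predict() without new data returns the fitted values for the rows the model was trained on, so both new columns line up row by row with train.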
Plot the values of clicks and total_convert using a combination of ggplot() and geom_point(). Save the result to a variable called plot. Don’t forget to pass in clicks as the X variable and total_convert as the Y variable. Make sure you’re using train as the dataset.
Call plot to view your visualization.
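A minimal sketch of this scatter plot, assuming the ggplot2 package is installed and again using a small hypothetical train data frame with invented values:

```r
library(ggplot2)

# Hypothetical stand-in for the lesson's train dataset (values invented)
train <- data.frame(
  clicks        = c(10, 25, 40, 60, 85, 120),
  total_convert = c(1, 2, 2, 5, 4, 9)
)

# Map clicks to the x axis and total_convert to the y axis,
# then draw one point per observed row
plot <- ggplot(train, aes(x = clicks, y = total_convert)) +
  geom_point()

# Evaluating the object renders the plot
plot
```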
Plot the model’s estimated values for our train dataset by adding another call to geom_point(). Make sure to explicitly pass in estimate as the y value, and set the color parameter equal to "blue".
Let’s explicitly plot the vertical distance between the estimate values and their respective observed data points. Add a call to geom_segment(), passing in xend = clicks and yend = estimate as arguments to aes(). Don’t forget to set color = "gray"!
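Put together, the estimate points and residual segments could be layered as in the sketch below. It assumes ggplot2 is installed and uses a hypothetical train data frame and an assumed model formula, total_convert ~ clicks, neither of which comes from the lesson's actual data.

```r
library(ggplot2)

# Hypothetical stand-in for the lesson's train dataset (values invented)
train <- data.frame(
  clicks        = c(10, 25, 40, 60, 85, 120),
  total_convert = c(1, 2, 2, 5, 4, 9)
)

# Assumed model specification, with estimates and residuals saved as columns
model <- lm(total_convert ~ clicks, data = train)
train$estimate  <- predict(model)
train$residuals <- residuals(model)

plot <- ggplot(train, aes(x = clicks, y = total_convert)) +
  # Observed data points
  geom_point() +
  # Estimated values on the regression line, overriding y for this layer only
  geom_point(aes(y = estimate), color = "blue") +
  # Vertical segments from each observed point to its estimate
  geom_segment(aes(xend = clicks, yend = estimate), color = "gray")

plot
```

Each segment starts at the inherited x and y (the observed point) and ends at the same clicks value on the estimate line, so its length is the absolute value of that point's residual.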
Finally, we should provide another way to observe the size of the residuals. Update our first call to geom_point() by passing in size = abs(residuals) as an argument to aes(). As total_convert increases, how do the values of the model residuals change?
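For reference, one way the finished plot might be written is sketched here, under the same assumptions as before: ggplot2 installed, a hypothetical train data frame with invented values, and an assumed formula of total_convert ~ clicks.

```r
library(ggplot2)

# Hypothetical stand-in for the lesson's train dataset (values invented)
train <- data.frame(
  clicks        = c(10, 25, 40, 60, 85, 120),
  total_convert = c(1, 2, 2, 5, 4, 9)
)

# Assumed model specification, with estimates and residuals saved as columns
model <- lm(total_convert ~ clicks, data = train)
train$estimate  <- predict(model)
train$residuals <- residuals(model)

plot <- ggplot(train, aes(x = clicks, y = total_convert)) +
  # Observed points, sized by the magnitude of their residual;
  # inside aes(), residuals refers to the train column, not the function
  geom_point(aes(size = abs(residuals))) +
  # Estimated values on the regression line
  geom_point(aes(y = estimate), color = "blue") +
  # Vertical distance between each observation and its estimate
  geom_segment(aes(xend = clicks, yend = estimate), color = "gray")

plot
```

Mapping size inside aes() ties it to the data, so ggplot2 adds a legend for it; larger points mark observations that the model estimates less accurately.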