Great! We can build a model! But… how do we know if it’s any good? Also, if another data scientist builds a different model using a different independent variable, how can we tell which model is “best”? Even within statistics, “best” can be a subjective qualifier. However, scientists who use regression models generally agree that the best model is the one that minimizes the distance between the data points and the estimation line drawn by the model. The vertical distance between a data point and the line estimated by a regression model is called a residual; residuals and their aggregations are the building blocks of every measure of regression model fit and accuracy.
Because residuals are based on Cartesian distances, it often helps to visualize their values. For instance, consider the plot of a simple linear regression alongside its training data below. Note that one point is 4 units above the regression estimate line; in this example, the residual for that point is 4. Meanwhile, another point is 2 units below the regression estimate line; the residual for that point is -2. A data point is best fit by the model that results in the smallest absolute residual for that point.
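As a compact restatement of the sign convention in the example above, the residual for observation i can be written as the observed value minus the model's estimate. The symbols here are introduced only for illustration and are not part of the original lesson:

$$
e_i = y_i - \hat{y}_i
$$

Here y_i is the observed value, \hat{y}_i is the value on the regression estimate line at the same x, and e_i is the residual. A point above the line gives a positive residual (such as 4), and a point below the line gives a negative one (such as -2).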
When scientists make quantitative arguments for a best fit model, they rely on an aggregation, often the sum or average, of residual values across an entire dataset. While it is easy to be overwhelmed by the variety of measures used to argue that one model is better than another, it is crucial to realize that all of these measures are grounded in the simple difference between the regression estimate and the observed data point. Let’s produce a visualization of our own model of total_convert to better understand our model residuals.
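To make the idea of aggregating residuals concrete, here is a minimal sketch that uses only the two residuals from the earlier example (4 and -2). The variable name example_residuals is hypothetical, and the specific aggregations shown are common choices rather than ones the lesson prescribes.

```r
# The two residuals from the example: one point 4 units above the line,
# one point 2 units below it
example_residuals <- c(4, -2)

# Raw sum: positive and negative residuals partly cancel out
sum(example_residuals)        # 2

# Mean absolute residual: average distance from the line, ignoring direction
mean(abs(example_residuals))  # 3

# Sum of squared residuals: squaring removes the sign and penalizes large misses
sum(example_residuals^2)      # 20
```

Because the raw sum lets errors in opposite directions cancel, aggregations based on absolute or squared residuals are usually more informative when comparing models.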
First, let’s pull out the points that make up the estimate line, and their respective residual values, from our model. We can save them to columns called estimate and residuals in our main train training dataset.
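Call predict() on model and save the result to the estimate column of train. Then extract the residuals from model and save the result to the residuals column of train.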
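The step above might look like the following sketch. The lesson's actual train dataset and model are not shown in this section, so the small data frame and the lm() formula below are hypothetical stand-ins, and using residuals() to extract the residuals is an assumption about the function the lesson intends.

```r
# Hypothetical stand-in for the lesson's train dataset
train <- data.frame(
  clicks = c(1, 2, 4, 7, 9),
  total_convert = c(1, 3, 4, 9, 10)
)

# Assumed model form: total_convert predicted from clicks
model <- lm(total_convert ~ clicks, data = train)

# Points on the estimate line, one per row of train
train$estimate <- predict(model)

# Observed value minus estimate, one per row of train
train$residuals <- residuals(model)

head(train)
```

Calling predict() without new data returns the fitted values for the rows the model was trained on, which is why the result lines up with train row by row.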
Plot the values of total_convert using a combination of ggplot() and geom_point(). Save the result to a variable called plot. Don’t forget to pass in clicks as the X variable, and total_convert as the Y variable. Make sure you’re using train as the dataset. Call plot to view your visualization.
Plot the model’s estimates for our train dataset by adding another call to geom_point(). Make sure to explicitly pass in estimate as a y value, and set the color parameter so the estimates are easy to tell apart from the observed points.
Let’s explicitly plot the vertical distance between estimate values and their respective observed data points. Add a call to geom_segment(), passing in xend = clicks and yend = estimate as arguments to aes(). Don’t forget to set color = "gray"!
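Putting the plotting steps so far together, a sketch of the layered plot could look like this. The data frame and model are hypothetical stand-ins for the lesson's own, and "blue" is an arbitrary color chosen here for the estimate points.

```r
library(ggplot2)

# Hypothetical stand-in data and model (not the lesson's real dataset)
train <- data.frame(
  clicks = c(1, 2, 4, 7, 9),
  total_convert = c(1, 3, 4, 9, 10)
)
model <- lm(total_convert ~ clicks, data = train)
train$estimate <- predict(model)
train$residuals <- residuals(model)

plot <- ggplot(train, aes(x = clicks, y = total_convert)) +
  # Observed values
  geom_point() +
  # Model estimates at the same x positions; color is an arbitrary choice
  geom_point(aes(y = estimate), color = "blue") +
  # Vertical segment from each observed point to its estimate
  geom_segment(aes(xend = clicks, yend = estimate), color = "gray")

plot
```

Each gray segment inherits its starting point (x = clicks, y = total_convert) from the ggplot() call and ends at the estimate, so its length is the absolute residual for that point.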
Finally, we should provide another way to observe the size of residuals. Update our first call to geom_point() by passing in size = abs(residuals) as an argument to aes(). As total_convert increases, how do the values of the model residuals change?
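A sketch of the finished plot, with point size mapped to the absolute residual, is shown below. As before, the data and model are hypothetical stand-ins and "blue" is an arbitrary color choice, so the pattern you see with the lesson's real data may differ.

```r
library(ggplot2)

# Hypothetical stand-in data and model (not the lesson's real dataset)
train <- data.frame(
  clicks = c(1, 2, 4, 7, 9),
  total_convert = c(1, 3, 4, 9, 10)
)
model <- lm(total_convert ~ clicks, data = train)
train$estimate <- predict(model)
train$residuals <- residuals(model)

plot <- ggplot(train, aes(x = clicks, y = total_convert)) +
  # Observed values, drawn larger where the model misses by more
  geom_point(aes(size = abs(residuals))) +
  # Model estimates; color is an arbitrary choice
  geom_point(aes(y = estimate), color = "blue") +
  # Vertical distance between each observation and its estimate
  geom_segment(aes(xend = clicks, yend = estimate), color = "gray")

plot
```

Because size is set inside aes(), it is mapped from the residuals column row by row rather than applied as one fixed size to every point.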