Learn

Great! We can build a model! But… how do we know if it’s any good? And if another data scientist builds a different model using a different independent variable, how can we tell which model is “best”? Even within statistics, “best” can be a subjective qualifier. However, scientists who use regression models generally agree that the best model is the one that minimizes the distances between the data points and the estimation line drawn by the model. The vertical distance between a data point and the line estimated by a regression model is called a residual; residuals and their aggregations are the fundamental building blocks of measures of regression model fit and accuracy.

Because residuals are distances on a Cartesian plane, it often helps to visualize their values. A residual is the observed value minus the estimated value, so it is positive for points above the line and negative for points below it. For instance, consider the plot of a simple linear regression alongside its training data below. Note that one point is 4 units above the regression estimate line; in this example, the residual for that point is 4. Meanwhile, another point is 2 units below the regression estimate line; the residual for that point is -2. A data point is best fit by the model that produces the residual closest to zero for that point.
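The arithmetic behind those two residuals can be sketched in a few lines of R. The values in observed and estimate are invented for illustration and do not come from the lesson's dataset.

```r
# Hypothetical observed values and regression estimates for three points
observed <- c(10, 5, 7)
estimate <- c(6, 7, 7)

# A residual is the observed value minus the estimated value
residual <- observed - estimate

# The first point sits 4 above the line, the second 2 below, the third on it
print(residual)
```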

residual graph

When scientists make quantitative arguments for a best fit model, they rely on an aggregation, often the sum or average, of residual values across an entire dataset. While it is easy to be overwhelmed by the variety of measures used to argue that one model is better than another, it is crucial to realize that all of these measures are grounded in the simple difference between regression estimate and observed data point. Let’s produce a visualization of our own model of clicks on total_convert to better understand our model residuals.
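As a rough sketch of what such an aggregation looks like, the example below fits two competing models to a small invented dataset and compares them. The data frame toy, its columns, and the names model_a and model_b are all hypothetical. One detail worth noting: the raw residuals of a least-squares fit with an intercept sum to roughly zero, so aggregations are usually taken over squared or absolute residuals.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  x1 = 1:10,
  x2 = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3),
  y  = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9)
)

# Two competing models, each using a different independent variable
model_a <- lm(y ~ x1, data = toy)
model_b <- lm(y ~ x2, data = toy)

# Raw residuals cancel out, so their plain sum is not informative
raw_sum_a <- sum(residuals(model_a))

# Sum of squared residuals and mean absolute residual for each model
rss_a <- sum(residuals(model_a)^2)
rss_b <- sum(residuals(model_b)^2)
mae_a <- mean(abs(residuals(model_a)))
mae_b <- mean(abs(residuals(model_b)))

# Smaller values indicate a closer fit to the observed data
print(c(rss_a = rss_a, rss_b = rss_b))
print(c(mae_a = mae_a, mae_b = mae_b))
```

In this toy data, y tracks x1 closely, so model_a should come out with the smaller aggregates.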

Instructions

1.

First, let’s pull out the points that make up the estimate line, and their respective residual values, from our model. We can save them back to our train dataset in columns called estimate and residuals.

  • Call predict() on model and save the result to train$estimate.
  • Call residuals() on model and save the result to train$residuals.
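A minimal sketch of this step follows. The lesson's actual train data and model are not shown here, so the data frame below is an invented stand-in, and the model formula total_convert ~ clicks is an assumption based on the axes used in the later steps.

```r
# Hypothetical stand-in for the lesson's train dataset (values are invented)
train <- data.frame(
  clicks = c(1, 2, 3, 4, 5, 6),
  total_convert = c(1, 3, 2, 5, 4, 7)
)

# Assumed model form: total_convert explained by clicks
model <- lm(total_convert ~ clicks, data = train)

# Save the fitted values and the residuals back to the dataset
train$estimate <- predict(model)
train$residuals <- residuals(model)

# Each residual equals the observed value minus the estimate,
# so this difference should be zero up to floating-point error
max_gap <- max(abs(train$residuals - (train$total_convert - train$estimate)))
print(max_gap)
```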
2.

Plot the values of clicks and total_convert using a combination of ggplot() and geom_point(). Save the result to a variable called plot. Don’t forget to pass in clicks as the X variable and total_convert as the Y variable. Make sure you’re using train as the dataset.

Call plot to view your visualization.

3.

Plot the estimated values from our model on top of the observed data points by adding another call to geom_point(). Make sure to explicitly pass in estimate as a y value, and set the color parameter equal to "blue".

4.

Let’s explicitly plot the vertical distance between each estimate value and its respective observed data point. Add a call to geom_segment(), passing in xend = clicks and yend = estimate as arguments to aes(). Don’t forget to set color = "gray"!

5.

Finally, we should provide another way to observe the size of residuals. Update our first call to geom_point() by passing in size = abs(residuals) as an argument to aes(). As total_convert increases, how does the size of the model residuals change?
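Put together, steps 2 through 5 might look something like the sketch below. It assumes the ggplot2 package is installed, and it rebuilds a small invented train dataset and an assumed total_convert ~ clicks model so that it runs on its own; with the lesson's real data, only the plotting calls would be needed.

```r
library(ggplot2)  # assumes the ggplot2 package is installed

# Hypothetical stand-in for the lesson's train dataset (values are invented)
train <- data.frame(
  clicks = c(1, 2, 3, 4, 5, 6),
  total_convert = c(1, 3, 2, 5, 4, 7)
)

# Assumed model form: total_convert explained by clicks
model <- lm(total_convert ~ clicks, data = train)
train$estimate <- predict(model)
train$residuals <- residuals(model)

plot <- ggplot(train, aes(x = clicks, y = total_convert)) +
  # Observed points, sized by the absolute value of their residual
  geom_point(aes(size = abs(residuals))) +
  # Estimated points on the regression line
  geom_point(aes(y = estimate), color = "blue") +
  # Vertical segment from each observed point to its estimate
  geom_segment(aes(xend = clicks, yend = estimate), color = "gray")

# Call plot to view the visualization
plot
```

Because geom_segment() inherits x and y from the top-level aes(), only xend and yend need to be supplied to draw each vertical line.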
