
Great! We can build a model! But… how do we know if it’s any good? And if another data scientist builds a different model using a different independent variable, how can we tell which model is “best”? Even within statistics, “best” can be a subjective qualifier. However, scientists who use regression models generally agree that the best model is the one that minimizes the distance between the observed data points and the line estimated by the model. The vertical distance between a data point and the line estimated by a regression model is called a residual; residuals and their aggregations are the building blocks of measures of regression model fit and accuracy.

Because residuals are based on Cartesian distances, it often helps to visualize their values. For instance, consider the plot of a simple linear regression alongside its training data below. Note that one point is 4 units above the regression estimate line; in this example, the residual for that point is 4. Meanwhile, another point is 2 units below the regression estimate line; the residual for that point is -2. A data point is best fit by the model that produces the smallest residual, in absolute value, for that point. When scientists make quantitative arguments for a best-fit model, they rely on an aggregation, often the sum or average, of residual values across an entire dataset. While it is easy to be overwhelmed by the variety of measures used to argue that one model is better than another, it is crucial to realize that all of these measures are grounded in the simple difference between the observed data point and the regression estimate. Let’s produce a visualization of our own model, which regresses `total_convert` on `clicks`, to better understand our model residuals.
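To make the arithmetic concrete, here is a minimal sketch using made-up numbers. The vectors `observed` and `estimate` are hypothetical and are not taken from the lesson's dataset; they simply reproduce the "4 units above" and "2 units below" cases described above. Squared and absolute values are shown as two common ways to aggregate residuals so that positive and negative values don't cancel out.

```r
# Hypothetical toy values, NOT the lesson's `train` data
observed <- c(10, 4, 7)  # observed y values
estimate <- c(6, 6, 7)   # matching points on a hypothetical regression line

# A residual is the observed value minus the estimated value
toy_residuals <- observed - estimate
toy_residuals            # 4 -2 0

# Two common aggregations across a dataset
sum(toy_residuals^2)     # sum of squared residuals: 20
mean(abs(toy_residuals)) # mean absolute residual: 2
```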

### Instructions

1.

First, let’s pull out the points that make up the estimate line, and their respective residual values, from our model. We can save them back to our `train` dataset as two new columns called `estimate` and `residuals`.

• Call `predict()` on `model` and save the result to `train$estimate`.
• Call `residuals()` on `model` and save the result to `train$residuals`.
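
As a rough sketch of this step, the code below builds a small simulated data frame as a stand-in for `train` and assumes `model` was fit with `lm()`. Both are assumptions made for illustration; in the exercise, `train` and `model` already exist in your workspace.

```r
# Simulated stand-in for the lesson's `train` data (an assumption)
set.seed(1)
train <- data.frame(clicks = 1:20)
train$total_convert <- 2 + 0.5 * train$clicks + rnorm(20)

# Assumed model form: total_convert regressed on clicks
model <- lm(total_convert ~ clicks, data = train)

# Save the estimates and residuals back to the data frame
train$estimate <- predict(model)
train$residuals <- residuals(model)

# Sanity check: each residual is observed minus estimated
all.equal(train$residuals, train$total_convert - train$estimate,
          check.attributes = FALSE)
head(train)
```
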
2.

Plot the values of `clicks` and `total_convert` using a combination of `ggplot()` and `geom_point()`. Save the result to a variable called `plot`. Don’t forget to pass in `clicks` as the X variable and `total_convert` as the Y variable. Make sure you’re using `train` as the dataset.

Call `plot` to view your visualization.
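
One way this step might look, again using a simulated stand-in for `train` (an assumption, not the lesson's data):

```r
library(ggplot2)

# Simulated stand-in for the lesson's `train` data (an assumption)
set.seed(1)
train <- data.frame(clicks = 1:20)
train$total_convert <- 2 + 0.5 * train$clicks + rnorm(20)

# Scatter plot of the observed values
plot <- ggplot(train, aes(x = clicks, y = total_convert)) +
  geom_point()

plot  # printing the object draws the plot
```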

3.

Plot the values estimated by our model on top of the observed data points by adding another call to `geom_point()`. Make sure to explicitly pass in `estimate` as the y value inside `aes()`, and set the `color` parameter equal to `"blue"`.

4.

Let’s explicitly plot the vertical distance between the `estimate` values and their respective observed data points. Add a call to `geom_segment()`, passing in `xend = clicks` and `yend = estimate` as arguments to `aes()`. Don’t forget to set `color = "gray"`!
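
The sketch below layers the blue estimate points and the gray residual segments onto the scatter plot. It assumes a simulated stand-in for `train` and an `lm()` fit for `model`; neither comes from the lesson itself. Each segment starts at the observed point (the `x` and `y` inherited from `ggplot()`) and ends at the estimate for the same `clicks` value.

```r
library(ggplot2)

# Simulated stand-in for the lesson's data and model (assumptions)
set.seed(1)
train <- data.frame(clicks = 1:20)
train$total_convert <- 2 + 0.5 * train$clicks + rnorm(20)
model <- lm(total_convert ~ clicks, data = train)
train$estimate <- predict(model)

plot <- ggplot(train, aes(x = clicks, y = total_convert)) +
  geom_point() +                                   # observed values
  geom_point(aes(y = estimate), color = "blue") +  # estimated values
  geom_segment(aes(xend = clicks, yend = estimate),
               color = "gray")                     # residual distances

plot
```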

5.

Finally, we should provide another way to observe the size of residuals. Update our first call to `geom_point()` by passing in `size = abs(residuals)` as an argument to `aes()`. As `total_convert` increases, how do the values of the model residuals change?
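
A possible version of the finished plot is sketched here. As before, the simulated `train` data and the `lm()` fit are assumptions for illustration, so the pattern you see in this toy plot says nothing about the pattern in the real data.

```r
library(ggplot2)

# Simulated stand-in for the lesson's data and model (assumptions)
set.seed(1)
train <- data.frame(clicks = 1:20)
train$total_convert <- 2 + 0.5 * train$clicks + rnorm(20)
model <- lm(total_convert ~ clicks, data = train)
train$estimate <- predict(model)
train$residuals <- residuals(model)

plot <- ggplot(train, aes(x = clicks, y = total_convert)) +
  # Larger points mark observations farther from the estimate line
  geom_point(aes(size = abs(residuals))) +
  geom_point(aes(y = estimate), color = "blue") +
  geom_segment(aes(xend = clicks, yend = estimate), color = "gray")

plot
```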