Data scientists are often interested in building models to make predictions on new data. While the `add_predictions()` function from the `modelr` package makes it easy to predict new values from a technical standpoint, it is far more difficult to develop and assess accurate predicted values.

The most common metric for assessing the accuracy of predicted values is mean squared error (MSE) computed on test data. Like residual standard error (RSE) and R-squared, MSE is based on the squared differences between predicted and observed values; specifically, it is their average. When we are working with just one model, it is helpful to compare MSE on our training dataset with MSE on our test data. We can calculate the training MSE for a model using a combination of `add_predictions()` and `summarise()`. `add_predictions()` generates predicted values from a model and adds them in a column called `pred`. `summarise()` then allows us to calculate the mean of the squared differences between our observed values (`sales`) and predicted values (`pred`).

```r
train %>%
  add_predictions(model) %>%
  summarise(MSE = mean((sales - pred)^2))
# output: MSE 31.60713
```
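For reference, the quantity computed by `mean((sales - pred)^2)` can be written in symbols. The notation here is an assumption added for illustration: $y_i$ is the observed value (`sales`) for row $i$, $\hat{y}_i$ is the model's prediction (`pred`), and $n$ is the number of rows in the dataset being evaluated.

$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

Because the differences are squared, MSE is expressed in squared units of the outcome, and large errors are penalized more heavily than small ones.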

We can use the same combination of functions to calculate MSE for our test dataset, which results in an MSE of around 32.5. Test MSE is usually higher than training MSE, because the model was fit to the training data; however, it is important to confirm that there is not a substantial difference between training and test MSE.

The value of using MSE to quantify prediction accuracy is clearer when comparing multiple models, as it allows us to determine which version of a model best predicts an outcome variable. For instance, we could compute the MSE for a model that regresses sales on tv spending.

```r
model2 <- lm(sales ~ tv, data = train)
train %>%
  add_predictions(model2) %>%
  summarise(MSE = mean((sales - pred)^2))
# output: MSE 27.28415
```

Comparing the training MSE of our tv-based model (27.28) with the training MSE of our podcast-based model (31.61), we can see that the tv-based model’s predictions are more accurate on the training data, because its MSE is lower. If a data scientist were trying to predict the expected volume of sales for a future business quarter, it would be a better idea to base those estimates on the tv-based model, provided its test MSE is also the lower of the two.
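To see the whole workflow in one place, here is a minimal sketch that compares two models on both training and test data. It does not use the lesson's dataset: the `ads` data frame is simulated, and the names `model_podcast`, `model_tv`, `mse()` and `mse_table` are hypothetical, introduced only for illustration.

```r
library(dplyr)   # summarise(), pull(), %>%
library(modelr)  # add_predictions()

set.seed(42)  # make the simulated data reproducible

# Simulated (hypothetical) advertising data: sales depends strongly on tv
# spending and only weakly on podcast spending.
n <- 200
ads <- data.frame(tv = runif(n, 0, 100), podcast = runif(n, 0, 50))
ads$sales <- 5 + 0.20 * ads$tv + 0.02 * ads$podcast + rnorm(n, sd = 2)

# 80/20 train/test split
train_idx <- sample(n, size = 0.8 * n)
train <- ads[train_idx, ]
test  <- ads[-train_idx, ]

# Both models are fit on the training data only
model_podcast <- lm(sales ~ podcast, data = train)
model_tv      <- lm(sales ~ tv, data = train)

# Helper: MSE of `model` on `data`, assuming the outcome column is `sales`
mse <- function(data, model) {
  data %>%
    add_predictions(model) %>%
    summarise(MSE = mean((sales - pred)^2)) %>%
    pull(MSE)
}

# One row per model, with training and test MSE side by side
mse_table <- data.frame(
  model     = c("podcast", "tv"),
  train_mse = c(mse(train, model_podcast), mse(train, model_tv)),
  test_mse  = c(mse(test, model_podcast), mse(test, model_tv))
)
print(mse_table)
```

Putting training and test MSE next to each other makes two checks easy: whether each model's test MSE is close to its training MSE, and which model has the lower test MSE.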

### Instructions

**1.**

The code used to build two simple linear models, `model`, a regression of `clicks` on `total_convert`, and `model2`, a regression of `impressions` on `total_convert`, is already included in your notebook.

Use `add_predictions()` and `summarize()` to calculate the test mean squared error (*MSE*) of `model`, and save the result to a variable called `mse_clicks`.

Make sure to use the `test` dataset.

**2.**

Use `add_predictions()` and `summarize()` to calculate the test MSE of `model2`, and save the result to a variable called `mse_impressions`.

**3.**

Print out both `mse_clicks` and `mse_impressions`. Which one is smaller? What does this tell us about the accuracy of our models on the test data?

**4.**

Let’s plot the predicted test values of the model with the smaller MSE against the observed test data.

First, use a combination of `add_predictions()`, `ggplot()`, and `geom_point()` to plot the observed values from the test data.

When calling `ggplot()`, pass an `aes()` whose `x` and `y` values correspond to the predictor and outcome variables used to build the model with the smaller MSE.

Save the visualization to a variable called `plot`, then print out the variable.

**5.**

Now we can add in our predicted values! Add another call to `geom_point()`, explicitly passing in `pred` as the `y` value in `aes()`, and set the `color` parameter equal to `"blue"`.
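The plotting pattern from steps 4 and 5 can be sketched on a different dataset. This is an illustration only, not the exercise solution: it uses R's built-in `mtcars` data, and the names `cars_train`, `cars_test` and `cars_model` are hypothetical.

```r
library(dplyr)    # %>%
library(ggplot2)  # ggplot(), aes(), geom_point()
library(modelr)   # add_predictions()

set.seed(1)  # make the split reproducible

# Hold out 10 rows of the built-in mtcars data as a stand-in test set
test_rows  <- sample(nrow(mtcars), size = 10)
cars_train <- mtcars[-test_rows, ]
cars_test  <- mtcars[test_rows, ]

# Simple linear model fit on the training rows: mpg is the outcome, wt the predictor
cars_model <- lm(mpg ~ wt, data = cars_train)

# Observed test values: x is the predictor, y is the outcome
plot <- cars_test %>%
  add_predictions(cars_model) %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()

# Predicted test values: override only the y aesthetic and color the points blue
plot <- plot +
  geom_point(aes(y = pred), color = "blue")

print(plot)
```

Because `color = "blue"` is set outside `aes()`, it applies one fixed color to every predicted point instead of mapping color to a column in the data.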