Data scientists are often interested in building models to make predictions on new data. While the add_predictions() function from the modelr package makes it easy to generate predicted values from a technical standpoint, it is far more difficult to develop and assess accurate predictions.

The most common metric used to quantify the accuracy of predicted values is mean squared error (MSE) on test data. Like residual standard error (RSE) and R-squared, MSE is computed from the squared differences between observed and predicted values; specifically, it measures the average squared difference between the two. When we are working with just one model, it is helpful to compare the MSE on our training dataset to the MSE on our test dataset. We can calculate training MSE for a model using a combination of add_predictions() and summarise(). add_predictions() adds the predicted values from a model to a column called pred. summarise() then allows us to calculate the mean of the squared differences between our observed values (sales) and predicted values (pred).
train %>%
  add_predictions(model) %>%
  summarise(MSE = mean((sales - pred)^2))

# output:
#      MSE
# 31.60713
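The same pipeline can be pointed at the held-out data to get test MSE. Here is a minimal sketch, assuming the test split is stored in a data frame called test with the same sales column:

# Test MSE: predict on the held-out data, then average the squared errors.
# Assumes the held-out split is stored in a data frame called test.
test %>%
  add_predictions(model) %>%
  summarise(MSE = mean((sales - pred)^2))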
Running the same combination of functions on our test dataset results in an MSE of around 32.5. Test MSE will almost always be higher than training MSE, since the model was built from the training data; however, it is important to confirm that there is not a substantial difference between the training and test MSE. The value of using MSE to quantify prediction accuracy is clearer when comparing multiple models, as it allows us to determine which version of a model best predicts the outcome variable. For instance, we could compute the MSE for a model of tv spending on sales.
model2 <- lm(sales ~ tv, data = train)

train %>%
  add_predictions(model2) %>%
  summarise(MSE = mean((sales - pred)^2))

# output:
#      MSE
# 27.28415
Comparing the training MSE of our tv-based model, 27.28, to the training MSE of our podcast-based model, 31.61, it is clear that the predictions from the tv-based model are more accurate, since its MSE is lower. If a data scientist were trying to predict the expected volume of sales for a future business quarter, they would be better off basing their estimates on the tv-based model.
Instructions
The code used to build two simple linear models, model, a regression of clicks on total_convert, and model2, a regression of impressions on total_convert, is already included in your notebook.

Use add_predictions() and summarize() to calculate the test mean squared error (MSE) of model, and save the result to a variable called mse_clicks. Make sure to use the test dataset.
Use add_predictions() and summarize() to calculate the MSE of model2, and save the result to a variable called mse_impressions.

Print out both mse_clicks and mse_impressions. Which one is smaller? What does this tell us about the accuracy of our models on the test data?
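One possible sketch of these steps is shown below; it assumes total_convert is the outcome variable in both models (for example, lm(total_convert ~ clicks, data = train) and lm(total_convert ~ impressions, data = train)) and that the held-out data is stored in a data frame called test.

# Hypothetical sketch: assumes total_convert is the outcome in both models
# and that the held-out split is stored in a data frame called test.
mse_clicks <- test %>%
  add_predictions(model) %>%
  summarize(MSE = mean((total_convert - pred)^2))

mse_impressions <- test %>%
  add_predictions(model2) %>%
  summarize(MSE = mean((total_convert - pred)^2))

mse_clicks
mse_impressions

# Whichever model has the smaller test MSE makes the more accurate
# predictions on the test data.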
Let’s plot the predicted test values of the model with the smallest MSE against the observed test data.

First, use a combination of add_predictions(), ggplot(), and geom_point() to plot the observed values from the test data. When creating the ggplot(), add an aes() where the x and y values correspond to the variables used when making the correct model. Save the visualization to a variable called plot, then print out the variable.
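For example, if the clicks-based model turns out to have the smaller test MSE (and assuming total_convert is its outcome variable), the observed values could be plotted like this:

# Plot the observed test values; x and y match the variables used to fit model.
plot <- test %>%
  add_predictions(model) %>%
  ggplot(aes(x = clicks, y = total_convert)) +
  geom_point()

plot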
Now we can add in our predicted values! Add another call to geom_point(), explicitly passing in pred as the y value in aes(), and set the color parameter equal to "blue".
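Continuing the same sketch, the prediction layer can be added to the saved plot:

# Add the predicted values as a second layer of points, drawn in blue.
# pred is available because the data passed to ggplot() came from add_predictions().
plot <- plot +
  geom_point(aes(y = pred), color = "blue")

plot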