Data scientists are often interested in building models to make predictions on new data. While the add_predictions() function from the modelr package makes predicting new values technically easy, developing and assessing accurate predictions is far more difficult.
The most common metric used to assess the accuracy of predicted values is mean squared error (MSE) on test data. Like residual standard error (RSE) and R-squared, MSE is based on the squared differences between predicted and observed values; specifically, it is the average of those squared differences. When we are working with just one model, it is helpful to compare MSE on our training dataset with MSE on our test data. We can calculate training MSE for model using a combination of add_predictions() and summarise().
add_predictions() creates predicted values from a model and adds them to a column called pred.
summarise() then allows us to calculate the mean of the squared difference between our observed values (sales) and predicted values (pred).
train %>%
  add_predictions(model) %>%
  summarise(MSE = mean((sales - pred)^2))
# output: MSE 31.60713
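For reference, the quantity that mean((sales - pred)^2) computes can be written out as below. The symbols are introduced here for illustration and are not from the original text: n is the number of observations, y_i is an observed value (sales), and \hat{y}_i is the matching predicted value (pred).

```latex
\[
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\]
```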
We can use the same combination of functions to calculate MSE for our test dataset, which results in an MSE of around 32.5. Test MSE is usually higher than training MSE, because the model was fit to the training data; however, it is important to confirm that the gap between training and test MSE is not substantial, since a large gap suggests the model is overfitting the training data. The value of using MSE to quantify prediction accuracy is clearer when comparing multiple models, as it allows us to determine which version of a model best predicts an outcome variable. For instance, we could compute the MSE for a model that regresses sales on tv spending.
model2 <- lm(sales ~ tv, data = train)
train %>%
  add_predictions(model2) %>%
  summarise(MSE = mean((sales - pred)^2))
# output: MSE 27.28415
Comparing the training MSE for our tv-based model, 27.28, to the training MSE for our podcast-based model, 31.61, we see that the tv-based model's predictions are more accurate on the training data, as its MSE is lower. If a data scientist were trying to predict the expected volume of sales for a future business quarter, it would be better to base the estimate on the tv-based model, provided the same ordering holds on test data.
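The comparison above can be sketched end to end on simulated data. This is a minimal illustration rather than the lesson's actual dataset: the data frame ads, its simulated values, the 60/40 split, and the helper mse() are all assumptions made for the example. Only add_predictions(), summarise(), and the model formulas come from the text.

```r
library(dplyr)
library(modelr)

set.seed(42)  # make the simulated data reproducible

# Simulated stand-in for the advertising data (hypothetical values).
n <- 200
ads <- tibble(
  tv      = runif(n, 0, 300),
  podcast = runif(n, 0, 50)
) %>%
  mutate(sales = 7 + 0.05 * tv + 0.02 * podcast + rnorm(n, sd = 2))

# Hold out 40% of the rows as test data (assumed split).
test_rows <- sample(n, size = 0.4 * n)
train <- ads[-test_rows, ]
test  <- ads[test_rows, ]

model  <- lm(sales ~ podcast, data = train)  # podcast-based model
model2 <- lm(sales ~ tv, data = train)       # tv-based model

# Hypothetical helper: MSE of `fit` on `data`, assuming the outcome is `sales`.
mse <- function(fit, data) {
  data %>%
    add_predictions(fit) %>%                     # adds the `pred` column
    summarise(MSE = mean((sales - pred)^2)) %>%
    pull(MSE)
}

# Training and test MSE side by side for both models.
results <- tibble(
  model     = c("podcast", "tv"),
  train_mse = c(mse(model, train), mse(model2, train)),
  test_mse  = c(mse(model, test),  mse(model2, test))
)
print(results)
```

Putting training and test MSE in one table makes it easy to check both points from the text: whether each model's test MSE is far from its training MSE, and which model has the lower test MSE.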
The code used to build two simple linear models, model (a regression of clicks on total_convert) and model2 (a regression of impressions on total_convert), is already included in your notebook.
Use add_predictions() and summarize() to calculate the test mean squared error (MSE) of model, and save the result to a variable.
Make sure to use the test data.
Use add_predictions() and summarize() to calculate the MSE of model2, and save the result to a variable called mse_impressions.
Print out both MSE values. Which one is smaller? What does this tell us about the accuracy of our models on the test data?
Let’s plot the predicted test values of the model with the smaller MSE against the observed test data.
First, use a combination of ggplot() and geom_point() to plot the observed values from the test data.
When creating the ggplot(), add an aes() where the x and y values correspond to the variables used when making the correct model.
Save the visualization to a variable called plot, then print out the variable.
Now we can add in our predicted values! Add another call to geom_point(), explicitly passing in pred as the y value in aes(), and set the color parameter equal to blue.
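The two plotting steps might look roughly like the sketch below. It runs on simulated data, and the column names x_var and y_var, the data frame test_pred, and the fitted model fit are hypothetical stand-ins for the notebook's own variables. It also assumes the test data has already been given a pred column via add_predictions().

```r
library(ggplot2)
library(modelr)

set.seed(1)  # make the simulated data reproducible

# Simulated stand-in data; `x_var` and `y_var` are hypothetical column names.
df <- data.frame(x_var = runif(100, 0, 10))
df$y_var <- 2 + 3 * df$x_var + rnorm(100)
train <- df[1:70, ]
test  <- df[71:100, ]

fit <- lm(y_var ~ x_var, data = train)

# Add the model's predictions to the test data as a `pred` column.
test_pred <- add_predictions(test, fit)

# Observed test values in the default color, predicted values in blue.
plot <- ggplot(test_pred, aes(x = x_var, y = y_var)) +
  geom_point() +
  geom_point(aes(y = pred), color = "blue")
print(plot)
```

Because the second geom_point() overrides only y, both layers share the same x values, so each blue point sits directly above or below the observation it predicts.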