Once we have an understanding of the kind of relationship our model describes, we want to understand the extent to which this modeled relationship actually fits the data. This is typically referred to as the goodness-of-fit. In simple linear models, we can measure this quantitatively by assessing two things:
- Residual standard error (RSE)
- R squared (R^2)
The RSE is an estimate of the standard deviation of the error term of the model (the error in our mathematical definition of linear regression). Roughly speaking, it is the average amount that the response will deviate from the true regression line. The RSE appears near the bottom of the output of summary(model), but we can also retrieve it directly:
sigma(model) # output: 3.2
An RSE value of 3.2 means the actual sales in each market will deviate from the true regression line by approximately 3,200 units, on average. Is this too large a deviation? Well, that’s subjective, but when compared to the average value of sales over all markets, the percentage error is 22%:
sigma(model)/mean(train$sales) # output [1] 0.2207373
The RSE provides an absolute measure of lack of fit of our model to the data. But since it is measured in the units of Y, it is not always clear what constitutes a good RSE.
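Under the hood, the RSE is computed from the residual sum of squares. A minimal sketch on simulated data (the train data frame and its podcast and sales columns are stand-ins for the lesson's dataset):

```r
# Simulated stand-in for the lesson's training data (column names are assumptions)
set.seed(42)
train <- data.frame(podcast = runif(100, min = 0, max = 50))
train$sales <- 5 + 0.4 * train$podcast + rnorm(100, sd = 3)

model <- lm(sales ~ podcast, data = train)

# RSE = sqrt(RSS / (n - 2)) for a simple linear model with one predictor
rss <- sum(residuals(model)^2)
rse_manual <- sqrt(rss / (nrow(train) - 2))

all.equal(rse_manual, sigma(model)) # TRUE: sigma() returns this same quantity
```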
The R^2 statistic provides an alternative measure of fit. It represents the proportion of variance explained, so it always takes on a value between 0 and 1 and is independent of the scale of Y, our outcome variable. Like the RSE, R^2 appears near the bottom of the output of summary(model), but we can also extract it directly by calling summary(model)$r.squared. The result below suggests that podcast advertising budget can explain about 64% of the variability in total sales.
summary(model)$r.squared # output [1] 0.6372581
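The "proportion of variance explained" interpretation follows directly from the formula R^2 = 1 - RSS/TSS. A sketch on simulated data (the train data frame and its column names are stand-ins for the lesson's dataset):

```r
# Simulated stand-in for the lesson's training data (column names are assumptions)
set.seed(42)
train <- data.frame(podcast = runif(100, min = 0, max = 50))
train$sales <- 5 + 0.4 * train$podcast + rnorm(100, sd = 3)
model <- lm(sales ~ podcast, data = train)

# R^2 = 1 - RSS/TSS: the proportion of variance in sales explained by the model
rss <- sum(residuals(model)^2)
tss <- sum((train$sales - mean(train$sales))^2)
r_sq_manual <- 1 - rss / tss

all.equal(r_sq_manual, summary(model)$r.squared) # TRUE
```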
Instructions
The code used to produce model, a simple linear model summarizing the relationship between clicks and total_convert, is already included in your notebook.
- Assign the result of sigma(model)/mean(train$total_convert) to avg_rse
- Uncomment the following f-string, then run the file to see how we would contextualize the average RSE of our model
Model fit is often quantified in comparison to other models, then used to determine which variation of a modeled relationship best fits the data. Let’s build a second model so that we can contextualize our fit metrics.
Assign the result of building a simple linear model regressing total_convert on impressions, the total number of times a user views a version of an advertisement, to the variable model_2.
Let’s use a combination of R’s variable selection syntax, the $ character, and summary() to investigate the percent of variability explained by both model and model_2.
- Extract the r-squared measure from model, and save the result to a variable called r_sq.
- Extract the r-squared measure from model_2, and save the result to a variable called r_sq_2.
Print out both r-squared variables. Which model better explains a user’s likelihood of purchasing a product they have been shown an advertisement for?
Uncomment the final f-string, then run the file to see how we would provide a narrative around the R^2 statistic and determine which model better explains user purchase behavior.
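The steps above can be sketched end to end on simulated data (train and its clicks, impressions, and total_convert columns are stand-ins for the lesson's dataset, so the printed values will differ from the notebook's):

```r
# Simulated stand-in for the lesson's advertising data (columns are assumptions)
set.seed(1)
n <- 200
train <- data.frame(
  clicks      = rpois(n, lambda = 20),
  impressions = rpois(n, lambda = 100)
)
train$total_convert <- 0.3 * train$clicks + 0.01 * train$impressions + rnorm(n)

# The two simple linear models compared in the exercise
model   <- lm(total_convert ~ clicks, data = train)
model_2 <- lm(total_convert ~ impressions, data = train)

# RSE as a percentage of the average outcome
avg_rse <- sigma(model) / mean(train$total_convert)

# Proportion of variance explained by each model
r_sq   <- summary(model)$r.squared
r_sq_2 <- summary(model_2)$r.squared

print(avg_rse)
print(r_sq)
print(r_sq_2)
```

Whichever model has the higher R^2 explains more of the variability in total_convert, which is how the final f-string frames the comparison.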