We’ve dug into the results of simple linear regression models and seen how much information they convey about the relationship between two variables. By this point, however, you might be wondering: what if variables other than podcast contribute to the total sales of a product? You might remember that the primary assumption behind simple linear models is that the expected value of the outcome variable is a straight-line function of the predictor variable alone. In other words, our simple linear models treat the predictor variable as the only systematic source of variation in the outcome variable. In the case of our sales dataset, we know this is almost certainly not true; more money is often spent on TV or newspaper ads than on podcasts, so this spending might have an even larger effect than podcast spend.
Thankfully, there are methods to include the effects of additional variables such as newspaper in linear regression models. We can expand our model definition from a simple model with one predictor variable to a multiple model with, you guessed it, multiple predictor variables. The formal definition of multiple linear regression models is a direct extension of the formula for simple linear regression:
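Written out for three predictors, the model can be sketched as follows. The indices X_1 to X_3 and the error term \epsilon are notation assumed here for illustration; the coefficient names follow the Beta_0 to Beta_3 used in this lesson.

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon
```

Setting \beta_2 = \beta_3 = 0 leaves a model with a single predictor, which is why the multiple model is a direct extension of the simple one.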
As in a simple linear model, Y represents the dollar value of products sold, each X represents the amount spent on one type of product ad (podcast ads, for example), and Beta_0 is the model intercept. Now, Beta_1, Beta_2, and Beta_3 each represent the coefficient of one predictor variable. To build a similar model in R, using the standard lm() function, we still use the formula notation of Y ~ X, adding each further predictor to the right-hand side with +:
model <- lm(sales ~ TV + podcast, data = train)
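Because the train data is not reproduced here, the following sketch uses a small simulated stand-in to show how the call changes between a simple and a multiple model. The data frame train_sim, its values, and the coefficients used to generate sales are all made up for illustration and are not the lesson's dataset.

```r
# Simulated stand-in for the lesson's `train` data; all values are hypothetical.
set.seed(42)
n <- 200
train_sim <- data.frame(
  TV      = runif(n, min = 0, max = 100),
  podcast = runif(n, min = 0, max = 50)
)

# Assumed data-generating process, chosen only for illustration.
train_sim$sales <- 5 + 3 * train_sim$TV + 1 * train_sim$podcast + rnorm(n, sd = 10)

# Simple model: one predictor on the right-hand side of the formula.
simple_model <- lm(sales ~ podcast, data = train_sim)

# Multiple model: additional predictors are joined with `+`.
multiple_model <- lm(sales ~ TV + podcast, data = train_sim)

# Each model has an intercept plus one coefficient per predictor.
coef(simple_model)
coef(multiple_model)
```

The only change between the two calls is the right-hand side of the formula; everything else about lm() stays the same.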
While building a multiple regression model is a straightforward extension of the code used to build a simple model, and the output of the model results below looks quite similar, a bit more effort goes into interpreting the results of this model. Remember that in a simple linear regression model, the regression coefficient represents the expected change in the dependent variable for a one-unit increase in the independent variable. In other words, the coefficient for podcast represents the expected increase in sales given a one-dollar increase in podcast advertising spend. Because multiple linear regression includes more than one predictor variable, the coefficient estimates must be interpreted differently. In multiple linear regression, the regression coefficient represents the expected change in the dependent variable for a one-unit increase in the independent variable, holding all other variables in the model constant. The output of the multiple linear regression model is shown below:
summary(model)
#output
Call:
lm(formula = sales ~ TV + podcast, data = train)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.583386   1.024616   4.473 1.65e-05 ***
TV          3.006340   1.004924   7.380 1.62e-11 ***
podcast     1.049249   1.027665   5.395 3.10e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
For example, a call to summary(model) shows that the coefficient for podcast is equal to 1.049249. This means that, when one more dollar is spent on podcast advertising, we expect about 1.049 more dollars of the related product to be sold, given that the amount of money spent on TV advertisements does not change. In this way, multiple linear regression models allow us to isolate the unique effect of one predictor variable on the outcome variable.
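One way to see what "holding the other variable constant" means is to compare two predictions that differ only in podcast spend. The sketch below again uses simulated data (the data frame ads, the fitted object fit, and the spend values 40, 10, and 11 are hypothetical choices for illustration, not values from the lesson's dataset).

```r
# Simulated advertising data; all values are hypothetical.
set.seed(1)
n <- 200
ads <- data.frame(
  TV      = runif(n, min = 0, max = 100),
  podcast = runif(n, min = 0, max = 50)
)
ads$sales <- 5 + 3 * ads$TV + 1 * ads$podcast + rnorm(n, sd = 10)

fit <- lm(sales ~ TV + podcast, data = ads)

# Two scenarios: TV spend is identical, podcast spend differs by one dollar.
scenarios <- data.frame(TV = c(40, 40), podcast = c(10, 11))
predicted <- predict(fit, newdata = scenarios)

# The gap between the two predictions matches the podcast coefficient
# (up to floating-point rounding).
diff_in_sales <- unname(predicted[2] - predicted[1])
podcast_coef  <- unname(coef(fit)["podcast"])
all.equal(diff_in_sales, podcast_coef)
```

Changing the fixed TV value from 40 to any other number leaves the gap unchanged, which is what the coefficient's interpretation describes in a model without interaction terms.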
As this example shows, the selection of variables in a regression model can have wide-ranging impacts on the results and interpretation of our models! Let’s dive into one more exercise to practice building and interpreting multiple linear regression models.