Learn
Linear Regression in R

Ready for the real fun? We’ve done our due diligence and confirmed that our data fulfills the assumptions of simple linear regression models; we’ve split our data into test and training subsets, and properly built a model using y ~ x + b notation; we’ve even taken the time to assess the fit of our model using both quantitative and qualitative approaches. Now we can finally analyze the results of our model and discover the relationship between user advertisement clicks and the purchase rate of related products!

We can view the results of a linear regression model in R by calling summary() on the model variable to which we saved the results of our call to lm(). The summary() function will print out a lot of information about our model–– but don’t be overwhelmed! There are four primary sections of quantitative results that are crucial to interpreting regression models:

Call

This section simple displays the call to lm() which created these model results. It’s a helpful reminder of which version of a outcome-predictor pair is currently under analysis.

Residuals

As covered in our earlier exercises, a residual is the difference between the value of an outcome variable predicted by the model and the actual observed value of the variable. The summary() output displays a set of numbers that summarize the distribution of residuals in our model, including minimum/maximum residual values, the values first/third quantiles, and the median residual value for the model. We’ve already analyzed our residual values by creating a plot in an earlier exercise, but these summary values are a helpful reminder of the overall spread of our model errors.

Coefficients

Estimate

Coefficients are most important results in the interpretation of regression models. The number you see in the Estimate column, (a value of 0.048939 for clicks) is called a regression coefficient. Looking back to formal definition of a linear regression model:

$Y = beta_0 + beta_1*X + error$

The regression coefficient is represented by the beta_1 variable. This linear regression equation tells us that the regression coefficient represents the expected change in the dependent variable (in our case total_convert) for a one-unit increase in the independent variable (clicks). In other words, for every additional click on an advertisement, the expected sales of a related product are estimated to increase by 0.049 dollars. In addition to the size of the coefficient, it is also important to note the sign of the coefficient. If our clicks coefficient was negative, our model would be estimating that the sales of a product actually decreases every time an advertisement is clicked.

Std. Error

The column adjacent to Estimate is called Std. Error; the standard error of each coefficient is the estimate of the standard deviation of the coefficient. It is crucial to note that the standard error is not a quantity of interest by itself, but depends on the value of our regression coefficient

T-value and Pr(>|t|)

The t value and Pr(>|t|) inherently answer the same question–– given the value of our variable’s regression coefficient and its’ standard error, does the variable explain a significant part of the change in our outcome variable? However, the Pr(>|t|) column purposely provides a more concise response to this question, using the asterisk notation that corresponds with the Signif. codes legend at the bottom of the Coefficients results section. In R model output, one asterisk means “p < .05”. Two asterisks mean “p < .01”; and three asterisks mean “p < .001”. These values are referred to as p-values in scientific literature. How can we use p-values to answer our question around model significance?

Asterisks in a regression table indicate the level of the statistical significance of a regression coefficient. Our understanding of statistical significance is based off of the idea of a random sample. When interpreting these asterisk values, we ask ourselves: if there truly is no relationship between clicks on an advertisement and product sales, then what are chances that, across many user clicks on an ad, we see behavior that suggests that there is no relationship?

For our clicks variable, with *** in the Pr(>|t|) column, the answer is very unlikely. The value of ***, or p < .001, means that random sample resulting in the regression coefficient and standard error that we observed for clicks-given that there was truly no difference relationship between clicks and product purchase—would occur in less than one time in a random draw of 100, on average. Given that we would so rarely observe the situation that suggests that there is no relationship between our click and total_convert, we can say that there is a statistically significant relationship between the two variables. Generally speaking, scientists accept that a variable coefficient with p-value less than 0.05 is statistically significant.

Measures of Model Fit

At the bottom of the output of summary() are a series of labeled metrics, like Residual Standard Error (RSE) and Multiple R-squared, which quantify the fit of our model. Our previous exercises have covered how to interpret and plot many of these measures, but it’s a helpful reminder for them to be summarized along with other model output.

Wow! There is so much information conveyed within a simple call to our model results. Continue on to the second half of this lesson to practice the interpretation of model results, and learn more about how to select a best fit model.