When doing any type of statistical analysis, we should always keep the assumptions in mind. Multiple linear regression requires some of the same assumptions as simple linear regression:
- Linear functional form, which can be assessed by plotting the outcome variable against each predictor variable and looking for a linear relationship
- Normality, which can be assessed by plotting a histogram of the residuals and looking for an approximately normal distribution
- Homoscedasticity, which can be assessed by plotting residuals against fitted values and confirming that there is no clear pattern
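The three checks above can be sketched in code. The data below is simulated purely for illustration (it is not the lesson's dataset), and the model is a simple one-predictor regression fit with numpy:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data with a roughly linear relationship
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 * x + rng.normal(0, 1, 200)

# Fit a simple linear regression and compute residuals
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# 1. Linear functional form: outcome vs. predictor
axes[0].scatter(x, y, s=10)
axes[0].set_title('Outcome vs. predictor')

# 2. Normality: histogram of residuals
axes[1].hist(residuals, bins=20)
axes[1].set_title('Residual histogram')

# 3. Homoscedasticity: residuals vs. fitted values
axes[2].scatter(fitted, residuals, s=10)
axes[2].axhline(0, color='black')
axes[2].set_title('Residuals vs. fitted')

plt.show()
```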
In addition, we have to check that the predictors are not linearly related to each other. When they are, the model suffers from multicollinearity, which can lead to misleading results.
We can detect multicollinearity by checking the correlations between pairs of variables in our data. A correlation close to 1 or -1 suggests that two variables are too closely related for both to be included in a model. The following code calculates the pairwise correlations for a dataset df and saves them as corr_grid:

corr_grid = df.corr()
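As a sketch of this pair-checking step, the snippet below builds a small hypothetical DataFrame (the columns here are made up, not the lesson's data) and lists the pairs whose correlation magnitude exceeds a threshold:

```python
import numpy as np
import pandas as pd

# Hypothetical data: bedrooms is (by construction) collinear with rooms
df = pd.DataFrame({
    'rooms':    [3, 4, 2, 5, 3],
    'bedrooms': [2, 3, 1, 4, 2],
    'price':    [200, 260, 150, 330, 210],
})

corr_grid = df.corr()

# Keep only the upper triangle so each pair appears once, then
# flag pairs whose absolute correlation exceeds the threshold
threshold = 0.9
pairs = corr_grid.where(
    np.triu(np.ones(corr_grid.shape, dtype=bool), k=1)
).stack()
strong = pairs[pairs.abs() > threshold]
print(strong)
```

Pairs that show up in `strong` are candidates for dropping one of the two variables before fitting the model.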
For easy visual detection, we can use the heatmap() function from Python's seaborn library to create a heat map of the correlations between quantitative variables in a dataset. The code to produce a heat map from corr_grid is shown below.

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(corr_grid,
            xticklabels=corr_grid.columns,
            yticklabels=corr_grid.columns,
            vmin=-1, center=0, vmax=1,
            cmap='PuOr', annot=True)
plt.show()
The resulting heat map is particularly dark purple (near 1) for the two rooms variables, indicating a strong linear relationship (corr = 0.95). If we were running a multiple regression to predict price, we might decide to keep only one of those two variables in order to avoid multicollinearity.
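Acting on that decision might look like the sketch below. The column names and data are hypothetical, and numpy's least squares stands in for whatever regression library you actually use; the point is simply that only one of the two collinear predictors enters the design matrix:

```python
import numpy as np

# Hypothetical data: bedrooms is nearly collinear with rooms
rng = np.random.default_rng(1)
n = 100
rooms = rng.integers(1, 8, n).astype(float)
bedrooms = rooms - 1 + rng.normal(0, 0.1, n)
price = 50 * rooms + rng.normal(0, 5, n)

# Design matrix with an intercept and only ONE of the collinear
# predictors (bedrooms is deliberately left out)
X = np.column_stack([np.ones(n), rooms])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
print(coef)  # intercept and slope for rooms
```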
The student dataset has been loaded for you in script.py. Get the correlations for pairs of quantitative variables in the student dataset and save them. Then add code to create a heat map of those correlations and inspect it. Why might the pairs with correlations near 1 or -1 be so similar to each other?