There are a number of assumptions of simple linear regression, which are important to check if you are fitting a linear model. The first assumption is that the relationship between the outcome variable and predictor is linear (can be described by a line). We can check this before fitting the regression by simply looking at a plot of the two variables.
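The linearity check described above can be sketched as a scatter plot. This is a minimal sketch: the data values are invented for illustration (they are not the lesson's dataset), and matplotlib is an assumed plotting library that the text does not name.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, invented for this sketch (not the lesson's dataset)
body_measurements = pd.DataFrame({
    "height": [160, 165, 170, 175, 180, 185],
    "weight": [55, 62, 63, 70, 76, 79],
})

# Plot the outcome (weight) against the predictor (height)
fig, ax = plt.subplots()
ax.scatter(body_measurements["height"], body_measurements["weight"])
ax.set_xlabel("height")
ax.set_ylabel("weight")
ax.set_title("weight vs. height")

# In an interactive session, call plt.show() to display the figure.
```

If the points roughly follow a straight line, the linearity assumption looks reasonable; a clear curve suggests a line is the wrong shape for the relationship.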
The next two assumptions (normality and homoscedasticity) are easier to check after fitting the regression. We will learn more about these assumptions in the following exercises, but first, we need to calculate two things: fitted values and residuals.
Again consider our regression model to predict weight based on height (model formula 'weight ~ height'). The fitted values are the predicted weights for each person in the dataset that was used to fit the model, while the residuals are the differences between the true weight and the predicted weight for each person.
We can calculate the fitted values using .predict() by passing in the original data. The result is a pandas Series containing predicted values for each person in the original dataset:
fitted_values = results.predict(body_measurements)
print(fitted_values.head())
0    66.673077
1    59.100962
2    71.721154
3    70.711538
4    65.158654
dtype: float64
The residuals are the differences between each of these fitted values and the true values of the outcome variable. They can be calculated by subtracting the fitted values from the actual values. We can perform this element-wise subtraction in Python by simply subtracting one pandas Series from the other, as shown below:
residuals = body_measurements.weight - fitted_values
print(residuals.head())
0   -2.673077
1   -1.100962
2    3.278846
3   -3.711538
4    2.841346
dtype: float64
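As a quick check of the arithmetic, the sketch below redoes the subtraction using the five fitted values printed earlier. The actual weights are not shown in the text; the values used here are inferred by adding each printed residual to its fitted value, so treat them as an assumption.

```python
import pandas as pd

# Fitted values copied from the printed output above
fitted_values = pd.Series([66.673077, 59.100962, 71.721154, 70.711538, 65.158654])

# Actual weights inferred as fitted value + residual (an assumption, not given in the text)
actual_weight = pd.Series([64.0, 58.0, 75.0, 67.0, 68.0])

# Element-wise subtraction: pandas pairs up values that share an index label
residuals = actual_weight - fitted_values
print(residuals.round(6))

# A negative residual means the fitted value is above the actual weight
over_predicted = residuals < 0
print(over_predicted.tolist())
```

Because the residual is defined as actual minus fitted, its sign tells you on which side of the regression line each observation falls.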
script.py already contains the code to fit a model on the students dataset that predicts test score using hours_studied as a predictor. Calculate the fitted values for this model and save them as
Calculate the residuals for this model and save the result as
Print out the first 5 values in residuals and inspect them. Can you make sense of these numbers? What is the difference between a positive and negative residual?