Now that we’ve seen what a regression model with a binary predictor looks like visually, we can fit the model using statsmodels.api.OLS.from_formula(), the same way we did for a quantitative predictor:
import statsmodels.api as sm
model = sm.OLS.from_formula('height ~ play_bball', data)
results = model.fit()
print(results.params)
Intercept     169.016
play_bball     14.628
dtype: float64
Note that this will work if the play_bball variable is coded with 0s and 1s, but it will also work if it is coded with Trues and Falses, or even if it is coded with strings like 'yes' and 'no' (in this case, the coefficient label will look something like play_bball[T.yes] in the params output, indicating that 'yes' corresponds to a 1).
To interpret this output, we first need to remember that the intercept is the expected value of the outcome variable when the predictor is equal to zero. In this case, the intercept is therefore the mean height of non-basketball players.
The slope is the expected difference in the outcome variable for a one-unit difference in the predictor variable. In this case, a one-unit difference in play_bball is the difference between not being a basketball player and being a basketball player. Therefore, the slope is the difference in mean heights for basketball players and non-basketball players.
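Applied to the output above, this interpretation implies a mean height of 169.016 + 14.628 = 183.644 for basketball players. The sketch below checks the same relationship on a small made-up dataset, using only the standard library. It computes the least-squares intercept and slope from the usual closed-form formulas for simple linear regression and compares them with the group means. The data values and variable names are illustrative assumptions.

```python
from statistics import mean

# Hypothetical toy data: heights in cm, play_bball coded 0 (no) / 1 (yes)
heights = [165.0, 170.0, 172.0, 180.0, 185.0, 190.0]
play_bball = [0, 0, 0, 1, 1, 1]

x_bar, y_bar = mean(play_bball), mean(heights)

# Closed-form least-squares estimates for one predictor:
# slope = S_xy / S_xx, intercept = y_bar - slope * x_bar
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(play_bball, heights))
s_xx = sum((x - x_bar) ** 2 for x in play_bball)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Group means computed directly
mean_no = mean(h for h, p in zip(heights, play_bball) if p == 0)
mean_yes = mean(h for h, p in zip(heights, play_bball) if p == 1)

print("intercept:", intercept, "| mean of non-players:", mean_no)
print("slope:", slope, "| difference in means:", mean_yes - mean_no)
```

For this toy data the intercept matches the mean height of the non-players, and the slope matches the difference between the two group means.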
The students dataset has been loaded for you in script.py. Create and fit a regression model of score predicted by whether the student ate breakfast, using sm.OLS.from_formula(), and print out the coefficients.
Code has been provided for you in script.py to calculate the mean test score for students who ate breakfast (saved as mean_score_breakfast) and the mean score for students who did not eat breakfast (saved as mean_score_no_breakfast). Calculate and print the difference in mean scores. Can you find how this number relates to the regression output?