
Now that we’ve seen what a regression model with a binary predictor looks like visually, we can actually fit the model using `statsmodels.api.OLS.from_formula()`, the same way we did for a quantitative predictor:

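```python
import statsmodels.api as sm

# Regress height on the binary predictor play_bball (data is the lesson's DataFrame)
model = sm.OLS.from_formula('height ~ play_bball', data)
results = model.fit()
print(results.params)
```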

Output:

```
Intercept     169.016
play_bball     14.628
dtype: float64
```

Note that this works whether the `play_bball` variable is coded with `0`s and `1`s, with `True` and `False`, or with strings like `'yes'` and `'no'`. With non-numeric coding, the coefficient label in the `params` output changes: for strings it will look something like `play_bball[T.yes]`, indicating that `'yes'` corresponds to a `1`, and for booleans something like `play_bball[T.True]`.

To interpret this output, we first need to remember that the intercept is the expected value of the outcome variable when the predictor is equal to zero. In this case, the intercept is therefore the mean height of non-basketball players.

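The slope is the expected difference in the outcome variable for a one-unit difference in the predictor variable. In this case, a one-unit difference in `play_bball` is the difference between not being a basketball player (`0`) and being a basketball player (`1`). Therefore, the slope is the difference in mean heights between basketball players and non-basketball players. In the output above, the mean height of non-players is 169.016 and players are 14.628 taller on average, so the mean height of basketball players is 169.016 + 14.628 = 183.644.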

### Instructions

1. The `students` dataset has been loaded for you in script.py. Create and fit a regression model of `score` predicted by `breakfast` using `sm.OLS.from_formula()` and print out the coefficients.

2. Code has been provided for you in script.py to calculate the mean test score for students who ate breakfast (saved as `mean_score_breakfast`) and the mean score for students who did not eat breakfast (saved as `mean_score_no_breakfast`). Calculate and print the difference in mean scores. Can you see how this number relates to the regression output?