In the last exercise, we explored how a coefficient may change when additional predictors are added to a model. When adding predictors entirely reverses the sign of the coefficient, this is called Simpson’s Paradox: one model says the direction of the relationship is positive, while the other says it is negative.

Let’s look at a simple example of this with the fictional dataset `tv`

. We’ll start by fitting a simple regression predicting weekly hours of television watched (`tv_hours`

) from the number of books in the home (`books`

).

import statsmodels.api as sm model0 = sm.OLS.from_formula('tv_hours ~ books', data=tv).fit() print(model0.params) # Output: # Intercept 12.219967 # books 0.089078

The coefficient on `books`

is positive: the model indicates that more books is associated with more weekly hours of television. Now, let’s add a categorical variable to the model that specifies type of home (`hometype`

): rented room, one-bedroom apartment, two-bedroom apartment, or house.

import statsmodels.api as sm model1 = sm.OLS.from_formula('tv_hours ~ books + hometype', data=tv).fit() print(model1.params) # Output # Intercept 22.952692 # hometype[T.2BR] 11.754848 # hometype[T.house] 5.878007 # hometype[T.room] -5.880337 # books -0.093223

This model leads to the opposite conclusion: the negative coefficient on `books`

indicates that more books is associated with FEWER hours of television watched. How is this possible?

Let’s look at a scatter plot of `tv_hours`

(y-axis) against `books`

(x-axis) colored by `hometype`

. This time we’ll allow `lmplot`

to fit regression lines for each type of home by NOT setting `fit_reg`

to `False`

. We’ll also add the regression line for `model0`

, which only has `books`

as a predictor.

import seaborn as sns import matplotlib.pyplot as plt # Scatter plot sns.lmplot(x='books', y='tv_hours', hue='hometype', palette='colorblind', markers=['o', 'x', 'v','s'], ci=False, legend=False, data=tv) # Black line plt.plot(tv.books, model0.params[0] + model0.params[1]*tv.books, color='black', linewidth=3, label='Only Books') plt.legend() plt.show()

Broadly, the points slope upward (black line), but the points slope downward within each `hometype`

group (colored lines).

This could be because bigger homes mean more room for a larger book collection. Or maybe certain home types are associated with higher incomes, so people have more time and money for television. We can examine the pattern of association from the model, and possibly make good predictions from new data with it, but we can’t find out the cause from this type of analysis.

### Instructions

**1.**

The dataset `sat`

has been loaded for you in **script.py**. This dataset examines U.S. school expenditures and SAT scores by states from the 1994-1995 school year and includes the following variables:

`total`

, average total of math and verbal scores on the SAT`expend`

, average expenditure per pupil in thousands of dollars`takingR`

, percentage of eligible students taking the SAT (we rounded this to the nearest 5 to simplify plotting)

In **script.py** we’ve provided code to fit a linear regression model of average total SAT score predicted from average expenditure per pupil. The fitted model is saved as `model0`

. Print the regression coefficients from `model0`

and inspect them.

What is the relationship between SAT score and expenditure per pupil? Is this what you expected?

**2.**

Now fit the same model but with `takingR`

as an additional predictor and save the fitted model as `model1`

.

**3.**

Print the coefficients of `model1`

using `.params`

and inspect the coefficient on `expend`

. How does it differ from the coefficient in `model0`

? What does this model say about the relationship between average expenditure and SAT scores?

**4.**

We have provided code in **script.py** to generate a scatter plot with `expend`

on the x-axis and `total`

on the y-axis. The points are colored by `takingR`

, and we’ve plotted regression lines for each value of `takingR`

.

Add a line to the plot that shows the regression line for `model0`

in black. Be sure to call the model coefficients directly from the output of step 1.

Not taking into account the percentage of students taking the test, how does the broad pattern of points reflect the results of `model0`

? How do the regression lines within each group reflect the results of `model1`

?