In the last exercise, we explored how a coefficient may change when additional predictors are added to a model. When adding predictors entirely reverses the sign of the coefficient, this is called Simpson’s Paradox: one model says the direction of the relationship is positive, while the other says it is negative.
Let’s look at a simple example of this with the fictional dataset
tv. We’ll start by fitting a simple regression predicting weekly hours of television watched (
tv_hours) from the number of books in the home (
import statsmodels.api as sm model0 = sm.OLS.from_formula('tv_hours ~ books', data=tv).fit() print(model0.params) # Output: # Intercept 12.219967 # books 0.089078
The coefficient on
books is positive: the model indicates that more books is associated with more weekly hours of television. Now, let’s add a categorical variable to the model that specifies type of home (
hometype): rented room, one-bedroom apartment, two-bedroom apartment, or house.
import statsmodels.api as sm model1 = sm.OLS.from_formula('tv_hours ~ books + hometype', data=tv).fit() print(model1.params) # Output # Intercept 22.952692 # hometype[T.2BR] 11.754848 # hometype[T.house] 5.878007 # hometype[T.room] -5.880337 # books -0.093223
This model leads to the opposite conclusion: the negative coefficient on
books indicates that more books is associated with FEWER hours of television watched. How is this possible?
Let’s look at a scatter plot of
tv_hours (y-axis) against
books (x-axis) colored by
hometype. This time we’ll allow
lmplot to fit regression lines for each type of home by NOT setting
False. We’ll also add the regression line for
model0, which only has
books as a predictor.
import seaborn as sns import matplotlib.pyplot as plt # Scatter plot sns.lmplot(x='books', y='tv_hours', hue='hometype', palette='colorblind', markers=['o', 'x', 'v','s'], ci=False, legend=False, data=tv) # Black line plt.plot(tv.books, model0.params + model0.params*tv.books, color='black', linewidth=3, label='Only Books') plt.legend() plt.show()
Broadly, the points slope upward (black line), but the points slope downward within each
hometype group (colored lines).
This could be because bigger homes mean more room for a larger book collection. Or maybe certain home types are associated with higher incomes, so people have more time and money for television. We can examine the pattern of association from the model, and possibly make good predictions from new data with it, but we can’t find out the cause from this type of analysis.
sat has been loaded for you in script.py. This dataset examines U.S. school expenditures and SAT scores by states from the 1994-1995 school year and includes the following variables:
total, average total of math and verbal scores on the SAT
expend, average expenditure per pupil in thousands of dollars
takingR, percentage of eligible students taking the SAT (we rounded this to the nearest 5 to simplify plotting)
In script.py we’ve provided code to fit a linear regression model of average total SAT score predicted from average expenditure per pupil. The fitted model is saved as
model0. Print the regression coefficients from
model0 and inspect them.
What is the relationship between SAT score and expenditure per pupil? Is this what you expected?
Now fit the same model but with
takingR as an additional predictor and save the fitted model as
Print the coefficients of
.params and inspect the coefficient on
expend. How does it differ from the coefficient in
model0? What does this model say about the relationship between average expenditure and SAT scores?
We have provided code in script.py to generate a scatter plot with
expend on the x-axis and
total on the y-axis. The points are colored by
takingR, and we’ve plotted regression lines for each value of
Add a line to the plot that shows the regression line for
model0 in black. Be sure to call the model coefficients directly from the output of step 1.
Not taking into account the percentage of students taking the test, how does the broad pattern of points reflect the results of
model0? How do the regression lines within each group reflect the results of