Sometimes we use regression to understand the relationship between two variables while controlling for potential confounders. For example, using the `survey` dataset, we may be primarily interested in how studying (`hours_studied`) is related to test score (`score`); however, to understand this relationship, we may want to control for additional student attributes, such as whether the student ate breakfast (`breakfast`).
If we perform a simple linear regression predicting `score` from `hours_studied`, we get the following results:
```python
import statsmodels.api as sm

# Simple linear regression: score predicted from hours_studied only
model0 = sm.OLS.from_formula('score ~ hours_studied', data=survey).fit()
print(model0.params)

# Output:
# Intercept        34.990700
# hours_studied    11.881045
```
However, if we add `breakfast` to the model and inspect the new coefficients, we’ll find that the intercept and the slope on `hours_studied` have changed:
```python
import statsmodels.api as sm

# Multiple regression: breakfast added as a second predictor
model1 = sm.OLS.from_formula('score ~ hours_studied + breakfast', data=survey).fit()
print(model1.params)

# Output:
# Intercept        32.665570
# hours_studied     8.540499
# breakfast        22.495615
```
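It can help to read the two sets of coefficients as fitted equations. The following restates the printed output above, rounded to two decimal places; the hat notation for the predicted score is our own shorthand.

```latex
\begin{align*}
\widehat{\text{score}} &= 34.99 + 11.88 \cdot \text{hours\_studied} \\
\widehat{\text{score}} &= 32.67 + 8.54 \cdot \text{hours\_studied} + 22.50 \cdot \text{breakfast}
\end{align*}
```

In the second equation, the slope on `hours_studied` describes the difference in predicted score per additional hour of studying among students with the same value of `breakfast`.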
Note that the coefficient on `hours_studied` changes from 11.9 to 8.5. Why does this happen? Perhaps people who eat breakfast are more likely to study longer and also more likely to score better on their exam. Without taking `breakfast` into account, some of the relationship between `score` and `breakfast` is attributed to `hours_studied` instead.
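To see this effect in isolation, we can simulate it. The sketch below is only an illustration: it does not use the survey data, and the data-generating numbers (a true slope of 8 on hours studied, a true breakfast effect of 20, and breakfast eaters studying about 1.5 hours longer) are assumptions chosen for the example. It uses plain Python with closed-form least-squares formulas so that it runs without any installed packages; the helper functions `cov`, `simple_slope`, and `multiple_slopes` are hypothetical names written for this sketch.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def simple_slope(x, y):
    """Least-squares slope from regressing y on x alone (with intercept)."""
    return cov(x, y) / cov(x, x)


def multiple_slopes(x1, x2, y):
    """Least-squares slopes on x1 and x2 from regressing y on both (with intercept)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


random.seed(0)
n = 5000

# Assumed (synthetic) data: breakfast eaters study longer AND score higher.
breakfast = [1 if random.random() < 0.5 else 0 for _ in range(n)]
hours_studied = [1.0 + 1.5 * b + random.gauss(0, 1) for b in breakfast]
score = [30 + 8 * h + 20 * b + random.gauss(0, 5)
         for h, b in zip(hours_studied, breakfast)]

# Slope on hours_studied when breakfast is left out of the model
b_simple = simple_slope(hours_studied, score)

# Slopes when both predictors are included
b_hours, b_breakfast = multiple_slopes(hours_studied, breakfast, score)

# How strongly breakfast tracks hours_studied
delta = simple_slope(hours_studied, breakfast)

print(f"slope on hours_studied, breakfast omitted:  {b_simple:.2f}")
print(f"slope on hours_studied, breakfast included: {b_hours:.2f}")
print(f"slope on breakfast:                         {b_breakfast:.2f}")
print(f"included slope + breakfast slope * delta:   {b_hours + b_breakfast * delta:.2f}")
```

The last line printed matches the first: for least-squares fits, the slope from the smaller model equals the slope from the larger model plus the breakfast coefficient multiplied by how strongly breakfast is associated with hours studied. That extra term is the part of the breakfast relationship that gets attributed to studying when `breakfast` is left out.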
Instructions
The `student` dataset has been loaded for you in script.py. Run a regression model predicting final Portuguese score (`port3`) from first math score (`math1`). Save the fitted model as `simple`.
Now fit a multiple regression model predicting final Portuguese score (`port3`) from first math score (`math1`) AND first Portuguese score (`port1`). Save the fitted model as `multiple`.
Print the resulting intercept and coefficients from the `simple` model using `.params`. What is the apparent relationship between final Portuguese score and first math score?
Print the resulting coefficients from the `multiple` model using `.params`. How did the coefficient on `math1` change when `port1` was added to the model?