At the beginning of this lesson, we mentioned that the relationship between variables often does not appear to be a straight line and may instead be somewhat curved. In this case, we may produce a better fitting regression model by adding a polynomial term that raises our predictor variable to a higher exponent to better account for the curving.
In the Python library
statsmodels.api, polynomial terms can be added to a multiple regression model formula by adding a term with the predictor of interest raised to a higher power. This may be done using the NumPy function
np.power and specifying the predictor name and degree.
For example, the plot above suggests a curvilinear relationship between happiness (
happy) and hours slept on average (
sleep). We can add a quadratic (squared) term for the variable
sleep by doing the following:
import statsmodels.api as sm import numpy as np modelP = sm.OLS.from_formula('happy ~ sleep + np.power(sleep,2)', data=happiness).fit() print(modelP.params) # Output: # Intercept -0.058995 # sleep 1.320429 # np.power(sleep, 2) -0.061827
This creates a second predictor term in the model with an additional coefficient. Correspondingly, the new term shows up in our model equation as
happy = -.06 + 1.32*sleep - .06*sleep2
We can check happiness scores by substituting in different values of sleep.
For 2 hours of sleep:
happy = -.06 + 1.32*2 - .06*22 = 2.34
For 10 hours of sleep:
happy = -.06 + 1.32*10 - .06*102 = 7.14
For 14 hours of sleep:
happy = -.06 + 1.32*14 - .06*142 = 6.66
Our curved model picks up on the pattern that beyond about 10 hours of sleep, more sleep is not associated with greater happiness. Perhaps the people who sleep a lot are ill so they need more sleep and are less happy, or perhaps sleeping too much causes problems in other parts of their lives that takes away from their happiness. We can’t know the exact cause from our model, just the association.
Note that we can use Python and our model results to perform the above computations for us. Below is code to compute
sleep is 10 and 14. Since we use exact rather than rounded numbers here, the results are slightly different but more accurate than our original work.
print(modelP.params + modelP.params*10 + modelP.params*np.power(10,2)) # Output: # 6.9626120786831445 print(modelP.params + modelP.params*14 + modelP.params*np.power(14,2)) # Output: # 6.308953414816589
plants dataset has been loaded for you in script.py. Run a regression predicting number of dead leaves (
dead) from amount of light (
light) with an additional squared term for
light. Save the results as
Print the coefficients of
model. Are you surprised that the terms with
light in them have differing signs?
Use the model coefficients to find the predicted number of dead leaves for
light = 10. Instead of using rounded versions of the coefficients from the model output, call the coefficients directly using the index numbers of
model.params and save this computation in script.py as
numDead10. Repeat this process for when
light = 18 and save the computation as
Here is the code to do this for when
light = 5 and save the computation as
numDead5 = model.params + model.params*5 + model.params*np.power(5,2)
We can print these values to inspect the pattern of dead leaves as light increases. Here’s the code to print
numDead18. Based on these three predicted values of
dead, what does the pattern of outcomes say about what might be the best amount of light to keep leaves from dying?