Learn

In the previous exercises, we compared nested models based on adjusted R-squared.

Another way to compare nested models is by using a hypothesis test called an F-test. Suppose we want to compare the following two models:

``````model1 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft', data=rentals).fit()

model2 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + building_age_yrs + has_elevator', data=rentals).fit()``````

Note that the second model has two more predictors than the first (`building_age_yrs` and `has_elevator`). For an F-test comparing these two models:

• The null hypothesis is that the coefficients on `building_age_yrs` and `has_elevator` are equal to zero (they are not useful in explaining the observed variation in rent).
• The alternative hypothesis is that least one of the coefficients is non-zero.

We can run the test in Python as follows:

``````from statsmodels.stats.anova import anova_lm
anova_results = anova_lm(model1, model2)
print(anova_results)``````

Output:

df_resid ssr df_diff ss_diff F Pr(>F)
0 4997.0 1.4e+10 0.0 NaN NaN NaN
1 4995.0 1.4e+10 2.0 9.2e+08 170.9 1.6e-72

The p-value (`1.6e-72`, which is equal to .00000..[72 total zeros]..16) is located in the bottom right corner of this table. The column name `Pr(>F)` means “the probability of observing an F statistic greater than observed (170.9) if the null hypothesis is true”.

Using a significance threshold of 0.05, the p-value is below the threshold. Therefore, we would conclude that either (or both) of the coefficients on `building_age_yrs` and `has_elevator` is non-zero. Thus, including at least one of these two predictors significantly improves the model.

This would lead us to choose `model2` over `model1`. After running this test, we might also want to separately compare a model with `building_age_yrs` and a model with `has_elevator` to see whether both are necessary. We could do this with separate F-tests or adjusted R-squared.

### Instructions

1.

Using the `bikes` data, fit a model to predict the number of bike rentals (`cnt`) with the following predictors: `temp`, `hum`, and `windspeed`. Save the fitted model as `model1`.

2.

Now fit a second model with all the same predictors plus the day of the week (`weekday`) and “feels like” temperature (`atemp`). Save this fitted model as `model2`.

3.

Run an F-test to compare the two models and print out the results.

NOTE: you may want to round the results using `.round(2)` so that they fit in the output terminal. Otherwise, you’ll need to scroll to the right in the output table to see the p-value.

4.

Based on the F-test and a significance threshold of 0.05, which model would you choose? Indicate your answer by setting a variable named `which_model` equal to `1` if you would choose `model1` and equal to `2` if you would choose `model2`.