
Let’s again suppose that we want to use the StreetEasy data to predict rental prices in NYC. We have the following two models that we want to compare:

```python
import statsmodels.api as sm

# `rentals` is assumed to be a pandas DataFrame of the StreetEasy data, loaded earlier

# Fit model 1
model1 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + min_to_subway', data=rentals).fit()

# Fit model 2
model2 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + min_to_subway + borough', data=rentals).fit()

# Print out R-squared for both models
print(model1.rsquared) # Output: 0.664
print(model2.rsquared) # Output: 0.728
```

Note that these models both use `bedrooms`, `size_sqft`, and `min_to_subway` as predictors, but `model2` uses `borough` as well. Because all of the predictors in `model1` are also contained in `model2`, these are called nested models.

It turns out that a larger nested model will never have a lower R-squared than its smaller counterpart, and will almost always have an at least slightly higher one. However, adding a lot of additional predictors can lead to a different issue: overfitting.

To understand the intuition behind why overfitting is problematic, consider a plot of rental prices vs. number of bathrooms. We can perfectly predict each data point if we fit a zig-zagging line that passes through every point.

However, imagine that we collect a new sample of apartments in NYC and record the number of bathrooms in each. Then, suppose we want to use our model to predict rental prices. Even if the overall relationship between bathrooms and rent is the same in our new data, the exact values will be slightly different. Predictions based on the zig-zag line may be less accurate because the model was so heavily influenced by the quirks of the data we originally collected. A straight line through the middle of the points is actually more useful.
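The zig-zag intuition can be sketched with a few lines of plain Python. The numbers below are hypothetical, invented for illustration, and are not taken from the StreetEasy data. The helper names (`fit_line`, `fit_zigzag`, `mse`) are also made up for this sketch. The zig-zag model is only evaluated at the bathroom counts seen when fitting, where a connect-the-dots line simply returns the rent it was fitted on.

```python
# Hypothetical numbers, invented for illustration (not StreetEasy data).
bathrooms = [1, 2, 3, 4, 5]
rent_original = [2900, 3500, 6100, 6600, 8400]  # sample used to fit both models
rent_new = [2200, 4300, 5300, 7500, 8700]       # fresh sample, same overall trend


def fit_line(x, y):
    """Least-squares straight line for one predictor; returns a prediction function."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return lambda xi: intercept + slope * xi


def fit_zigzag(x, y):
    """Connect-the-dots model: at each fitted x it returns exactly the fitted y."""
    lookup = dict(zip(x, y))
    return lambda xi: lookup[xi]


def mse(predict, x, y):
    """Mean squared prediction error of a model on the points (x, y)."""
    return sum((predict(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


line = fit_line(bathrooms, rent_original)
zigzag = fit_zigzag(bathrooms, rent_original)

for name, model in [("straight line", line), ("zig-zag", zigzag)]:
    print(
        name,
        "| error on original sample:", mse(model, bathrooms, rent_original),
        "| error on new sample:", mse(model, bathrooms, rent_new),
    )
```

On these made-up numbers, the zig-zag model has zero error on the original sample but a larger error than the straight line on the new sample, which is the pattern overfitting produces.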

### Instructions

1. Using the `bikes` dataset, fit a model to predict `cnt` (the number of bike rentals) based on the temperature (`temp`), windspeed (`windspeed`), and whether or not it is a holiday (`holiday`). Save the fitted model as `model1`.

2. Now fit a second model with `cnt` as the outcome variable and all the same predictors as in `model1` plus humidity (`hum`). Save the fitted model as `model2`.

3. Now fit a third model with `cnt` as the outcome variable and all the same predictors as in `model2` plus the day of the week (`weekday`). Save the fitted model as `model3`.

4. Print out the R-squared for all three models. Notice that the R-squared increases from `model1` to `model2` and from `model2` to `model3`, even though the increase is very small.

   This means that adding more predictors allows us to explain only a TINY bit more variation in the number of bike rentals. This may mean that we are overfitting the model to the data.
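A small increase in R-squared is weak evidence on its own, because R-squared creeps up even when the added predictors are pure noise. The sketch below illustrates this on synthetic data (not the `bikes` dataset), using NumPy's least-squares solver instead of `statsmodels` to stay self-contained. The helper `r_squared` and all variable names are assumptions made for this example.

```python
import numpy as np


def r_squared(X, y):
    """R-squared of an OLS fit of y on the columns of X (an intercept is added here)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot


# Synthetic data: the outcome depends on one real predictor only.
rng = np.random.default_rng(0)
n = 200
real_predictor = rng.normal(size=n)
outcome = 3.0 * real_predictor + rng.normal(size=n)
noise_predictors = rng.normal(size=(n, 5))  # unrelated to the outcome by construction

# Fit a sequence of nested models, adding one noise predictor at a time.
r2_values = []
for k in range(6):
    X = np.column_stack([real_predictor, noise_predictors[:, :k]])
    r2_values.append(r_squared(X, outcome))
    print(f"real predictor + {k} noise predictors: R-squared = {r2_values[-1]:.4f}")
```

Each model in the loop is nested inside the next, so the printed R-squared values never go down, even though the extra columns carry no information about the outcome.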

How could we decide which model to use? Are these extra predictors helping or hurting the model? Let’s keep exploring this in the next exercise!