Let’s again suppose that we want to use the StreetEasy data to predict rental prices in NYC. We have the following two models that we want to compare:
```python
import statsmodels.api as sm

# Fit model 1
model1 = sm.OLS.from_formula(
    'rent ~ bedrooms + size_sqft + min_to_subway',
    data=rentals).fit()

# Fit model 2 (adds borough as a predictor)
model2 = sm.OLS.from_formula(
    'rent ~ bedrooms + size_sqft + min_to_subway + borough',
    data=rentals).fit()

# Print out R-squared for both models
print(model1.rsquared) # Output: 0.664
print(model2.rsquared) # Output: 0.728
```
Note that these models both use `bedrooms`, `size_sqft`, and `min_to_subway` as predictors, but `model2` uses `borough` as well. Because all of the predictors in `model1` are also contained in `model2`, these are called nested models.
It turns out that the R-squared of a larger nested model is ALWAYS at least as high as that of its smaller counterpart (and, with real data, almost always higher). However, adding a lot of additional predictors can lead to a different issue: overfitting. To understand the intuition behind why overfitting is problematic, consider the following plot of rental prices vs. number of bathrooms. We can perfectly predict each datapoint if we fit the zig-zagging line shown below:
However, imagine that we collect a new sample of apartments in NYC and record the number of bathrooms in each. Then, suppose we want to use our model to predict rental prices. Even if the overall relationship between bathrooms and rent is the same in our new data, the exact values will be slightly different. Predictions based on the zig-zag line may be less accurate because the model was so heavily influenced by the quirks of the data we originally collected. A straight line through the middle of the points is actually more useful.
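To make this intuition concrete, here is a minimal sketch using simulated data (not the StreetEasy dataset): a wiggly, high-degree polynomial plays the role of the zig-zag line, and a straight line plays the role of the simpler model. The flexible model fits the original sample almost perfectly but does worse on a fresh sample drawn from the same underlying relationship.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n=10):
    # Simulated data: rent rises linearly with bathrooms, plus noise
    bathrooms = rng.uniform(1, 4, n)
    rent = 2000 + 1500 * bathrooms + rng.normal(0, 400, n)
    return bathrooms, rent

x_train, y_train = sample()
x_new, y_new = sample()  # a "new sample" of apartments

# Straight line vs. a wiggly degree-9 polynomial (the "zig-zag")
line = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)
wiggle = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# The polynomial wins on the data it was fit to...
print(r_squared(y_train, line(x_train)), r_squared(y_train, wiggle(x_train)))
# ...but the straight line generalizes better to the new sample
print(r_squared(y_new, line(x_new)), r_squared(y_new, wiggle(x_new)))
```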
Instructions
Using the `bikes` dataset, fit a model to predict `cnt` (the number of bike rentals) based on the temperature (`temp`), windspeed (`windspeed`), and whether or not it is a holiday (`holiday`). Save the fitted model as `model1`.
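One possible solution, sketched under the assumption that `bikes` has already been loaded as a pandas DataFrame with the columns named above:

```python
import statsmodels.api as sm

# Fit model 1: predict rentals from temperature, windspeed, and holiday status
model1 = sm.OLS.from_formula('cnt ~ temp + windspeed + holiday', data=bikes).fit()
```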
Now fit a second model with `cnt` as the outcome variable and all the same predictors as in `model1`, plus humidity (`hum`). Save the fitted model as `model2`.
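Building on the sketch above, `model2` simply adds `hum` to the formula:

```python
# Fit model 2: same predictors as model1, plus humidity
model2 = sm.OLS.from_formula('cnt ~ temp + windspeed + holiday + hum', data=bikes).fit()
```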
Now fit a third model with `cnt` as the outcome variable and all the same predictors as in `model2`, plus the day of the week (`weekday`). Save the fitted model as `model3`.
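And `model3` adds `weekday` on top of `model2`'s predictors:

```python
# Fit model 3: same predictors as model2, plus day of the week
model3 = sm.OLS.from_formula('cnt ~ temp + windspeed + holiday + hum + weekday', data=bikes).fit()
```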
Print out the R-squared for all three models. Notice that the R-squared increases from `model1` to `model2` and from `model2` to `model3`, even though the increase is very small.
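Printing the three values side by side makes the (small) increases easy to compare:

```python
# R-squared never decreases as we add predictors to a nested model
print(model1.rsquared)
print(model2.rsquared)
print(model3.rsquared)
```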
In other words, the extra predictors explain only a TINY bit of additional variation in the number of bike rentals. This may be a sign that we are overfitting the model to the data.
How could we decide which model to use? Are these extra predictors helping or hurting the model? Let’s keep exploring this in the next exercise!