Let’s again suppose that we want to use the StreetEasy data to predict rental prices in NYC. We have the following two models that we want to compare:

    import statsmodels.api as sm

    # rentals: the StreetEasy data as a pandas DataFrame

    # Fit model 1
    model1 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + min_to_subway', data=rentals).fit()

    # Fit model 2
    model2 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + min_to_subway + borough', data=rentals).fit()

    # Print out R-squared for both models
    print(model1.rsquared)  # Output: 0.664
    print(model2.rsquared)  # Output: 0.728

Note that both models use bedrooms, size_sqft, and min_to_subway as predictors, but model2 also uses borough. Because all of the predictors in model1 are also contained in model2, these are called nested models.
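One reason to be careful when comparing these two R-squared values: with least squares, adding a predictor to a nested model cannot lower R-squared, even when the new predictor has nothing to do with the outcome. The following is a minimal sketch of that property. It uses plain NumPy and synthetic data rather than statsmodels and the StreetEasy data, and the helper `r_squared` and all of the numbers are assumptions made up for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of a least-squares fit of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = np.sum(residuals ** 2)        # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins (NOT the StreetEasy data)
size_sqft = rng.uniform(300, 2000, n)
rent = 1500 + 2.5 * size_sqft + rng.normal(0, 500, n)
noise = rng.normal(size=n)  # unrelated to rent by construction

# Smaller model: size_sqft only. Larger nested model: size_sqft + noise.
r2_small = r_squared(size_sqft.reshape(-1, 1), rent)
r2_large = r_squared(np.column_stack([size_sqft, noise]), rent)

print(r2_small)
print(r2_large)
```

The second value should come out at least as large as the first, although `noise` carries no real information about `rent`.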

It turns out that a larger nested model will never have a lower R-squared than its smaller counterpart, and in practice its R-squared is almost always at least slightly higher. However, adding a lot of additional predictors can lead to a different issue: overfitting. To understand the intuition behind why overfitting is problematic, consider the following plot of rental prices vs. number of bathrooms. We can perfectly predict each datapoint if we fit the zig-zagging line shown below:

[Figure: rent vs. bedrooms for 9 apartments, with the points connected by a zig-zagging dotted line.]

However, imagine that we collect a new sample of apartments in NYC and record the number of bathrooms in each. Then, suppose we want to use our model to predict rental prices. Even if the overall relationship between bathrooms and rent is the same in our new data, the exact values will be slightly different. Predictions based on the zig-zag line may be less accurate because the model was so heavily influenced by the quirks of the data we originally collected. A straight line through the middle of the points is actually more useful.
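The comparison between the zig-zag line and the straight line can be sketched in code. The example below is an illustration under stated assumptions, not the lesson's own analysis: the data are synthetic, the predictor takes nine distinct values so that connecting the points is well defined, and the "true" relationship (`true_rent`) and noise level are invented. The zig-zag is modeled with `np.interp`, which connects the training points with straight segments.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 300.0  # assumed noise level in rent

def true_rent(x):
    # Hypothetical underlying relationship, made up for this sketch
    return 1500.0 + 600.0 * x

# Original sample: 9 apartments with distinct predictor values
x_train = np.arange(1, 10, dtype=float)
y_train = true_rent(x_train) + rng.normal(0, sigma, x_train.size)

def zigzag_predict(x):
    # Connect-the-dots model: passes through every training point
    return np.interp(x, x_train, y_train)

# Straight-line model fit by least squares
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def line_predict(x):
    return intercept + slope * x

# New sample: same relationship, fresh noise (many apartments per x value)
x_new = np.tile(x_train, 1000)
y_new = true_rent(x_new) + rng.normal(0, sigma, x_new.size)

def mse(pred, actual):
    return float(np.mean((pred - actual) ** 2))

train_mse_zigzag = mse(zigzag_predict(x_train), y_train)
train_mse_line = mse(line_predict(x_train), y_train)
new_mse_zigzag = mse(zigzag_predict(x_new), y_new)
new_mse_line = mse(line_predict(x_new), y_new)

print("original data:", train_mse_zigzag, train_mse_line)
print("new data:     ", new_mse_zigzag, new_mse_line)
```

On the original data the zig-zag has essentially zero error, but on the new sample the straight line should have the smaller mean squared error, because the zig-zag has memorized the noise in the first sample.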



Using the bikes dataset, fit a model to predict cnt (the number of bike rentals) based on the temperature (temp), windspeed (windspeed), and whether or not it is a holiday (holiday). Save the fitted model as model1.


Now fit a second model with cnt as the outcome variable and all the same predictors as in model1 plus humidity (hum). Save the fitted model as model2.


Now fit a third model with cnt as the outcome variable and all the same predictors as in model2 plus the day of the week (weekday). Save the fitted model as model3.


Print out the R-squared for all three models. Notice that the R-squared increases from model1 to model2 and from model2 to model3, even though the increase is very small.

This means that adding more predictors allows us to explain only a TINY bit more variation in the number of bike rentals. This may mean that we are overfitting the model to the data.

How could we decide which model to use? Are these extra predictors helping or hurting the model? Let’s keep exploring this in the next exercise!
