R-squared is one of the most common metrics for evaluating linear regression models. We can interpret R-squared as the proportion of variation in an outcome variable that is explained by a linear regression model. A higher R-squared means the model accounts for more of that variation, which is generally better.
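
To make the definition concrete, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch in plain Python, assuming a single predictor. The helper names `fit_line` and `r_squared` and the toy data are hypothetical and are not taken from the rentals dataset.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def r_squared(y, y_pred):
    """Proportion of variation in y explained by the predictions."""
    y_mean = sum(y) / len(y)
    # Variation left unexplained by the model
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    # Total variation of the outcome around its mean
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Toy data (made up for illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

intercept, slope = fit_line(x, y)
y_pred = [intercept + slope * xi for xi in x]
r2 = r_squared(y, y_pred)
print(round(r2, 3))  # 0.6
```

Here the fitted line explains about 60% of the variation in the toy outcome. For an OLS model that includes an intercept, this is the same quantity that statsmodels reports as `rsquared`.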

For example, suppose we have a dataset containing information about apartment rentals in NYC. We can build two different models to predict rental price and print out the R-squared for each model as follows:

    import statsmodels.api as sm

    # rentals is assumed to be a pandas DataFrame that has already been loaded

    # Create and fit the first model to predict rent
    model1 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + min_to_subway', data=rentals).fit()

    # Create and fit the second model
    model2 = sm.OLS.from_formula('rent ~ bathrooms + building_age_yrs + borough', data=rentals).fit()

    # Print out R-squared for both models
    print(model1.rsquared)  # Output: 0.664
    print(model2.rsquared)  # Output: 0.596

This tells us that the first model (using bedrooms, square footage, and minutes to the subway) explains about 66.4% of the variation in rental prices, whereas the second model explains only about 59.6% of the variation. Based on R-squared alone, this would lead us to choose the first model over the second.
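
The comparison above can be sketched in a few lines of Python. This is only an illustration: it uses the two R-squared values reported for the rental models, and the names `r_squared_by_model` and `best_model` are hypothetical.

```python
# R-squared values reported above for the two rental models
r_squared_by_model = {1: 0.664, 2: 0.596}

# Express each value as a percentage of explained variation
for model_id, r2 in r_squared_by_model.items():
    print(f"model{model_id} explains about {r2:.1%} of the variation in rent")

# Choose the model with the larger R-squared
best_model = max(r_squared_by_model, key=r_squared_by_model.get)
print(best_model)  # 1
```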



Using the bikes dataset, fit a model to predict cnt (the number of bike rentals) based on the temperature (temp), windspeed (windspeed), and whether or not it is a holiday (holiday). Save the fitted model as model1.


Using the bikes dataset, fit a second model to predict cnt (the number of bike rentals) based on humidity (hum), season (season), and the day of the week (weekday). Save the fitted model as model2.


Print out the R-squared for both models.


Based on the R-squared values, which model would you choose? Indicate your answer by setting a variable named which_model equal to 1 if you would choose model1 and equal to 2 if you would choose model2.
