So far, we’ve used R-squared, adjusted R-squared, and an F-test to compare models. These criteria are most useful for finding a model that best fits an observed set of data. They are often used when our goal is interpreting a model to understand relationships between variables.

If our goal is instead to choose the model that makes the best predictions for new or unobserved data, we may want to use a likelihood-based criterion.

The log-likelihood of a linear regression model essentially measures how probable our observed data are under a particular fitted model. A higher log-likelihood is better.
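To make this more concrete, here is a minimal sketch of how a log-likelihood can be computed from a model's residuals, assuming normally distributed errors and the maximum-likelihood estimate of the error variance. The function name `gaussian_log_likelihood` and the residual values are hypothetical and chosen only for illustration; as far as we know, this is the same quantity that statsmodels reports as `.llf` for an OLS fit, but treat that as an assumption rather than a guarantee.

```python
import math

def gaussian_log_likelihood(residuals):
    """Log-likelihood of a linear model with normal errors, evaluated at the
    maximum-likelihood variance estimate (sum of squared residuals / n)."""
    n = len(residuals)
    ssr = sum(r ** 2 for r in residuals)  # sum of squared residuals
    sigma2 = ssr / n                      # MLE of the error variance (divides by n)
    return -n / 2 * math.log(2 * math.pi) - n / 2 * math.log(sigma2) - n / 2

# Hypothetical residuals (observed minus predicted) from two models
# fit to the same four observations
tight_fit = [1.0, -1.0, 1.0, -1.0]
loose_fit = [3.0, -3.0, 3.0, -3.0]

ll_tight = gaussian_log_likelihood(tight_fit)
ll_loose = gaussian_log_likelihood(loose_fit)

print(ll_tight)
print(ll_loose)
print(ll_tight > ll_loose)  # True: smaller residuals give a higher log-likelihood
```

The model whose predictions sit closer to the observed values has a smaller residual variance, which is what pushes its log-likelihood higher.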

For example, we can compare two models based on log-likelihood as follows:

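import statsmodels.api as sm

# rentals is assumed to be a pandas DataFrame that has already been loaded
model1 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + min_to_subway', data=rentals).fit()
model2 = sm.OLS.from_formula('rent ~ bathrooms + building_age_yrs + borough', data=rentals).fit()

# Print the log-likelihood of each fitted model
print(model1.llf) # Output: -44282.327
print(model2.llf) # Output: -44740.623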

Because model 1 has the higher log-likelihood (-44282.327 is closer to zero, and therefore greater, than -44740.623), we would choose model 1 over model 2.
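Since log-likelihoods are often negative, it is easy to pick the wrong model by comparing magnitudes. The short sketch below uses the two values from the example output and lets Python do the comparison; the dictionary and variable names are hypothetical.

```python
# Log-likelihoods copied from the example output above
log_likelihoods = {"model1": -44282.327, "model2": -44740.623}

# max() picks the greatest value (here, the one closest to zero),
# not the value with the largest magnitude
best_model = max(log_likelihoods, key=log_likelihoods.get)
print(best_model)  # model1
```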



Using the bikes dataset, fit a model to predict the number of bike rentals (cnt) with the following predictors: temperature (temp), windspeed (windspeed), and whether or not it is a holiday (holiday). Save the fitted model as model1.


Now fit a second model to predict cnt using the following predictors: humidity (hum), season (season), and the day of the week (weekday).


Print out the log-likelihood for both models.


Based on the log-likelihood values, which model would you choose? Indicate your answer by setting a variable named which_model equal to 1 if you would choose model1 and equal to 2 if you would choose model2.
