So far, we’ve used R-squared, adjusted R-squared, and an F-test to compare models. These criteria are most useful for finding a model that best fits an observed set of data. They are often used when our goal is interpreting a model to understand relationships between variables.
If our goal is to choose the best model for making predictions on new, unobserved data, we may want to use a likelihood-based criterion instead.
The log-likelihood of a linear regression model measures the log of the probability of observing our data given that particular model; a higher log-likelihood indicates a better fit.
For example, we can compare two models based on log likelihood as follows:
import statsmodels.api as sm

model1 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + min_to_subway', data=rentals).fit()
model2 = sm.OLS.from_formula('rent ~ bathrooms + building_age_yrs + borough', data=rentals).fit()

print(model1.llf)  # Output: -44282.327
print(model2.llf)  # Output: -44740.623
Because model 1 has the higher log-likelihood (-44282.327 is greater than -44740.623), we would choose model 1 over model 2.
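If you're curious where these numbers come from: for OLS with normally distributed errors, the maximized log-likelihood has a closed form in terms of the sample size and the residual sum of squares. Here is a minimal sketch recomputing model1.llf by hand (assuming model1 from the example above):

import numpy as np

n = model1.nobs    # number of observations
ssr = model1.ssr   # residual sum of squares

# Maximized Gaussian log-likelihood for OLS:
# ll = -(n/2) * (log(2*pi) + log(SSR/n) + 1)
ll = -(n / 2) * (np.log(2 * np.pi) + np.log(ssr / n) + 1)

print(ll)  # should match model1.llf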
Instructions
1. Using the bikes dataset, fit a model to predict the number of bike rentals (cnt) with the following predictors: temperature (temp), windspeed (windspeed), and whether or not it is a holiday (holiday). Save the fitted model as model1.
2. Now fit a second model to predict cnt using the following predictors: humidity (hum), season (season), and the day of the week (weekday). Save the fitted model as model2.
3. Print out the log-likelihood for both models.
4. Based on the log-likelihood values, which model would you choose? Indicate your answer by setting a variable named which_model equal to 1 if you would choose model1, or equal to 2 if you would choose model2.
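For reference, here is a sketch of one possible solution, assuming the bikes DataFrame is already loaded with the column names given above (season and weekday are treated here as the numeric codes they are stored as; wrapping them in C() to treat them as categorical would also be reasonable):

import statsmodels.api as sm

# 1. Predict bike rentals from temperature, windspeed, and holiday
model1 = sm.OLS.from_formula('cnt ~ temp + windspeed + holiday', data=bikes).fit()

# 2. Predict bike rentals from humidity, season, and weekday
model2 = sm.OLS.from_formula('cnt ~ hum + season + weekday', data=bikes).fit()

# 3. Print the log-likelihood of each model
print(model1.llf)
print(model2.llf)

# 4. Choose the model with the higher log-likelihood
which_model = 1 if model1.llf > model2.llf else 2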