So far, we’ve used R-squared, adjusted R-squared, and an F-test to compare models. These criteria are most useful for finding a model that best fits an observed set of data. They are often used when our goal is interpreting a model to understand relationships between variables.
If our goal is to choose the best model for making predictions for new/unobserved data, we may want to use a likelihood-based criterion instead.
The log-likelihood of a linear regression model essentially measures how probable our observed data are under a particular fitted model. When comparing models fit to the same data, a higher log-likelihood is better.
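To make the definition concrete, here is a minimal sketch of how this number can be computed by hand for a one-predictor regression. It assumes normally distributed errors and plugs in the maximum-likelihood variance estimate (the sum of squared residuals divided by n). The toy size and rent values and the helper names fit_simple_ols and gaussian_log_likelihood are hypothetical and only for illustration; they do not come from the rentals data.

```python
import math

def fit_simple_ols(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

def gaussian_log_likelihood(residuals):
    """Log-likelihood under normal errors, evaluated at the
    maximum-likelihood variance estimate SSR / n (an assumption of this sketch)."""
    n = len(residuals)
    ssr = sum(r * r for r in residuals)
    sigma2 = ssr / n
    return -0.5 * n * (math.log(2 * math.pi) + math.log(sigma2) + 1)

# Hypothetical toy data: apartment size (sq ft) and monthly rent
size = [500, 650, 800, 950, 1100]
rent = [1500, 1900, 2250, 2700, 3000]

intercept, slope = fit_simple_ols(size, rent)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(size, rent)]
llf = gaussian_log_likelihood(residuals)
print(round(llf, 3))
```

Smaller residuals give a smaller variance estimate and therefore a higher (less negative) log-likelihood, which is why a higher value indicates a closer fit.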
For example, we can compare two models based on log-likelihood as follows:
import statsmodels.api as sm

# Fit two candidate models on the same rentals data
model1 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + min_to_subway', data=rentals).fit()
model2 = sm.OLS.from_formula('rent ~ bathrooms + building_age_yrs + borough', data=rentals).fit()

# .llf holds each fitted model's log-likelihood
print(model1.llf) # Output: -44282.327
print(model2.llf) # Output: -44740.623
Because model 1 has the higher log-likelihood (for negative numbers, the one closer to zero is larger, so -44282.327 > -44740.623), we would choose model 1 over model 2.
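The same comparison can be written in code so that the sign of the numbers is handled for us. This is only a small sketch using the two values printed above; the dictionary and variable names are made up for illustration.

```python
# Log-likelihoods reported for the two rentals models above
log_likelihoods = {"model1": -44282.327, "model2": -44740.623}

# max() picks the largest value, i.e. the negative number closest to zero
best = max(log_likelihoods, key=log_likelihoods.get)
print(best)  # model1
```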
Using the bikes dataset, fit a model to predict the number of bike rentals (cnt) with the following predictors: temperature (temp), windspeed (windspeed), and whether or not it is a holiday (holiday). Save the fitted model as model1.
Now fit a second model to predict cnt using the following predictors: humidity (hum), season (season), and the day of the week.
Print out the log-likelihood for both models.
Based on the log-likelihood values, which model would you choose? Indicate your answer by setting a variable named which_model equal to 1 if you would choose model1 and equal to 2 if you would choose model2.