Learn how to compare and choose between linear regression models.

In this lesson, we'll discuss several ways to compare and choose between linear regression models.

For example, suppose that we work at a bike rental company and have a dataset where each row represents a unique day of business (including info about the number of bikes that were rented, the weather, the day of the week, etc.). We might want to use this dataset to predict how many bikes will be rented tomorrow (so we can plan). Alternatively, we might want to use the dataset to understand which factors are most predictive of bike usage.

For either goal, we can use a linear regression model. The problem is: there are many different models that we could create. How do we know which one to use?

This lesson will focus on some common ways to compare and choose a linear model, both for prediction and data analysis.

Introduction

*R-squared* is one of the most common metrics to evaluate linear regression models. We can interpret R-squared as the proportion of variation in an outcome variable that is explained by a linear regression model. More explained variation is generally better.

For example, suppose we have a dataset containing information about apartment rentals in NYC. We can build two different models to predict rental price and print out the R-squared for each model as follows:

```python
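import statsmodels.api as sm

# `rentals` is assumed to be a pandas DataFrame that has already been loaded
# Create and fit the first model to predict rent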
model1 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + min_to_subway', data=rentals).fit()

# Create and fit the second model
model2 = sm.OLS.from_formula('rent ~ bathrooms + building_age_yrs + borough', data=rentals).fit()

# Print out R-squared for both models
print(model1.rsquared) #Output: 0.664
print(model2.rsquared) #Output: 0.596
```
This tells us that the first model (using bedrooms, square-footage, and minutes to the subway) explains about 66.4% of the variation in rental prices, whereas the second model only explains about 59.6% of the variation. This would lead us to choose the first model over the second.
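To make the "proportion of variation explained" interpretation concrete, here is a minimal sketch that computes R-squared by hand. It uses made-up data rather than the rentals dataset, and the variable names are illustrative assumptions only.

```python
import numpy as np

# Synthetic stand-in for rental data (made-up numbers, not the lesson's dataset)
rng = np.random.default_rng(0)
n = 200
size_sqft = rng.uniform(300, 1500, size=n)
rent = 1000 + 3.0 * size_sqft + rng.normal(0, 500, size=n)

# Fit rent ~ size_sqft by ordinary least squares
X = np.column_stack([np.ones(n), size_sqft])  # intercept column + predictor
coefs, *_ = np.linalg.lstsq(X, rent, rcond=None)
fitted = X @ coefs

# R-squared = 1 - (unexplained variation / total variation)
ss_resid = np.sum((rent - fitted) ** 2)
ss_total = np.sum((rent - rent.mean()) ** 2)
r_squared = 1 - ss_resid / ss_total
print(round(r_squared, 3))
```

With a single predictor, this value equals the squared correlation between the predictor and the outcome.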

R-Squared

Let's again suppose that we want to use the StreetEasy data to predict rental prices in NYC. We have the following two models that we want to compare:

```python
# Fit model 1
model1 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + min_to_subway', data=rentals).fit()

# Fit model 2
model2 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + min_to_subway + borough', data=rentals).fit()

# Print out R-squared for both models
print(model1.rsquared) # Output: 0.664
print(model2.rsquared) # Output: 0.728
```
Note that these models both use `bedrooms`, `size_sqft`, and `min_to_subway` as predictors; but `model2` uses `borough` as well. Because all of the predictors in `model1` are also contained in `model2`, these are called *nested models*.

It turns out that a larger nested model will never have a lower R-squared than its smaller counterpart: adding predictors can only keep R-squared the same or increase it. However, adding a lot of additional predictors can lead to a different issue: overfitting. To understand the intuition behind why overfitting is problematic, consider the following plot of rental prices vs. number of bathrooms. We can perfectly predict each datapoint if we fit the zig-zagging line shown below:

![plot showing rent vs. bedrooms for 9 apartments. The points are all connected by a zig-zagging dotted line.](https://static-assets.codecademy.com/skillpaths/master-stats-ii/choosing-linear-regression-model/overfitting.svg)

However, imagine that we collect a new sample of apartments in NYC and record the number of bathrooms in each. Then, suppose we want to use our model to predict rental prices. Even if the overall relationship between bathrooms and rent is the same in our new data, the exact values will be slightly different. Predictions based on the zig-zag line may be less accurate because the model was so heavily influenced by the quirks of the data we originally collected. A straight line through the middle of the points is actually more useful.
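The intuition above can be checked with a small simulation. This is a sketch on made-up data (nine points with a linear trend plus noise), not the data behind the plot: the "zig-zag" model connects the original points, and the straight line is fit by least squares.

```python
import numpy as np

rng = np.random.default_rng(1)

# Nine made-up apartments: rent rises linearly with x, plus random noise
x = np.arange(1, 10, dtype=float)

def simulate_rent(shape):
    return 1000 + 500 * x + rng.normal(0, 300, size=shape)

def rmse(actual, predicted):
    return np.sqrt(np.mean((actual - predicted) ** 2))

rent_old = simulate_rent(x.shape)

# Zig-zag model: connect the observed points, so it reproduces rent_old exactly
zigzag_pred = np.interp(x, x, rent_old)

# Straight-line model fit by least squares
slope, intercept = np.polyfit(x, rent_old, 1)
line_pred = intercept + slope * x

print("original sample:", rmse(rent_old, zigzag_pred), rmse(rent_old, line_pred))

# 1,000 fresh samples at the same x values, with the same underlying relationship
rent_new = simulate_rent((1000, x.size))
print("new samples:    ", rmse(rent_new, zigzag_pred), rmse(rent_new, line_pred))
```

The zig-zag model has zero error on the original sample but a larger error than the straight line on the new samples, because it memorized the noise in the first sample.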

R-Squared and the Dangers of Overfitting

While R-squared is useful for comparing models with different sets of predictors, we saw that it could lead to overfitting when choosing between nested models. 

To address this issue, we can instead use adjusted R-squared, which gives a small penalty for each additional predictor in a model. For example:

```python
model1 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + borough', data=rentals).fit()

model2 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + borough + has_doorman', data=rentals).fit()

print(model1.rsquared) #Output: 0.72761
print(model2.rsquared) #Output: 0.72765

print(model1.rsquared_adj) #Output: 0.72739
print(model2.rsquared_adj) #Output: 0.72738
```
  
Note that the second model (with an additional predictor) has a slightly larger R-squared, but a slightly smaller adjusted R-squared, compared to the first model. Based on the adjusted R-squared, we would choose the smaller model.
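The penalty can be written out explicitly. As a sketch, the function below applies the standard adjusted R-squared formula to the R-squared values printed above. The row count (5,000) and the number of slope coefficients (4 and 5) are assumptions inferred from the outputs, not values the lesson states, and `adjusted_r_squared` is a hypothetical helper name.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors slopes."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# ASSUMPTION: 5,000 rows; 4 slope coefficients in model1 (with borough
# contributing two indicator columns) and 5 in model2
n_obs = 5000
adj1 = adjusted_r_squared(0.72761, n_obs, 4)
adj2 = adjusted_r_squared(0.72765, n_obs, 5)
print(round(adj1, 5))  # 0.72739
print(round(adj2, 5))  # 0.72738
```

Under these assumptions the formula reproduces the printed adjusted values: the tiny gain in R-squared from `has_doorman` does not outweigh the penalty for the extra coefficient.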

Adjusted R-Squared

In the previous exercises, we compared nested models based on adjusted R-squared.

Another way to compare nested models is by using a hypothesis test called an F-test. Suppose we want to compare the following two models:

```python
model1 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft', data=rentals).fit()

model2 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + building_age_yrs + has_elevator', data=rentals).fit()
```
Note that the second model has two more predictors than the first (`building_age_yrs` and `has_elevator`). For an F-test comparing these two models:

- The *null hypothesis* is that the coefficients on `building_age_yrs` and `has_elevator` are equal to zero (they are not useful in explaining the observed variation in rent).
- The *alternative hypothesis* is that at least one of the coefficients is non-zero.

We can run the test in Python as follows:

```python
from statsmodels.stats.anova import anova_lm
anova_results = anova_lm(model1, model2)
print(anova_results)
```

Output: 

|   | df_resid | ssr     | df_diff | ss_diff | F     | Pr(>F)  |
|---|----------|---------|---------|---------|-------|---------|
| 0 | 4997.0   | 1.4e+10 | 0.0     | NaN     | NaN   | NaN     |
| 1 | 4995.0   | 1.4e+10 | 2.0     | 9.2e+08 | 170.9 | 1.6e-72 |


The p-value (`1.6e-72`, which is 1.6 × 10⁻⁷²: a decimal point followed by 71 zeros and then 16) is located in the bottom right corner of this table. The column name `Pr(>F)` means "the probability of observing an F statistic at least as large as the one observed (170.9) if the null hypothesis is true".

Using a significance threshold of 0.05, the p-value is below the threshold. Therefore, we would conclude that at least one of the coefficients on `building_age_yrs` and `has_elevator` is non-zero. Thus, including at least one of these two predictors significantly improves the model.

This would lead us to choose `model2` over `model1`. After running this test, we might also want to separately compare a model with `building_age_yrs` and a model with `has_elevator` to see whether both are necessary. We could do this with separate F-tests or adjusted R-squared.
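To see where the `df_diff`, `ss_diff`, and `F` columns come from, here is a rough sketch that computes the F statistic for two nested models by hand. It uses simulated data and plain NumPy rather than `anova_lm`, and `residual_ss_and_df` is a hypothetical helper. The p-value would then come from an F distribution with (`df_diff`, residual df of the larger model) degrees of freedom, which is not computed here.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Four made-up predictors; in this simulation all of them truly affect y
X_all = rng.normal(size=(n, 4))
y = 1 + X_all @ np.array([2.0, 3.0, 1.5, -2.0]) + rng.normal(size=n)

def residual_ss_and_df(predictors):
    """Fit OLS with an intercept; return (residual sum of squares, residual df)."""
    design = np.column_stack([np.ones(n), predictors])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coefs
    return np.sum(resid ** 2), n - design.shape[1]

ssr_small, df_small = residual_ss_and_df(X_all[:, :2])  # smaller (nested) model
ssr_big, df_big = residual_ss_and_df(X_all)             # larger model

df_diff = df_small - df_big    # number of extra predictors being tested
ss_diff = ssr_small - ssr_big  # drop in residual sum of squares
f_stat = (ss_diff / df_diff) / (ssr_big / df_big)
print(df_diff, round(f_stat, 1))
```

A large F statistic means the extra predictors reduce the unexplained variation by much more than we would expect from noise alone.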

F-Tests

So far, we've used R-squared, adjusted R-squared, and an F-test to compare models. These criteria are most useful for finding a model that best fits an observed set of data. They are often used when our goal is interpreting a model to understand relationships between variables.

If our goal is to choose the best model for making predictions for new/unobserved data, we may want to use a likelihood-based criterion instead.

The *log-likelihood* of a linear regression model is the log of the probability (density) of observing our data under that model, evaluated at the fitted parameters. Higher log-likelihood is better.

For example, we can compare two models based on log likelihood as follows:

```python
model1 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + min_to_subway', data=rentals).fit()

model2 = sm.OLS.from_formula('rent ~ bathrooms + building_age_yrs + borough', data=rentals).fit()

print(model1.llf) #Output: -44282.327
print(model2.llf) #Output: -44740.623
```

Because model 1 has a higher log-likelihood (a smaller negative number is larger), we would choose model 1 over model 2.
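For intuition about what this number is, the sketch below computes the log-likelihood of a fitted line two ways on simulated data: by summing the log of the normal density for each residual, and with the equivalent closed form. To the best of my understanding this matches the quantity statsmodels reports as `.llf` for OLS, but treat that as an assumption rather than a guarantee.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(size=n)

# Fit y ~ x by ordinary least squares
design = np.column_stack([np.ones(n), x])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ coefs

# Maximum-likelihood estimate of the error variance (divides by n)
sigma2 = np.sum(resid ** 2) / n

# 1) Sum the log of the normal density of each residual
log_densities = -0.5 * np.log(2 * np.pi * sigma2) - resid ** 2 / (2 * sigma2)
llf_direct = np.sum(log_densities)

# 2) Equivalent closed form
llf_closed = -n / 2 * (np.log(2 * np.pi) + np.log(sigma2) + 1)

print(llf_direct, llf_closed)
```

Because the log-likelihood depends on the data only through the residual sum of squares, a model with smaller residuals always has a higher log-likelihood on the same dataset.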

Log-Likelihood

Similarly to R-squared, log-likelihood never decreases as we add more predictors to a model. In the same way that adjusted R-squared penalizes R-squared for more predictors, there are criteria that penalize the log-likelihood for more predictors.

The two most commonly used are *Akaike information criterion (AIC)* and *Bayesian information criterion (BIC)*. Both AIC and BIC use negative log-likelihood, so we actually want the model with the LOWEST AIC or BIC.

AIC and BIC are similar, but BIC gives a bigger penalty for each additional predictor, so it is used for finding the best "simple" model. This is useful because it makes the model more interpretable. For example:

```python
model1 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + borough', data=rentals).fit()

model2 = sm.OLS.from_formula('rent ~ bedrooms + size_sqft + borough + has_doorman', data=rentals).fit()

print(model1.llf) #Output: -43756.418
print(model2.llf) #Output: -43756.017

print(model1.aic) #Output: 87522.837
print(model2.aic) #Output: 87524.034

print(model1.bic) #Output: 87555.423
print(model2.bic) #Output: 87563.137
```

We see that the log-likelihood for model 2 is slightly larger (better), but the AIC for model 2 is slightly larger (worse), and BIC even more so. Both AIC and BIC would lead us to choose model 1, whereas log-likelihood would lead us to choose model 2.
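The penalties can be made explicit with the usual definitions, AIC = 2k − 2·(log-likelihood) and BIC = k·ln(n) − 2·(log-likelihood), where k is the number of estimated coefficients and n is the number of rows. The sketch below plugs in the log-likelihoods printed above; the values n = 5000, k = 5, and k = 6 are assumptions inferred from the outputs, not stated in the lesson.

```python
import math

def aic(llf, k):
    """Akaike information criterion: penalty of 2 per estimated coefficient."""
    return 2 * k - 2 * llf

def bic(llf, k, n_obs):
    """Bayesian information criterion: penalty of ln(n_obs) per coefficient."""
    return k * math.log(n_obs) - 2 * llf

# ASSUMPTION: 5,000 rows; 5 coefficients in model1 and 6 in model2 (intercept included)
n_obs = 5000
print(round(aic(-43756.418, 5), 3), round(bic(-43756.418, 5, n_obs), 3))
print(round(aic(-43756.017, 6), 3), round(bic(-43756.017, 6, n_obs), 3))
```

Under these assumptions the results agree with the printed AIC and BIC values up to rounding of the log-likelihood. Since ln(5000) is about 8.5, each extra coefficient costs about 8.5 in BIC but only 2 in AIC, which is why BIC penalizes `has_doorman` more heavily.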

AIC and BIC

Another way of choosing a model to make predictions for new data (also called out-of-sample prediction) is by using *training* and *test* datasets. The idea is that we only use PART of our data to fit the model, then we see how well the model performs in predicting the outcome of interest for the rest of our data. The process is as follows: 

* First, we split our data into two subsets: a training set and a test set. Often, the training set is a larger proportion of the data.
* In other Python libraries, there are built-in functions to split a dataframe, but for the sake of understanding, we'll do it explicitly here by randomly sampling from a list of row indices:

```python
import numpy as np

# Create a list of indices
indices = range(len(rentals))

# Determine the size of the training set (s)
s = int(0.8*len(indices))

# Randomly select 80% of the indices
train_ind = np.random.choice(indices, size = s, replace = False)

# Create a list of the remaining 20% of indices
test_ind = list(set(indices) - set(train_ind))

# Split the data into the training and test sets
rentals_train = rentals.iloc[train_ind]
rentals_test = rentals.iloc[test_ind]
```

* Next, we fit the models we want to compare using the training set data only: 

```python
model1 = sm.OLS.from_formula('rent ~ bedrooms + bathrooms + size_sqft + min_to_subway + floor', data=rentals_train).fit()

model2 = sm.OLS.from_formula('rent ~ bedrooms + bathrooms + size_sqft + min_to_subway + floor + borough', data=rentals_train).fit()
```

* Then, we use those models to predict the rental price for the apartments in the test set:

```python
fitted1 = model1.predict(rentals_test)
fitted2 = model2.predict(rentals_test)
```

* Finally, we can compare the predicted rents to the true rents in the test set and use a metric to determine how well each model performed.
* In this example, we'll use a metric called *predictive root mean squared error (PRMSE)*, which is exactly what the name sounds like: the square root of the mean squared difference between predicted and true values of the outcome variable. A smaller PRMSE means that the model performed better (the predicted values were more similar to the true values):

```python
true = rentals_test.rent
prmse1 = np.mean((true-fitted1)**2)**.5
prmse2 = np.mean((true-fitted2)**2)**.5
print(prmse1) #output: 1326.258
print(prmse2) #output: 1224.269
```

Based on this metric, we would choose the second model over the first one because it has a smaller PRMSE.
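The split and the metric above can be wrapped in two small helpers. This is a sketch, not part of the lesson's code: `split_indices` and `prmse` are hypothetical names, and the seed is added so that the same split can be reproduced on every run.

```python
import numpy as np

def split_indices(n_rows, train_fraction=0.8, seed=0):
    """Return (train_indices, test_indices) as a reproducible random split."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_train = int(train_fraction * n_rows)
    return shuffled[:n_train], shuffled[n_train:]

def prmse(true_values, predicted_values):
    """Square root of the mean squared difference between true and predicted values."""
    true_values = np.asarray(true_values, dtype=float)
    predicted_values = np.asarray(predicted_values, dtype=float)
    return np.sqrt(np.mean((true_values - predicted_values) ** 2))

train_ind, test_ind = split_indices(10)
print(len(train_ind), len(test_ind))  # 8 2

# Three made-up rents and predictions: errors of -100, 100, and 0
print(prmse([1000, 2000, 3000], [1100, 1900, 3000]))
```

With a DataFrame such as `rentals`, the returned arrays could be passed to `rentals.iloc[...]` in the same way as `train_ind` and `test_ind` above.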

Training and Test Sets

Congratulations! In this lesson, you've learned a number of different methods for model comparison:

* For choosing a model that best represents the data we have:
  * R-squared
  * Adjusted R-squared
  * F-test
* For choosing a model for accurate out-of-sample prediction:
  * Log likelihood
  * AIC/BIC
  * Training/test sets

Note that we've covered many different methods for choosing a model and they don't always agree. In order to choose a method, it's important to consider your ultimate goal (analysis vs. prediction) and what you want to prioritize (simplicity and interpretability vs. accuracy).

Review

Choosing a Linear Regression Model

In this module, you'll learn how to choose the best linear regression model for a particular research question. Whether you're interested in prediction or want to understand relationships between multiple variables at once, this module will help you understand common metrics for evaluating and comparing linear regression models.

Learn how to choose the best linear regression model for a particular research question.

Congratulations, you’ve successfully completed the How to Choose a Linear Regression Model course! You've learned how to evaluate your models and confidently choose the right one!

Your learning journey into Linear Regression isn't over yet! Here is our roadmap to mastering Linear Regression:

* [How to Choose a Linear Regression Model](https://www.codecademy.com/learn/how-to-choose-a-linear-regression-model-course) <-- Completed!
* [Master Statistics with Python](https://www.codecademy.com/learn/paths/master-statistics-with-python) <-- Learn More!


Once again, congratulations on finishing the How to Choose a Linear Regression Model course! We are excited to see what you accomplish next.

You’ve completed How to Choose a Linear Regression Model! What’s next?

Next Steps

### Introduction

Statsmodels and scikit-learn are two commonly used packages for linear regression in Python. In this course, we have focused on implementation using statsmodels; however, it is useful to be able to fit models using multiple tools because each one provides functionality that may not be available in the other.

For context, statsmodels was built as an extension to the `scipy.stats` module so as to enable R-like functionality in Python to perform statistical model implementation, testing, and inference. Scikit-learn is built on NumPy and SciPy to enable easy implementation of machine learning algorithms. It also contains a suite of associated model validation methods to fine-tune a model.

### A Comparison

#### statsmodels

   - Pro: Provides comprehensive model summaries, including t-tests for all the coefficients, R-squared, adjusted R-squared, AIC, BIC, log likelihood, F-test, and more.
   - Pro: Allows users to fit models using a formula-based syntax, which makes it relatively simple to test out interaction terms and polynomial terms, compare multiple models, etc.
   - Con: It is missing some useful functions to easily perform operations on statsmodels model objects (e.g., k-fold cross-validation or a train-test split).
   - Con: It is used less widely than scikit-learn, so has less detailed documentation and example code available online.

#### scikit-learn

   - Pro: Contains many easy-to-use functions that can perform operations like k-fold validation, train-test split, etc. in a few lines of code.
   - Pro: Great documentation online and a large community of people who have shared their code and asked/answered questions online.
   - Con: The model object contains more limited information (just coefficients and R-squared).
   - Con: To fit a model, scikit-learn requires users to create the design matrix "by-hand" (or using other libraries), which means it requires an extra step to fit models with categorical variables, interaction terms, and/or polynomial terms.
    

Overall, most people use scikit-learn when they are doing predictive modeling and aren’t concerned with examining the coefficients or their associated statistics. Meanwhile, statsmodels is great for comparing and fitting complex models; however, in order to use scikit-learn’s tools like k-fold cross-validation, you may need to transform your statsmodels model object into a scikit-learn model object.
 

### Implementation

To compare these libraries, let's fit some models with each and compare the results:

#### statsmodels

All of these examples use a dataset of air quality measurements, which is available via statsmodels. The code below uses this data to fit a model to predict temperature (`Temp`) based on ozone levels (`Ozone`), windspeed (`Wind`) and an interaction between `Ozone` and `Wind`.

```python
# Load libraries
import statsmodels.api as sm

# Get some data
data = sm.datasets.get_rdataset('airquality').data
data.dropna(inplace=True)

# Fit model
model = sm.OLS.from_formula('Temp ~ Ozone + Wind + Ozone:Wind', data).fit()
print(model.summary())
```

Output:

```
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   Temp   R-squared:                       0.563
Model:                            OLS   Adj. R-squared:                  0.551
Method:                 Least Squares   F-statistic:                     46.00
Date:                Thu, 08 Apr 2021   Prob (F-statistic):           3.54e-19
Time:                        15:37:34   Log-Likelihood:                -361.26
No. Observations:                 111   AIC:                             730.5
Df Residuals:                     107   BIC:                             741.4
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     79.3074      3.288     24.123      0.000      72.790      85.825
Ozone          0.0202      0.046      0.443      0.659      -0.070       0.111
Wind          -1.0550      0.286     -3.695      0.000      -1.621      -0.489
Ozone:Wind     0.0234      0.006      4.070      0.000       0.012       0.035
==============================================================================
Omnibus:                        4.265   Durbin-Watson:                   1.169
Prob(Omnibus):                  0.119   Jarque-Bera (JB):                3.491
Skew:                          -0.326   Prob(JB):                        0.175
Kurtosis:                       2.426   Cond. No.                     2.22e+03
==============================================================================
```

#### scikit-learn

In scikit-learn, it is relatively easy to fit a model with any predictor set that already exists in our data. For example, we can fit a model to predict temperature based on ozone level (`Ozone`) and windspeed (`Wind`) as follows:

```python
from sklearn.linear_model import LinearRegression

X = data[['Ozone', 'Wind']]
y = data[['Temp']]

# Fit model
model = LinearRegression()
model.fit(X, y)
print(model.intercept_)
print(model.coef_)
```

Output:

```
[73.14445315]
[[ 0.18059202 -0.29723628]]
```


However, if we want to add interaction terms, polynomial terms, or anything else more complex, we need to do that ahead of time. For example, if we want to add an interaction between `Ozone` and `Wind` like we did in statsmodels, we can create a new column in our dataset named `OzoneWind`, which is derived by multiplying `Ozone` and `Wind` together. Then, we can add that column to our model and produce the same coefficients as we calculated with statsmodels:

```python
data['OzoneWind']= data.Ozone*data.Wind
X = data[['Ozone', 'Wind', 'OzoneWind']]
y = data[['Temp']]

# Fit model
model = LinearRegression()
model.fit(X, y)
print(model.intercept_)
print(model.coef_)
```

Output:

```
[79.30741717]
[[ 0.02024914 -1.05495668  0.02342465]]
```

Alternatively, we could create the X matrix with formula notation via the patsy module. Note that we have to include a `0 + ` in front of our formula so that patsy doesn't automatically generate a column of `1`s in the `X` matrix for the intercept (`sklearn.linear_model.LinearRegression` does this under the hood). The code to implement this is shown below:

```python
import patsy

# Fit model
y, X = patsy.dmatrices('Temp ~ 0 + Ozone + Wind + Ozone:Wind', data)
model = LinearRegression() 
model.fit(X, y) 
print(model.intercept_)
print(model.coef_)
```

Output:

```
[79.30741717]
[[ 0.02024914 -1.05495668  0.02342465]]
```

Note that we calculated the same intercept and coefficients using statsmodels and scikit-learn. While statsmodels gave us more information about the model and coefficients, there are some operations that are much simpler in scikit-learn. For example, the following lines of code will split our data into training and test sets. There is no function in statsmodels to easily do the same.

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) 
```
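The comparison earlier also lists k-fold validation among the operations scikit-learn handles in a few lines. As a hedged sketch, the code below runs 5-fold cross-validation on simulated data (not the air quality dataset); the names `X_demo` and `y_demo` are illustrative, and the scoring string assumes a reasonably recent scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Simulated data: two predictors with a linear effect plus noise
rng = np.random.default_rng(4)
n = 120
X_demo = rng.normal(size=(n, 2))
y_demo = 70 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(size=n)

# Shuffle the rows, then split them into 5 folds
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports the negative RMSE so that "larger is better"
neg_rmse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=folds, scoring="neg_root_mean_squared_error",
)
rmse_per_fold = -neg_rmse
print(rmse_per_fold.round(2))
print(rmse_per_fold.mean())
```

Each fold takes one turn as the test set while the model is fit on the other four, so the average gives an out-of-sample error estimate that depends less on any single split.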


### Recap

In this article, you have learned about two different modules for fitting linear regression models in Python. Both statsmodels and scikit-learn have pros and cons for different applications, so there is no right or wrong choice between them! However, understanding these two different implementations will help you become a more flexible statistician, data scientist, or analyst, who can adapt to ever-changing technologies and choose an appropriate option for the task at hand.

Learn about the differences between scikit-learn and statsmodels with respect to linear regression in Python.

Linear Models in scikit-learn vs. statsmodels

AIC and BIC include a penalty for additional predictors whereas log-likelihood will always favor larger nested models.

AIC and BIC will find the best model to fit a dataset whereas log-likelihood will find the best model for making predictions for new/unobserved data.

AIC and BIC are used for comparing non-nested models whereas log likelihood is used to compare nested models.

`model2` has an equal or larger R-squared than `model1`.

`model2` has an equal or larger adjusted R-squared than `model1`.

`model2` has an equal or smaller AIC than `model1`.

`model2` has an equal or smaller BIC than `model1`.

`'heart_rate ~ age + exercise + altitude'`

`'heart_rate ~ age + exercise + temperature'`

to find the model with the largest possible R-squared.

to find the model with the largest possible log likelihood.

Test your knowledge about how to choose a linear regression model!

Compare linear regression models to predict housing prices on Craigslist.

Fit a model that predicts `price` using `type`, `sqfeet`, `beds`, and `baths` as predictors. Save the fitted model as `model1`. 

Fit a model that predicts `price` using `type`, `sqfeet`, `beds`, `baths`, `comes_furnished`, `laundry_options`, `parking_options`, and `smoking_allowed` as predictors. Save the fitted model as `model2`. 

Note that `model1` and `model2` are *nested models* because `model2` contains all of the predictors in `model1`.

Fit a model that predicts `price` using `type`, `sqfeet`, `beds`, `baths`, `comes_furnished`, `laundry_options`, `parking_options`, `smoking_allowed`, `cats_allowed`, and `dogs_allowed` as predictors. Save the fitted model as `model3`. 

Note that `model3`, `model2`, and `model1` are *nested models* because `model2` contains all of the predictors in `model1` and `model3` contains all of the predictors in `model2`.

Print the R-squared for all three models. Approximately what proportion of variation in rental prices can be described using the largest predictor set (`model3`)?

Print out the adjusted R-squared for all three models. Based on adjusted R-squared, which model fits the data best?

Note that the two extra predictors in `model3` (compared to `model2`) are related to pet policies (`cats_allowed` and `dogs_allowed`). Based on your answer to the above: holding all other predictors constant, is there a significant relationship between a housing option's pet policy and its price?

Use the `anova_lm()` function from `statsmodels` (which has already been imported for you in **script.py**) to run an F-test comparing `model2` and `model3`, then print the results. 

Using a significance threshold of 0.05, are the coefficients on `cats_allowed` and `dogs_allowed` significantly different from zero? In other words: holding all other predictors constant, is there a significant relationship between a housing option's pet policy and its price? 

Does your answer based on the F-test match your answer based on adjusted R-squared? Note that these two criteria don't have to agree!



Print the log-likelihood for all three models. Which model has the largest log-likelihood? Does this make sense?


Print the AIC for all three models. Based on AIC, which model fits the data best? 

Would you choose the same model based on AIC as you would based on adjusted R-squared and the F-test?

Print the BIC for all three models. Based on BIC, which model fits the data best? 

Note that BIC tends to favor simpler models with fewer predictors. Would you choose the same model based on BIC as you would based on AIC, adjusted R-squared and the F-test?

We've provided you with code in **script.py** to split the `housing` data into training and test sets. These are saved as `housing_train` and `housing_test`, respectively.

Re-fit `model2` and `model3` using the training dataset and re-save the fitted models as `model2_train` and `model3_train`.

Calculate the fitted values for the test dataset based on `model2_train` and `model3_train`. Save them as `fitted_mod2` and `fitted_mod3`, respectively.

Calculate and print the predictive root mean squared error (PRMSE) for models 2 and 3.

Based on PRMSE, which model performs best with respect to out-of-sample prediction?

In this project, we saw that `model2` and `model3` performed very similarly. `model3` edged out `model2` in most comparisons, but only by a small amount. 

Note that the process of calculating PRMSE involves randomly splitting the data into training and test datasets. Depending on how we split the data, we'll calculate slightly different PRMSE values. If two models have very similar PRMSEs, then different models may "win" depending on how we split the data.

Toward the beginning of **script.py** we've set a random seed using `np.random.seed(1)` to control the way the data is split. Try changing the random seed to a different number besides `1`, then re-run the code. Does model 3 still have a smaller PRMSE?

Try a few more times with different numbers. Can you get a sense for whether model 3 wins out more often &mdash; or is it a toss-up?
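One way to organize this kind of experiment is to loop over seeds and count how often each model wins. The sketch below does this on simulated data with a deliberately weak second predictor; it is not the Craigslist data or the code in **script.py**, and `holdout_rmse` is a hypothetical helper.

```python
import numpy as np

# Simulated data: x2 has only a weak effect on y
data_rng = np.random.default_rng(0)
n = 300
x1 = data_rng.normal(size=n)
x2 = data_rng.normal(size=n)
y = 5 + 2 * x1 + 0.1 * x2 + data_rng.normal(size=n)

def holdout_rmse(columns, train, test):
    """Fit OLS on the training rows; return the PRMSE on the held-out rows."""
    design = np.column_stack([np.ones(n)] + columns)
    coefs, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
    return np.sqrt(np.mean((y[test] - design[test] @ coefs) ** 2))

n_splits = 20
larger_model_wins = 0
for seed in range(n_splits):
    # A different seed gives a different 80/20 split of the same data
    shuffled = np.random.default_rng(seed).permutation(n)
    train, test = shuffled[:240], shuffled[240:]
    small = holdout_rmse([x1], train, test)
    large = holdout_rmse([x1, x2], train, test)
    larger_model_wins += int(large < small)

print(f"larger model had the smaller PRMSE in {larger_model_wins} of {n_splits} splits")
```

If the count is close to half of the splits, the two models are effectively tied for out-of-sample prediction, and the simpler one is usually the safer choice.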

Now that you've explored three potential models for housing prices, see if you can improve upon these models using additional terms. Are there any interactions or polynomial terms that you think may improve the model? 

Craigslist Analysis

### Why How to Choose a Linear Regression Model? 
While it can be easy to make a model, the real science comes in choosing which model best fits your problem, and tuning your model to be just right. This course is an introduction to tools, techniques, and best practices for choosing a linear regression model and how to report your choices. 

### Take-Away Skills 
In this course, you will learn how to decide quantitatively between different models, and evaluate model performance. We will cover both simple and multiple linear regression. You will learn how to interpret your findings, and make recommendations for which models best answer which questions. 

Learn about the differences between different regression models and how to decide which one to use.

How to Choose a Linear Regression Model
