Let’s again suppose that we want to use the StreetEasy data to predict rental prices in NYC. We have the following two models that we want to compare:
```python
import statsmodels.api as sm

# Fit model 1
model1 = sm.OLS.from_formula(
    'rent ~ bedrooms + size_sqft + min_to_subway',
    data=rentals).fit()

# Fit model 2 (adds borough as a predictor)
model2 = sm.OLS.from_formula(
    'rent ~ bedrooms + size_sqft + min_to_subway + borough',
    data=rentals).fit()

# Print out R-squared for both models
print(model1.rsquared) # Output: 0.664
print(model2.rsquared) # Output: 0.728
```
Note that these models both use `bedrooms`, `size_sqft`, and `min_to_subway` as predictors, but `model2` uses `borough` as well. Because all of the predictors in `model1` are also contained in `model2`, these are called nested models.
It turns out that the R-squared of a larger nested model is ALWAYS at least as high as that of its smaller counterpart (and, with real data, almost always higher). However, adding a lot of additional predictors can lead to a different issue: overfitting. To understand the intuition behind why overfitting is problematic, consider the following plot of rental prices vs. number of bathrooms. We can perfectly predict each datapoint if we fit the zig-zagging line shown below:
However, imagine that we collect a new sample of apartments in NYC and record the number of bathrooms in each. Then, suppose we want to use our model to predict rental prices. Even if the overall relationship between bathrooms and rent is the same in our new data, the exact values will be slightly different. Predictions based on the zig-zag line may be less accurate because the model was so heavily influenced by the quirks of the data we originally collected. A straight line through the middle of the points is actually more useful.
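To make this intuition concrete, here is a minimal sketch using simulated data (not the StreetEasy dataset): a wiggly, high-degree polynomial plays the role of the zig-zag line, and a straight line plays the role of the simpler model. The flexible model fits the original sample almost perfectly but does worse on a fresh sample drawn from the same underlying relationship.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n=10):
    # Simulated data: rent rises linearly with bathrooms, plus noise
    bathrooms = rng.uniform(1, 4, n)
    rent = 2000 + 1500 * bathrooms + rng.normal(0, 400, n)
    return bathrooms, rent

x_train, y_train = sample()
x_new, y_new = sample()  # a "new sample" of apartments

# Straight line vs. a wiggly degree-9 polynomial (the "zig-zag")
line = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)
wiggle = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# The polynomial wins on the data it was fit to...
print(r_squared(y_train, line(x_train)), r_squared(y_train, wiggle(x_train)))
# ...but the straight line generalizes better to the new sample
print(r_squared(y_new, line(x_new)), r_squared(y_new, wiggle(x_new)))
```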
Instructions
Using the `bikes` dataset, fit a model to predict `cnt` (the number of bike rentals) based on the temperature (`temp`), windspeed (`windspeed`), and whether or not it is a holiday (`holiday`). Save the fitted model as `model1`.
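One possible solution, sketched under the assumption that `bikes` has already been loaded as a pandas DataFrame with the columns named above:

```python
import statsmodels.api as sm

# Fit model 1: predict rentals from temperature, windspeed, and holiday status
model1 = sm.OLS.from_formula('cnt ~ temp + windspeed + holiday', data=bikes).fit()
```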
Now fit a second model with `cnt` as the outcome variable and all the same predictors as in `model1`, plus humidity (`hum`). Save the fitted model as `model2`.
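Building on the sketch above, `model2` simply adds `hum` to the formula:

```python
# Fit model 2: same predictors as model1, plus humidity
model2 = sm.OLS.from_formula('cnt ~ temp + windspeed + holiday + hum', data=bikes).fit()
```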
Now fit a third model with `cnt` as the outcome variable and all the same predictors as in `model2`, plus the day of the week (`weekday`). Save the fitted model as `model3`.
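And `model3` adds `weekday` on top of `model2`'s predictors:

```python
# Fit model 3: same predictors as model2, plus day of the week
model3 = sm.OLS.from_formula('cnt ~ temp + windspeed + holiday + hum + weekday', data=bikes).fit()
```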
Print out the R-squared for all three models. Notice that the R-squared increases from `model1` to `model2` and from `model2` to `model3`, even though the increase is very small.
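Printing the three values side by side makes the (small) increases easy to compare:

```python
# R-squared never decreases as we add predictors to a nested model
print(model1.rsquared)
print(model2.rsquared)
print(model3.rsquared)
```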
In other words, the extra predictors explain only a TINY bit of additional variation in the number of bike rentals. This may be a sign that we are overfitting the model to the data.
How could we decide which model to use? Are these extra predictors helping or hurting the model? Let’s keep exploring this in the next exercise!