
Binary categorical variables are variables with exactly two possible values. In a regression model, these two values are generally coded as 1 or 0. For example, a multiple regression equation from the survey dataset might look like this:

$\text{score} = 32.7 + 8.5*\text{hours\_studied} + 22.5* \text{breakfast}$

breakfast is a binary categorical predictor with two possible values: “ate breakfast,” which is coded as 1 in the model and “didn’t eat breakfast,” which is coded as 0. If we substitute these values for breakfast in the regression equation, we end up with two equations: one for each group.
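As a minimal sketch of this coding step, the snippet below maps the two category labels to 1 and 0. The `responses` list and the `codes` dictionary are made up for illustration and are not part of the survey dataset.

```python
# Hypothetical raw answers (assumed for illustration, not taken from the survey data)
responses = ["ate breakfast", "didn't eat breakfast", "ate breakfast"]

# Map each category label to the 0/1 code used in the regression equation
codes = {"ate breakfast": 1, "didn't eat breakfast": 0}
breakfast = [codes[r] for r in responses]

print(breakfast)  # [1, 0, 1]
```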

For breakfast eaters, we substitute 1 for breakfast and simplify:

$$
\begin{aligned}
\text{score} &= 32.7 + 8.5*\text{hours\_studied} + 22.5*\bm{1} \\
\text{score} &= 32.7 + 8.5*\text{hours\_studied} + 22.5 \\
\text{score} &= (32.7 + 22.5) + 8.5*\text{hours\_studied} \\
\text{score} &= 55.2 + 8.5*\text{hours\_studied}
\end{aligned}
$$

For the group that didn’t eat breakfast, we substitute 0 for breakfast and simplify:

$$
\begin{aligned}
\text{score} &= 32.7 + 8.5*\text{hours\_studied} + 22.5*\bm{0} \\
\text{score} &= 32.7 + 8.5*\text{hours\_studied} + 0 \\
\text{score} &= 32.7 + 8.5*\text{hours\_studied}
\end{aligned}
$$

If we inspect these two equations, we see that the only difference is the larger intercept for the group that ate breakfast (55.2) compared to the group that didn’t eat breakfast (32.7). The coefficient on hours_studied is the same for both groups.
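The same comparison can be checked numerically. The sketch below evaluates the example equation for both groups; the function name `predict_score` is an assumption introduced here, while the coefficients come from the equation above.

```python
def predict_score(hours_studied, breakfast):
    """Evaluate the example regression equation; breakfast must be 0 or 1."""
    return 32.7 + 8.5 * hours_studied + 22.5 * breakfast

# The intercept of each group's line is its prediction at hours_studied = 0
intercept_no_breakfast = round(predict_score(0, 0), 1)
intercept_breakfast = round(predict_score(0, 1), 1)

# The gap between the two groups at several values of hours_studied
gaps = [round(predict_score(h, 1) - predict_score(h, 0), 1) for h in (0, 1, 2.5, 4)]

print(intercept_no_breakfast, intercept_breakfast)  # 32.7 55.2
print(gaps)  # [22.5, 22.5, 22.5, 22.5]
```

The gap equals the breakfast coefficient at every value of hours_studied, which is another way of saying the two lines are parallel.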

We can visualize this regression equation by adding both lines to the scatter plot of score and hours_studied with plt.plot() as follows:

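```python
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot of score vs. hours_studied, colored by breakfast group (no fitted lines)
sns.lmplot(x='hours_studied', y='score', hue='breakfast', markers=['o', 'x'], fit_reg=False, data=survey)
# Line for the group that didn't eat breakfast (breakfast = 0)
plt.plot(survey.hours_studied, 32.7 + 8.5*survey.hours_studied, color='blue', linewidth=5)
# Line for the group that ate breakfast (breakfast = 1)
plt.plot(survey.hours_studied, 55.2 + 8.5*survey.hours_studied, color='orange', linewidth=5)
plt.show()
```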

From the plot, we can see the regression lines have the same slope. The orange line for the breakfast-eaters starts higher, but increases at the same rate as the blue line for the group that didn’t eat breakfast.

### Instructions

1. Code has been provided for you in script.py to fit a regression model predicting port3 based on math1 and address. The fitted model has been saved as model1. Use .params to print the intercept and coefficients from the results and inspect the coefficient for address.

2. The variable address has two values: R for rural (coded as address = 0 in the model) and U for urban (coded as address = 1). Because we’ve included this binary variable in our model, we’ve effectively fit two parallel regression lines: one for students who live at a rural address, and one for students who live at an urban address.

   Using the output from the model, write out the regression equation for when address is equal to R and save the value of the intercept as interceptR. Then, write out the regression equation for when address is equal to U and save the value of the intercept as interceptU. Finally, since the slope on math1 will be the same for both equations, save this value as slope. Round all final values to one decimal place (i.e., the tenths place).

3. The code for the scatter plot of port3 and math1 has been provided for you in script.py. Using the regression equations you created in the last checkpoint, add a blue line to the scatter plot for rural addresses.

4. Using the regression equations with rounded values that you created in the second step, add an orange line to the scatter plot for urban addresses. What’s similar about the two lines you just plotted? What’s different?
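The general pattern behind these steps can be sketched as a small helper that splits one fitted equation into its two group lines. Everything here is an assumption for illustration: the helper name `group_lines`, the coefficient values, and the coefficient labels. The labels `Intercept` and `address[T.U]` are what a statsmodels formula fit typically produces for a two-level text column, so check the labels printed by .params before relying on them.

```python
def group_lines(params, binary_name, slope_name):
    """Split an equation with one binary predictor into two parallel lines.

    params maps coefficient names to values (for example, dict(model1.params)).
    Returns the intercept when the binary predictor is 0, the intercept when
    it is 1, and the shared slope, each rounded to one decimal place.
    """
    base = params["Intercept"]      # assumed label for the intercept
    shift = params[binary_name]     # coefficient on the binary predictor
    slope = params[slope_name]      # coefficient on the quantitative predictor
    return round(base, 1), round(base + shift, 1), round(slope, 1)

# Hypothetical coefficients, NOT the output of model1
example_params = {"Intercept": 10.0, "address[T.U]": 1.26, "math1": 0.5}

intercept0, intercept1, shared_slope = group_lines(example_params, "address[T.U]", "math1")
print(intercept0, intercept1, shared_slope)  # 10.0 11.3 0.5
```

Rounding is applied after adding the binary coefficient to the intercept, so the rounded intercept reflects the final value rather than a sum of rounded parts.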