Binary categorical variables are variables with exactly two possible values. In a regression model, these two values are generally coded as 1 and 0. For example, a multiple regression equation from the survey dataset might look like this:
score = 32.7 + 8.5 * hours_studied + 22.5 * breakfast
breakfast is a binary categorical predictor with two possible values: “ate breakfast,” which is coded as 1 in the model, and “didn’t eat breakfast,” which is coded as 0. If we substitute these values for breakfast in the regression equation, we end up with two equations: one for each group.
For breakfast eaters, we substitute 1 for breakfast and simplify:
score = 32.7 + 8.5 * hours_studied + 22.5 * 1
score = 55.2 + 8.5 * hours_studied
For the group that didn’t eat breakfast, we substitute 0 for breakfast and simplify:
score = 32.7 + 8.5 * hours_studied + 22.5 * 0
score = 32.7 + 8.5 * hours_studied
If we inspect these two equations, we see that the only difference is the intercept, which is larger for the group that ate breakfast (55.2) than for the group that didn’t eat breakfast (32.7). The coefficient on hours_studied is the same for both groups.
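The substitution can also be checked numerically. The sketch below is only an illustration: the names B0, B_HOURS, B_BREAKFAST, predict_score, and group_line are made up for this example, and the breakfast coefficient of 22.5 is an assumption inferred from the two intercepts quoted above (55.2 - 32.7).

```python
# Coefficients read off the lesson's equations. B_BREAKFAST is inferred
# as 55.2 - 32.7 (an assumption, not a value quoted directly in the text).
B0 = 32.7            # intercept when breakfast = 0
B_HOURS = 8.5        # slope on hours_studied
B_BREAKFAST = 22.5   # shift in the intercept when breakfast = 1


def predict_score(hours_studied, breakfast):
    """Predicted score from the full regression equation."""
    return B0 + B_HOURS * hours_studied + B_BREAKFAST * breakfast


def group_line(breakfast):
    """Return (intercept, slope) of the line for one breakfast group."""
    intercept = predict_score(0, breakfast)
    slope = predict_score(1, breakfast) - intercept
    return round(intercept, 1), round(slope, 1)


no_breakfast_line = group_line(0)
breakfast_line = group_line(1)
print("didn't eat breakfast:", no_breakfast_line)
print("ate breakfast:", breakfast_line)

# The vertical gap between the two lines is the same at every value of x.
gaps = [round(predict_score(h, 1) - predict_score(h, 0), 1) for h in range(6)]
print("gap at 0..5 hours:", gaps)
```

Both groups report the same slope, and the gap between the two predictions does not change with hours_studied, which is what "same slope, different intercept" means in practice.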
We can visualize this regression equation by adding both lines to the scatter plot of score and hours_studied with plt.plot() as follows:
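import seaborn as sns
import matplotlib.pyplot as plt

# survey is assumed to be a pandas DataFrame that has already been loaded
# Scatter plot of score vs. hours_studied, colored by breakfast group
sns.lmplot(x='hours_studied', y='score', hue='breakfast', markers=['o', 'x'], fit_reg=False, data=survey)
# Regression line for the group that didn't eat breakfast (breakfast = 0)
plt.plot(survey.hours_studied, 32.7 + 8.5 * survey.hours_studied, color='blue', linewidth=5)
# Regression line for the group that ate breakfast (breakfast = 1)
plt.plot(survey.hours_studied, 55.2 + 8.5 * survey.hours_studied, color='orange', linewidth=5)
plt.show()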
From the plot, we can see that the regression lines have the same slope. The orange line for the breakfast eaters starts higher but increases at the same rate as the blue line for the group that didn’t eat breakfast.
Instructions
Code has been provided for you in script.py to fit a regression model predicting port3 based on math1 and address. The fitted model has been saved as model1. Use .params to print the intercept and coefficients from the results and inspect the coefficient for address.
The variable address has two values: R for rural (coded as address = 0 in the model) and U for urban (coded as address = 1). Because we’ve included this binary variable in our model, we’ve actually fit two separate regression lines: one for students who live at a rural address, and one for students who live at an urban address.
Using the output from the model, write out the regression equation for when address is equal to R and save the value of the intercept as interceptR. Then, write out the regression equation for when address is equal to U and save the value of the intercept as interceptU. Finally, since the slope on math1 will be the same for both equations, save this value as slope. Round all final values to one decimal place (i.e., the tenths place).
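The arithmetic behind this step can be sketched as follows. The numbers are hypothetical placeholders, not the real output of model1, and a plain dictionary stands in for model1.params. The label address[T.U] is how the statsmodels formula interface typically names the coefficient for the U level of a categorical variable; compare it with the labels printed by your own model before relying on it.

```python
# Hypothetical stand-in for model1.params. These numbers are placeholders,
# NOT the real output of the exercise's model.
params = {
    "Intercept": 10.04,
    "address[T.U]": 0.74,
    "math1": 0.27,
}

# Rural (address = 0): the address term drops out of the equation.
interceptR = round(params["Intercept"], 1)

# Urban (address = 1): the address coefficient is added to the intercept.
interceptU = round(params["Intercept"] + params["address[T.U]"], 1)

# The slope on math1 is shared by both lines.
slope = round(params["math1"], 1)

print(f"rural: port3 = {interceptR} + {slope} * math1")
print(f"urban: port3 = {interceptU} + {slope} * math1")
```

With the real model, the same lookups should work on model1.params directly, since a pandas Series can be indexed by label in the same way as the dictionary used here.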
The code for the scatter plot of port3 and math1 has been provided for you in script.py. Using the regression equations you created in the last checkpoint, add a blue line to the scatter plot for rural addresses.
Using the regression equations with rounded values that you created in the second step, add an orange line to the scatter plot for urban addresses. What’s similar about the two lines you just plotted? What’s different?