In the previous exercises, we used a quantitative predictor in our linear regression, but it’s important to note that we can also use categorical predictors. The simplest case of a categorical predictor is a binary variable (only two categories).
For example, suppose we surveyed 100 adults and asked them to report their height in cm and whether or not they play basketball. We've coded the variable play_bball so that it is equal to 1 if the person plays basketball and 0 if they do not. A plot of height vs. play_bball is shown below:
We see that people who play basketball tend to be taller than people who do not. Just like before, we can draw a line to fit these points. Take a moment to think about what that line might look like!
You might have guessed (correctly!) that the best fit line for this plot is the one that goes through the mean height for each group. To re-create the scatter plot with the best fit line, we could use the following code:
```python
# Calculate group means
print(data.groupby('play_bball').mean().height)
```
Output:
play_bball | height
---|---
0 | 169.016
1 | 183.644
```python
# Create scatter plot
plt.scatter(data.play_bball, data.height)

# Add the line using calculated group means
plt.plot([0, 1], [169.016, 183.644])

# Show the plot
plt.show()
```
This will output the following plot (without the additional labels or colors):
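If you'd like to convince yourself that this line really is the least-squares fit, you can regress height on play_bball directly. The sketch below assumes the statsmodels library is available and that data is the DataFrame from the example above; the fitted intercept should match the mean height of non-players (about 169.016), and the intercept plus the play_bball coefficient should match the mean height of players (about 183.644).

```python
import statsmodels.formula.api as smf

# Fit the regression height ~ play_bball on the survey data
model = smf.ols('height ~ play_bball', data=data).fit()

# The intercept is the mean height of non-players;
# the play_bball coefficient is the difference between the two group means
print(model.params)
```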
Instructions
Using the dataset students (which has been loaded for you in script.py), plot a scatter plot of score (y-axis) against breakfast (x-axis) to see scores for students who did and did not eat breakfast.
Code has been provided for you in script.py to calculate the mean test score for students who ate breakfast and the mean score for students who did not eat breakfast. Use these numbers to plot the best-fit line on top of the scatter plot.
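If you get stuck, here is a minimal sketch of one possible approach. It assumes students is the DataFrame loaded in script.py with columns breakfast (0 or 1) and score; the group means are recomputed here for illustration, but you can substitute the values already calculated for you in script.py.

```python
import matplotlib.pyplot as plt

# Mean score for each breakfast group (0 = did not eat breakfast, 1 = ate breakfast)
group_means = students.groupby('breakfast').mean().score

# Scatter plot of score (y-axis) vs. breakfast (x-axis)
plt.scatter(students.breakfast, students.score)

# Best-fit line passes through the two group means
plt.plot([0, 1], [group_means[0], group_means[1]])

plt.show()
```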