With the data from Codecademy University, we want to predict whether each student will pass their final exam. Recall that in linear regression, we fit a line of the following form to the data:
y = b_0 + b_1x_1 + b_2x_2 + … + b_nx_n
where
y is the value we are trying to predict
b_0 is the intercept of the regression line
b_1, b_2, …, b_n are the coefficients
x_1, x_2, …, x_n are the predictors (also sometimes called features)
For our data, y is a binary variable, equal to either 1 (passing) or 0 (failing). We have only one predictor (x_1): num_hours_studied. Below we’ve fitted a linear regression model to our data and plotted the results. The best fit line is in red.
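We see that the linear model does not fit the data well. Our goal is to predict whether a student passes (1) or fails (0); however, the predictions of a best fit line are unbounded. They can range from negative to positive infinity, so the line can output values below 0 or above 1 that do not correspond to either outcome.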
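To make the problem concrete, here is a minimal sketch that fits a one-predictor least-squares line to a binary outcome. The data below is made up for illustration and is not the Codecademy University dataset; the helper names fit_line and predict are also hypothetical.

```python
# Made-up data for illustration only; NOT the Codecademy University dataset.
hours = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]   # num_hours_studied
passed = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]        # 1 = pass, 0 = fail


def fit_line(x, y):
    """Ordinary least squares with a single predictor; returns (b_0, b_1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b_1 = s_xy / s_xx              # slope
    b_0 = y_mean - b_1 * x_mean    # intercept
    return b_0, b_1


b_0, b_1 = fit_line(hours, passed)


def predict(num_hours):
    """Value of the fitted line y = b_0 + b_1 * x_1 at the given hours."""
    return b_0 + b_1 * num_hours


for h in (0, 10, 30):
    print(f"{h:>2} hours -> {predict(h):.2f}")
```

On this toy data the line predicts roughly -0.18 at 0 hours and 1.73 at 30 hours. Neither value is a valid pass/fail label, which is the weakness described above.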
Instructions
We’ve provided you with the code to train a linear regression model on the Codecademy University data and plot the regression line. Run the code and observe the plot. Expand the plot to fullscreen for a larger view.
Using the regression line, estimate the predicted outcomes (given by the line) for students who study 0 hours, 10 hours, and 30 hours, respectively. Save the results to slacker, average, and studious.
How would you use these numerical outcomes to determine whether a student is predicted to pass or fail? Can you think of a threshold you might use?
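One possible approach, sketched below, is to pick a cutoff and label any prediction at or above it as a pass. The cutoff of 0.5 is an assumption chosen because it sits halfway between the two labels, not a value given by the lesson. The coefficients b_0 and b_1 are hypothetical and are not the ones fitted to the course data.

```python
# Hypothetical line coefficients, chosen only for illustration.
b_0 = -0.2   # intercept
b_1 = 0.06   # change in the predicted value per hour studied


def predict(num_hours):
    """Value of the line y = b_0 + b_1 * x_1 at the given hours."""
    return b_0 + b_1 * num_hours


def classify(prediction, threshold=0.5):
    """Map a numerical prediction to 1 (pass) or 0 (fail) using a cutoff."""
    return 1 if prediction >= threshold else 0


for h in (0, 10, 30):
    value = predict(h)
    label = "pass" if classify(value) == 1 else "fail"
    print(f"{h:>2} hours -> {value:.2f} -> {label}")
```

With these assumed coefficients, students who study 0 or 10 hours fall below the cutoff and are labeled as failing, while a student who studies 30 hours is labeled as passing. Note that the thresholded label hides how far outside the 0 to 1 range the raw prediction may be.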