In the last exercise, we tried to eyeball what the best-fit line might look like. To actually choose a line, we need a criterion for what “best” means.
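Depending on our ultimate goals and data, we might choose different criteria; however, a common choice for linear regression is ordinary least squares (OLS). In simple OLS regression, we assume that the relationship between two variables x and y can be modeled as:

$$y = mx + b + \text{error}$$

Here m is the slope of the line, b is the intercept, and the error is the vertical distance between a data point and the line.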
We define “best” as the line that minimizes the total squared error for all data points. This total squared error is called the loss function in machine learning. For example, consider the following plot:
In this plot, we see two points on either side of a line. One of the points is one unit below the line (labeled -1). The other point is three units above the line (labeled 3). The total squared error (loss) is:

$$(-1)^2 + 3^2 = 1 + 9 = 10$$
Notice that we square each individual distance so that points below and above the line contribute equally to loss (when we square a negative number, the result is positive). To find the best-fit line, we need to find the slope and intercept of the line that minimizes loss.
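To make the loss calculation concrete, here is a minimal Python sketch. The function name `total_squared_error`, the line y = 2x + 1, and the two points are all made up for illustration; they are chosen so that one point sits one unit below the line and the other sits three units above it, as in the example above.

```python
def total_squared_error(xs, ys, slope, intercept):
    """Sum of squared vertical distances between each point and the line."""
    loss = 0.0
    for x, y in zip(xs, ys):
        predicted = slope * x + intercept
        residual = y - predicted  # negative when the point is below the line
        loss += residual ** 2     # squaring makes every contribution non-negative
    return loss


# Hypothetical line y = 2x + 1 and two made-up points:
# (1, 2) is one unit below the line, (2, 8) is three units above it.
xs = [1, 2]
ys = [2, 8]
print(total_squared_error(xs, ys, slope=2, intercept=1))  # 10.0
```

Changing the slope or intercept changes each residual, and therefore the loss. This is the quantity we are trying to make as small as possible.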
The interactive visualization in the browser lets you try to find the line of best fit for a random set of data points:
- The slider on the left controls the
- The slider on the right controls the
You can see the total loss on the right side of the visualization. To get the line of best fit, we want this loss to be as small as possible.
To check if you got the best line, check the “Plot Best-Fit” box.
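Outside the applet, we can run a similar check ourselves. The sketch below uses the standard closed-form formulas for simple OLS (slope = sum of (x - x̄)(y - ȳ) divided by sum of (x - x̄)², intercept = ȳ - slope · x̄). We are not claiming this is how the applet computes its line. The data, the guessed line, and the function names `fit_ols` and `loss` are assumptions made for illustration.

```python
def fit_ols(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form slope: co-variation of x and y divided by variation of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def loss(xs, ys, slope, intercept):
    """Total squared error of the line y = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Made-up data and a made-up guess for the line
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
guess_slope, guess_intercept = 1.0, 1.0

best_slope, best_intercept = fit_ols(xs, ys)
guess_loss = loss(xs, ys, guess_slope, guess_intercept)
best_loss = loss(xs, ys, best_slope, best_intercept)

print(f"guess: loss = {guess_loss:.2f}")
print(f"best:  slope = {best_slope:.2f}, intercept = {best_intercept:.2f}, loss = {best_loss:.2f}")
```

If the guessed line has a loss close to the best-fit loss, the guess was a good one. No line can have a lower loss than the OLS line on the same points.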
To randomize a new set of points, enter the number of points you want (try 8!) in the textbox and press Randomize Points. Then try to fit a new line to them.
Play around with the applet and see if you can find a best-fit line for the points you have randomized.
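If you would like to mimic this experiment in code, the sketch below generates 8 random points and then tries many slope and intercept "slider positions", keeping the pair with the lowest loss. It is only a rough stand-in for the applet: the coordinate range, the slider ranges, the step size of 0.1, and the random seed are all assumptions.

```python
import random


def loss(points, slope, intercept):
    """Total squared error of the line y = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


random.seed(0)  # fixed seed so the run is repeatable

# 8 random points; the 0-10 coordinate range is an arbitrary choice
points = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(8)]

# Hypothetical slider positions in steps of 0.1
slopes = [i / 10 for i in range(-50, 51)]        # -5.0 to 5.0
intercepts = [i / 10 for i in range(-100, 201)]  # -10.0 to 20.0

# Try every combination and keep the one with the smallest loss
best_loss, best_m, best_b = min(
    (loss(points, m, b), m, b) for m in slopes for b in intercepts
)

print(f"slope = {best_m}, intercept = {best_b}, loss = {best_loss:.2f}")
```

Because the sliders only move in steps, this search gets close to the best-fit line but may not land on it exactly, which is similar to what happens when we adjust the sliders by hand.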