An Introduction to Regression

Feb 01, 2019

Machine learning is the field of computer science that gives computer systems the ability to learn from data — and it’s one of the hottest topics in the industry right now.

We will now study the topic of linear regression. We will seek to understand what exactly is meant by regression, what kind of problem it is meant to solve, and how exactly regression is applied. Previously in the learning path, we covered that regression is a technique used to make a prediction of a continuous value.

So given a set of input features, for example, the features describing a particular house, a value which can be predicted is the cost of the house. To understand exactly how this can work, let us consider an example where we have a single feature, which is the amount of exercise performed by some individuals, and a single output value which needs to be predicted, which is the weight of those individuals.

[Video description begins] The title Two-Dimensional data appears and disappears. A graph is plotted on the screen. The X-axis denotes the amount of Exercise, whereas the Y-axis shows the Weight. There are various dots plotted throughout the graph. Above the graph is the text, Expresses relationships, more insightful. [Video description ends]

In order to understand the relationship between exercise and weight, it is useful to plot the data in a two-dimensional plane, just as we have over here. And a quick glance confirms that there does seem to be some kind of relationship between the two. So how exactly can we express this relationship? Well, we could draw some kind of curve, for example, this squiggly one over here, which will go through many of the points and will be very close to the others in the plot.

[Video description begins] A random line with many curves is drawn across the graph. It connects some of the points on the graph. [Video description ends]

On the other hand, the relationship could be modeled using a simple straight line. So for example, there could be these two different straight lines which go through many of the points. And we could consider either of them to be representative of the relationship between exercise and weight.

[Video description begins] Two intersecting straight lines appear on the screen. [Video description ends]

When any relationship is modeled using a straight line or a plane, rather than some kind of squiggly or complex curve, we say that the relationship between the variables is linear. We have already seen that there can be multiple curves or lines which can represent the relationship between two variables. But which of these happens to be the best fit line or curve?

Well, simply put, it is that line or curve which happens to be closest to the points in your data. So once you have drawn a line or a curve, you can drop vertical lines from each of your points to that curve. And for each point, the length of the vertical line is the difference between the weight which will be predicted by the curve and the actual weight of the individual. As you can imagine, the curve which keeps these distances, taken together over all of the points, as small as possible is the one which can be considered to represent the best fit for your data.
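
To make the idea of dropping vertical lines concrete, here is a minimal sketch in Python. The exercise and weight numbers are made up purely for illustration, and the candidate line (intercept 94, slope -4) is an assumption chosen by eye, not a fitted result.

```python
# Hypothetical toy data: hours of exercise per week and weight in kg.
exercise = [1.0, 2.0, 3.0, 4.0, 5.0]
weight = [90.0, 86.0, 83.0, 77.0, 74.0]


def predict(x, a, b):
    """Weight predicted by the candidate line y = a + b*x."""
    return a + b * x


def vertical_distances(xs, ys, a, b):
    """Length of the vertical line from each point to the candidate line."""
    return [abs(y - predict(x, a, b)) for x, y in zip(xs, ys)]


# Candidate line picked by eye (an assumption for this sketch).
distances = vertical_distances(exercise, weight, a=94.0, b=-4.0)
print(distances)
```

Each entry in the list is the gap between one individual's actual weight and the weight the line would predict for that amount of exercise.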

[Video description begins] Another curved line appears after the previous lines disappear. This time the distance between the dots and the line is denoted by dotted lines. The text, Distances of the points from the curve should be minimized, appears above the graph. [Video description ends]

However, there is one caveat when using very complex curves to represent your data. Specifically, your curve may represent an overfitted model.

[Video description begins] The text above the graph disappears and new text appears. It reads, A complex, overfitted curve reduces predictive accuracy. A new dot appears away from the cluster near the x-axis, which has a longer dotted line connecting it to the curve. [Video description ends]

So if all of the green points over here represent your training data, and you come up with this complex curve to model the relationship between exercise and weight, there is a very high likelihood that your model is very specific to your training data, including its own biases. So when it is time to make predictions on the real data, such as this orange point over here, the model may not perform particularly well.

This is why it is often best to come up with a more general solution in order to model the relationships in your data. And this is exactly why a best fit straight line is often preferred over a complex curve. This is precisely what a linear regression accomplishes, where a best fit straight line is found to represent the relationship between your inputs and your outputs.
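
The overfitting caveat can be sketched with a small Python experiment. Everything here is an assumption for illustration: the data is made up, and Lagrange interpolation is used only as a stand-in for the squiggly curve, since the video does not say how its curve was produced. The interpolating curve passes exactly through every training point, while the straight line is a least-squares fit.

```python
# Hypothetical training data (the "green points") and one new point
# (the "orange point"); all values are invented for this sketch.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [90.0, 84.0, 85.0, 76.0, 75.0]
new_x, new_y = 6.0, 71.0


def lagrange_curve(xs, ys):
    """Return the polynomial that passes through every (x, y) pair."""
    def curve(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return curve


def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


curve = lagrange_curve(train_x, train_y)
a, b = least_squares_line(train_x, train_y)

# Errors on the training data: the curve is perfect, the line is not.
curve_train_errors = [abs(curve(x) - y) for x, y in zip(train_x, train_y)]
line_train_errors = [abs(a + b * x - y) for x, y in zip(train_x, train_y)]

# Errors on the new point: here the simple line is far closer.
curve_new_error = abs(curve(new_x) - new_y)
line_new_error = abs(a + b * new_x - new_y)
print(max(curve_train_errors), max(line_train_errors))
print(curve_new_error, line_new_error)
```

On this toy data the curve has no error at all on the training points, yet it misses the new point by a wide margin, while the straight line stays close. How large the gap is on real data depends on the data itself.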

[Video description begins] The curve and the dotted lines disappear, and the original graph is restored. This time a straight line is drawn across the cluster of dots, somewhere through the middle of the cluster. [Video description ends]

So just to summarize, when we need to create a machine learning model to predict a continuous value, such as the weight of an individual, using a collection of input features, then the solution is to perform a linear regression. That is, you find the best fit straight line or plane which is able to model the relationships between the inputs and the outputs. But how exactly do we determine this best fit straight line?

[Video description begins] Another graph displays. Many small dots are plotted on the graph. These are a cluster of dots in the middle of the graph, away from both the axes. Two straight lines are also plotted on the graph. The first line is a diagonal through the cluster of dots. This line declines in trajectory along the X-axis. This line is connected to the Y-axis through a dotted line continuing straight from the end of this line. Next to this line, is an equation that defines this line, Line 1: y = A1 + B1x. The second line is a parallel horizontal line closer to the X-axis, away from the cluster of dots. Next to this line, is an equation that defines this line, Line 2: y = A2 + B2x. [Video description ends]

Well, there is a mathematical way to solve this problem. And for that, consider two different straight lines. The first is represented by the equation y = A1 + B1x, which is the straight line you see on the top over here. The second is represented by the equation y = A2 + B2x, which is the horizontal one at the bottom. Intuitively, you will know that it is the first straight line which represents a better fit than the second one.

But how exactly can we quantify this? Well, as we touched upon previously, we will drop these vertical lines from each of the data points to each of the two lines. And for each point in the dataset, this distance will represent the error for that point. We will then square each of the errors, so that points above the line and points below it both contribute a positive amount.

[Video description begins] Vertical dotted lines drop from the dots. These dotted lines intersect the diagonal line first and then the horizontal line in the bottom. [Video description ends]

And then for each line, we will sum up the squares of the errors. The best fit regression line is the one for which the sum of the squares of the errors is the smallest. So once we have done that, we have quantified the fact that line number 1 is a better fit for our data than line 2.
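
The comparison of the two lines can be sketched in Python as follows. The data points and the coefficients of both candidate lines are hypothetical, chosen only so that line 1 slopes through the points and line 2 is a horizontal line well below them. The closed-form formula in `least_squares_line` is the standard least-squares solution for a single input feature; the video itself does not show how the minimum is found.

```python
# Hypothetical toy data: hours of exercise per week and weight in kg.
exercise = [1.0, 2.0, 3.0, 4.0, 5.0]
weight = [90.0, 86.0, 83.0, 77.0, 74.0]


def sum_squared_errors(xs, ys, a, b):
    """Sum of squared vertical errors for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Return (a, b) that minimizes the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Line 1: slopes down through the points. Line 2: horizontal, far below.
sse_line_1 = sum_squared_errors(exercise, weight, a=94.0, b=-4.0)
sse_line_2 = sum_squared_errors(exercise, weight, a=60.0, b=0.0)

# The least-squares line should do at least as well as any candidate.
a_best, b_best = least_squares_line(exercise, weight)
sse_best = sum_squared_errors(exercise, weight, a_best, b_best)
print(sse_line_1, sse_line_2, sse_best)
```

On this toy data, line 1 has a far smaller sum of squared errors than line 2, and the least-squares line is slightly better still.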

[Video description begins] The graph now shows dotted lines starting from each dot and intersecting the diagonal line showing the variance. [Video description ends]

So the sum of the squares of the errors is one way to quantify the quality of a regression line. And similarly, the mean of the square of the errors, where you divide the sum of the squares by the number of data points, is another metric. There is one more evaluation metric for regression lines. And this is something which, once again, measures the quality of the fit of the data.
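
As a small sketch of the mean of the squared errors in Python, again using made-up data and a hypothetical candidate line:

```python
# Hypothetical toy data and candidate line y = a + b*x.
exercise = [1.0, 2.0, 3.0, 4.0, 5.0]
weight = [90.0, 86.0, 83.0, 77.0, 74.0]
a, b = 94.0, -4.0

# Square each vertical error, sum them, then divide by the point count.
squared_errors = [(y - (a + b * x)) ** 2 for x, y in zip(exercise, weight)]
sum_of_squares = sum(squared_errors)
mean_squared_error = sum_of_squares / len(squared_errors)
print(sum_of_squares, mean_squared_error)
```

Because the mean divides by the number of data points, it allows datasets of different sizes to be compared on the same footing.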

And this is known as the R-square metric. The higher the quality of the fit, the higher the value of R-square for the data. So considering these data points, if we have a straight line which represents the relationship, and all of the points are very close to this regression line, the value of R-square will be very high.

[Video description begins] The Weight-Exercise graph displays. This time, the dots are plotted very close to each other and the line touches most of the dots. The line arises from the Y-axis and diagonally moves towards the X-axis. This graph depicts high R2 values. [Video description ends]

And one thing to note about R-square values is that, for a best fit regression line, they range from a minimum of 0 to a maximum of 1. You can consider the R-square value to represent how well your regression line captures the variance in the underlying data. So while this particular straight line captures most of the variance in the data, a line such as this one, which happens to be very far away from most of the points, does not capture a lot of the variance, and will thus have a low R-square number.
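
One common way to compute R-square is one minus the ratio of the sum of squared errors to the total squared variation of the outputs around their mean. The sketch below assumes that definition, uses made-up data, and takes the line coefficients as given rather than fitting them. Under this formula, a line that is worse than simply predicting the mean weight would produce a negative value, so the 0 to 1 range applies to a least-squares line evaluated on the data it was fitted to.

```python
# Hypothetical toy data: hours of exercise per week and weight in kg.
exercise = [1.0, 2.0, 3.0, 4.0, 5.0]
weight = [90.0, 86.0, 83.0, 77.0, 74.0]


def r_squared(xs, ys, a, b):
    """R-square of the line y = a + b*x on the given points."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# A line that hugs the points (the least-squares line for this data).
r2_close = r_squared(exercise, weight, a=94.3, b=-4.1)

# A horizontal line at the mean weight captures none of the variance.
r2_flat = r_squared(exercise, weight, a=82.0, b=0.0)
print(r2_close, r2_flat)
```

The line that hugs the points scores close to 1, while the flat line at the mean scores 0.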

[Video description begins] A new Weight-Exercise graph displays. This time, the dots are scattered and the line barely touches any of the dots. This graph depicts low R2 values. [Video description ends]

So once we have gone through a training phase in order to find the best fit linear regression line, we are then ready to use it in order to make predictions on real data. So if you have come up with a straight line,

[Video description begins] Screen title: Regression Used for Prediction. The original Weight-Exercise graph displays with a diagonal line originating from the Y-axis and intersecting as many dots as possible. From the middle of the Regression line, perpendiculars are drawn from a point on the line to either axis. The point on the Weight axis has the following question next to it: What will be the corresponding weight? The point on the X-axis has the following phrase next to it: For some measure of exercise. [Video description ends]

which happens to be the model representing the relationship between the amount of exercise performed by individuals and their weight. If we need to predict the weight of an individual, and we're only given the amount of exercise they get, we can simply plot the amount of exercise on the x-axis. And this is represented by the lower orange dot. We will then extend a vertical line from this point on the x-axis, over to our regression line.

And then we extend a horizontal line all the way to the y-axis in order to read off the predicted weight of that individual. Since any regression line can be represented by the equation y = a + bx, given an input x, we can calculate the prediction y using that equation.
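
As a final sketch, here is that prediction step in Python. The coefficients below are hypothetical values standing in for whatever the training phase produced, and the function name `predict_weight` is just an illustrative choice.

```python
def predict_weight(exercise_amount, a, b):
    """Predicted weight from the regression line y = a + b*x."""
    return a + b * exercise_amount


# Hypothetical coefficients from an earlier training phase.
a, b = 94.3, -4.1

# Predict the weight of an individual given only their amount of exercise.
predicted = predict_weight(2.5, a, b)
print(predicted)
```

This is the same operation as going up from the x-axis to the regression line and then across to the y-axis, carried out with arithmetic instead of a drawing.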