Learn how to create and interpret linear regression models in R.

Linear Regression is the workhorse of applied Data Science; it has long been the most commonly used method by scientists and can be applied to a wide variety of datasets and questions. Unlike many more recently developed methods in Machine Learning, Linear Regression models can be used both to predict new data points and to help us understand the relative impact one variable has on another. For example, one well-designed regression model can answer:
- How, and to what extent, does advertising in print media affect the total sales of a product?
- What are the predicted total sales for a product, given the amount spent on print advertisement this month?

In this lesson, you will learn how to harness the malleability and explanatory power of Linear Regression models by following the four primary steps of statistical model building: **confirming data assumptions**, **building a model on training data**, **assessing model fit**, and **analyzing model results**. Using two real-life datasets, `conversion` and `advertising`, this lesson will also focus on the application of regression modeling for use in marketing data science. Let's get started!

Introduction to Linear Regression in R

While linear regression is perhaps the most widely applied method in Data Science, it relies on a strict set of assumptions about the relationship between predictor and outcome variables. The most obvious (but crucial!) assumption is a **linear relationship** between the predictor and outcome. Following from this assumption is one key observation about any variables we want to include in our model, which must be tested before building a model:

**The expected value of the outcome variable is a straight-line function of exclusively the predictor variable**. The best test for this relationship is quite straightforward: we can visualize the relationship between the predictor and outcome variables as a scatterplot. A linear relationship will resemble a straight line with a slope not equal to zero, like the relationship between spending on TV ads and the overall sales volume of the related product found in our `advertising` dataset.

![A linear relationship between two features](https://content.codecademy.com/programs/analyze-data-with-r/linear-regression/exercise2.jpeg)

We can also quantitatively test for a linear relationship by computing the **correlation coefficient**. The correlation coefficient is always between positive one and negative one. A coefficient close to `0` (roughly between `-0.20` and `0.20`) suggests a weak linear relationship between two variables. A coefficient closer to positive or negative one suggests a stronger linear relationship. In R, we can compute the correlation coefficient using the `cor.test()` function as follows:
```R
coefficient <- cor.test(advertising$TV, advertising$sales)
coefficient$estimate
# Output:
# 0.837
```
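As a minimal sketch of what `cor.test()` returns, the snippet below builds a small simulated data frame (`ads_sim` and its columns are hypothetical stand-ins, not the lesson's `advertising` data) and checks that the reported estimate is the same Pearson correlation that `cor()` computes.

```r
# Simulated stand-in data; `ads_sim` is hypothetical, not the lesson's dataset
set.seed(1)
tv <- runif(100, min = 0, max = 300)
ads_sim <- data.frame(tv = tv, sales = 7 + 0.05 * tv + rnorm(100, sd = 2))

# cor.test() returns the estimate along with a p-value and a confidence interval
result <- cor.test(ads_sim$tv, ads_sim$sales)
r <- unname(result$estimate)  # drop the "cor" label to get a plain number

# The estimate matches the Pearson correlation from cor()
all.equal(r, cor(ads_sim$tv, ads_sim$sales))
```

`result$p.value` and `result$conf.int` hold the rest of the test output if you need more than the coefficient itself.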


Assumptions of Simple Linear Regression

Our next step is to check for outlier data points. Linear regression models also assume that there are no extreme values in the data set that are not representative of the actual relationship between predictor and outcome variables. A box-and-whisker plot is a common method used to quickly determine whether a data set contains any outliers, or data points that differ significantly from other observations in a dataset. An outlier may be caused by variability in measurement, or it might be a sign of an error in the collection of data. 

Regardless, `ggplot2`'s `geom_boxplot()` function allows for the easy creation of box-and-whisker plots. To plot the distribution of a single variable, like `advertising$sales` (the total number of sales for a product in a month), we pass in the same variable as both *x* and *y* in our call to `aes()`:

```r
library(dplyr)
library(ggplot2)
plot <- advertising %>%
  ggplot(aes(sales, sales)) +
  geom_boxplot()
```
![A boxplot showing outliers in Sales data](https://content.codecademy.com/programs/analyze-data-with-r/linear-regression/Exercise3.jpeg)

In this case, it looks like there are a handful of negative `sales` values in the dataset. This is not what we would expect given our understanding of the data; how could an entire market have negative sales? This seems like an error stemming from the collection of this data into a spreadsheet format. We will filter out these negative data points using the `filter()` function from `dplyr`. We pass a logical condition into `filter()`, and it keeps only the rows for which the condition evaluates to `TRUE`.

```r
advertising <- advertising %>% filter(sales > 0)
```
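The same idea can be sketched in base R. The example below uses made-up values (the data frame `sales_sim` is hypothetical), lists the points a box plot would draw beyond its whiskers with `boxplot.stats()`, and then keeps only the rows that satisfy the logical condition.

```r
# Hypothetical sales values; the two negative numbers mimic data-entry errors
sales_sim <- data.frame(
  sales = c(12.1, 9.8, 15.3, 11.2, 13.7, 10.4, 12.9, 11.8, 14.2, 10.9, -4.0, -2.5)
)

# boxplot.stats() reports the points that fall beyond the whiskers
flagged <- boxplot.stats(sales_sim$sales)$out

# Base R equivalent of filter(sales > 0): keep rows where the condition is TRUE
sales_clean <- sales_sim[sales_sim$sales > 0, , drop = FALSE]
nrow(sales_clean)
```

Flagging a point is only the first step; whether to drop it still depends on what we know about how the data was collected.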

Assumptions of Linear Regression (Outliers)

*Simple* linear regression is not a misnomer–– it is an uncomplicated technique for predicting a continuous outcome variable, *Y*, on the basis of just one predictor variable, *X*. As detailed in previous exercises, a number of assumptions are made so that we can model the relationship between *X* and *Y* as a linear function. Using our `advertising` dataset, we could model the relationship between the amount spent on podcast advertising in a month and the number of respective products eventually sold as follows:
```tex
Y = beta_0 + beta_1*X + error
```
Where…

**Y**: represents the dollar value of products sold

**X**: represents the amount spent on respective product podcast ads

**Beta_0**: is the intercept, or the number of products sold when no money has been spent on podcasts

**Beta_1**: is the coefficient, or the slope, of the line representing the relationship

**Error**: represents the random variation in the relationship between the two variables

To build this model in R, we use the `lm()` function from the built-in `stats` package with the formula notation **Y ~ X**:

```r
model <- lm(sales ~ podcast, data = train)
```
But wait! Before building this model, we need to [split our data into test and training sets](https://www.codecademy.com/articles/training-set-vs-validation-set-vs-test-set). For the development of this simple model, we'll use a standard 60/40 split of our data, where 60% is used to train the model and 40% is used to test the model's accuracy and generalizability. We can randomly assign data points to the training or test set using base R's `sample()` function and logical indexing:

```r
# randomly label each row TRUE (train, roughly 60%) or FALSE (test, roughly 40%)
in_train <- sample(c(TRUE, FALSE), nrow(advertising), replace = TRUE, prob = c(0.6, 0.4))
# subset data points into train and test sets
train <- advertising[in_train, ]
test <- advertising[!in_train, ]
```
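Drawing `TRUE`/`FALSE` labels with probabilities gives a split that is only approximately 60/40. If an exact split is preferred, one alternative is to sample row indices without replacement. The sketch below assumes a simulated data frame (`ads_sim`, `train_sim`, `test_sim`, and `model_sim` are hypothetical names, not part of the lesson).

```r
# Hypothetical data frame standing in for `advertising`
set.seed(123)
ads_sim <- data.frame(podcast = runif(200, 0, 50))
ads_sim$sales <- 9 + 0.2 * ads_sim$podcast + rnorm(200, sd = 3)

# Draw exactly 60% of the row indices for training, without replacement
n_train <- floor(0.6 * nrow(ads_sim))
train_rows <- sample(seq_len(nrow(ads_sim)), size = n_train)

train_sim <- ads_sim[train_rows, ]
test_sim <- ads_sim[-train_rows, ]

# Fit the simple model on the training rows only
model_sim <- lm(sales ~ podcast, data = train_sim)
coef(model_sim)
```

Calling `set.seed()` first makes the random split reproducible, which helps when comparing models later.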



Building a Simple Linear Model

Once we have an understanding of the kind of relationship our model describes, we want to understand the extent to which this modeled relationship actually fits the data. This is typically referred to as the goodness-of-fit. In simple linear models, we can measure this quantitatively by assessing two things:
1. Residual standard error (RSE)
2. R squared (R^2)

The RSE is an **estimate of the standard deviation of the error** of the model (*error* in our mathematical definition of linear regression). Roughly speaking, it is the average amount that the response will deviate from the true regression line. We get the RSE at the bottom of `summary(model)`; we can also get it directly with `sigma()`:

```r
sigma(model)

#output
3.2
```

An RSE value of 3.2 means the actual sales in each market will deviate from the true regression line by approximately 3,200 units, on average (assuming `sales` is recorded in thousands of units). Is this too large of a deviation? Well, that's subjective, but when compared to the average value of sales over all markets, the percentage error is 22%:

```r
sigma(model)/mean(train$sales)

# output
[1] 0.2207373
```

**The RSE provides an absolute measure of lack of fit of our model to the data**. But since it is measured in the units of Y, it is not always clear what constitutes a good RSE. 
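To see where the RSE comes from, the sketch below recomputes it by hand on simulated data (the vectors `x` and `y` are made up for illustration): it is the square root of the residual sum of squares divided by the residual degrees of freedom.

```r
# Simulated data; not the lesson's dataset
set.seed(2)
x <- runif(80, 0, 50)
y <- 9 + 0.2 * x + rnorm(80, sd = 3)
fit <- lm(y ~ x)

# RSE = sqrt(residual sum of squares / residual degrees of freedom)
rss <- sum(residuals(fit)^2)
rse_by_hand <- sqrt(rss / df.residual(fit))

# Matches the value reported by sigma() and by summary()
all.equal(rse_by_hand, sigma(fit))
```

With one predictor and an intercept, the residual degrees of freedom are the number of observations minus two.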

The R^2 statistic provides an alternative measure of fit. It represents the **proportion of variance explained**, so it always takes on a value between 0 and 1, and is independent of the scale of Y, our outcome variable. Similar to RSE, the R^2 can be found at the bottom of `summary(model)` but we can also extract it directly by calling `summary(model)$r.squared`. The result below suggests that podcast advertising budget can explain 64% of the variability in the total `sales` value.

```r 
summary(model)$r.squared

# output
[1] 0.6372581
```
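"Proportion of variance explained" can be made concrete with a short calculation. This sketch, again on simulated data with hypothetical names, computes R-squared as one minus the ratio of unexplained to total variation and compares it with the value from `summary()`.

```r
# Simulated data; not the lesson's dataset
set.seed(3)
x <- runif(80, 0, 50)
y <- 9 + 0.2 * x + rnorm(80, sd = 3)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)   # variation left unexplained by the model
tss <- sum((y - mean(y))^2)    # total variation in the outcome
r2_by_hand <- 1 - rss / tss

all.equal(r2_by_hand, summary(fit)$r.squared)
# In simple linear regression, R-squared is also the squared correlation
all.equal(r2_by_hand, cor(x, y)^2)
```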


Quantifying Model Fit

Great! We can build a model! But... how do we know if it's any good? Also, if another data scientist builds a different model using a different independent variable, how can we tell which model is "best"? Even within Statistics, "best" can be a subjective qualifier. However, scientists who use regression models generally agree that the best model is the one that minimizes the distance between a data point and the estimation line drawn by a model. **The vertical distance between a datapoint and the line estimated by a regression model is called a residual**; residuals and their aggregations are the fundamental units of measures of regression model fit and accuracy.

Because residuals are based on Cartesian distances, it often helps to visualize their values. For instance, consider the plot of a simple linear regression alongside its training data below. Note one point is 4 units above the regression estimate line; in this example, the residual for that point is 4. Meanwhile, another point is 2 units below the regression estimate line; the residual for that point is -2. A data point is best fit by the model that produces the smallest absolute residual for that point.

![residual graph](https://content.codecademy.com/programs/analyze-data-with-r/linear-regression/residualArrow.svg)

When scientists make quantitative arguments for a best fit model, they rely on an aggregation, often the sum or average, of residual values across an entire dataset. While it is easy to be overwhelmed by the variety of measures used to argue that one model is better than the other, it is crucial to realize that all measures are grounded in the simple difference between regression estimate and observed data point. Let's produce a visualization of our own model of `total_convert` on `clicks` to better understand our model residuals.
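As a small illustration of the definition, the sketch below fits a line to five made-up points and computes each residual as the observed value minus the fitted value, which is the same quantity `residuals()` returns.

```r
# Tiny made-up dataset for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)

# Residual = observed value minus the value on the fitted line
res_by_hand <- y - fitted(fit)
all.equal(unname(res_by_hand), unname(residuals(fit)))

# A common aggregation: the sum of squared residuals
sum(res_by_hand^2)
```

Positive residuals belong to points above the line and negative residuals to points below it; with an intercept in the model, they sum to zero.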

Checking Model Residuals

In addition to the quantitative measures that characterize  our model accuracy, it is always a best practice to produce visual summaries to assess our model. First, we should always visualize our model within our data. For simple linear regression this is quite simple; we can use `geom_point()` to plot our observed values, and `geom_smooth(method = "lm")` to plot our model. In addition, we can include a second call to `geom_smooth()`, with parameters `(se = FALSE, color = "red")`. This combination of function calls allows us to compare the linearity of our model, visualized below as the blue line with the 95% confidence interval covering the shaded region, in comparison to a  non-linear LOESS smoother visualized in red. 

```r
ggplot(train, aes(podcast, sales)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_smooth(se = FALSE, color = "red")
```
![A linear regression model with a LOESS smoother](https://content.codecademy.com/programs/analyze-data-with-r/linear-regression/Exercise7.jpeg)

LOESS smoothers plot a line based on the weighted value of data points; the line produced by a LOESS smoother is similar to taking a moving average of data points as our x-axis variable increases. The smoother should not be used to predict new values, as it [relies heavily on our training data](https://www.codecademy.com/articles/the-dangers-of-overfitting), but it is a helpful tool for visualizing where our linear model diverges from our training data.

Considering the LOESS smoother remains within the confidence interval of our model, we can assume the linear trend fits the essence of this relationship. However, we should note that as the podcast advertising budget gets closer to 0 there is a stronger reduction in sales beyond what the linear trend follows; this means that our model might be less accurate in instances where the podcast budget is very low.


Visualizing Model Fit

Ready for the real fun? We've done our due diligence and confirmed that our data fulfills the assumptions of simple linear regression models; we've split our data into test and training subsets, and properly built a model using `y ~ x` formula notation; we've even taken the time to assess the fit of our model using both quantitative and qualitative approaches. Now we can finally analyze the results of our model and discover the relationship between user advertisement clicks and the purchase rate of related products!

We can view the results of a linear regression model in R by calling `summary()` on the `model` variable to which we saved the results of our call to `lm()`. The `summary()` function will print out *a lot* of information about our model–– but don't be overwhelmed! There are four primary sections of quantitative results that are crucial to interpreting regression models:

**Call** 

This section simply displays the call to `lm()` that created these model results. It's a helpful reminder of which outcome-predictor pair is currently under analysis.

**Residuals**

As covered in our earlier exercises, a **residual** is the difference between the actual observed value of an outcome variable and the value predicted by the model. The `summary()` output displays a set of numbers that summarize the distribution of residuals in our model, including the minimum and maximum residual values, the first and third quartiles, and the median residual value for the model. We've already analyzed our residual values by creating a plot in an earlier exercise, but these summary values are a helpful reminder of the overall spread of our model errors.

**Coefficients**

*Estimate*

Coefficients are the most important results in the interpretation of regression models. The number you see in the `Estimate` column (a value of `0.048939` for `clicks`) is called a **regression coefficient**. Looking back to the formal definition of a linear regression model:

```tex
Y = beta_0 + beta_1*X + error
```
The regression coefficient is represented by the `beta_1` variable. This linear regression equation tells us that the regression coefficient represents the expected change in the dependent variable (in our case `total_convert`) for a one-unit increase in the independent variable (`clicks`). In other words, for every additional click on an advertisement, the expected sales of a related product are estimated to increase by `0.049` dollars. In addition to the size of the coefficient, it is also important to note the sign of the coefficient. If our `clicks` coefficient were negative, our model would be estimating that the sales of a product actually *decrease* every time an advertisement is clicked.

*Std. Error*

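The column adjacent to `Estimate` is called `Std. Error`; the **standard error** of each coefficient is an estimate of the standard deviation of that coefficient's sampling distribution, that is, of how much the estimate would vary from sample to sample. The standard error is rarely a quantity of interest by itself; it is most informative relative to the size of the regression coefficient, which is exactly the comparison the `t value` column makes (the estimate divided by its standard error).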

*T-value and Pr(>|t|)*

The `t value` and `Pr(>|t|)` columns inherently answer the same question: given the value of our variable's regression coefficient and its standard error, does the variable explain a significant part of the change in our outcome variable? However, the `Pr(>|t|)` column purposely provides a more concise response to this question, using the asterisk notation that corresponds with the `Signif. codes` legend at the bottom of the Coefficients results section. In R model output, one asterisk means "p < .05", two asterisks mean "p < .01", and three asterisks mean "p < .001". These values are referred to as **p-values** in scientific literature. How can we use p-values to answer our question around model significance?

Asterisks in a regression table indicate the level of the **statistical significance** of a regression coefficient. Our understanding of statistical significance is based on the idea of a random sample. When interpreting these asterisk values, we ask ourselves: if there truly were no relationship between clicks on an advertisement and product sales, what are the chances that a random sample of user clicks would show a relationship at least as strong as the one we observed?

For our `clicks` variable, with `***` in the `Pr(>|t|)` column, the answer is *very* unlikely. The value of `***`, or `p < .001`, means that a random sample producing a regression coefficient at least as far from zero as the one we observed for `clicks`, relative to its standard error, would occur fewer than one time in 1,000 random draws, on average, if there were truly no relationship between `clicks` and product purchase. Because a result like ours would so rarely arise if there were no relationship between `clicks` and `total_convert`, we can say that there is a statistically significant relationship between the two variables. Generally speaking, **scientists accept that a variable coefficient with a p-value less than 0.05 is statistically significant**.
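The link between `Estimate`, `Std. Error`, `t value`, and `Pr(>|t|)` can be checked directly. The sketch below uses simulated click data (the vectors `clicks` and `converts` are hypothetical, not the lesson's `conversion` dataset) and rebuilds the last two columns from the first two.

```r
# Simulated data; `clicks` and `converts` are hypothetical stand-ins
set.seed(4)
clicks <- rpois(150, lambda = 40)
converts <- 1 + 0.05 * clicks + rnorm(150, sd = 0.5)
fit <- lm(converts ~ clicks)

coefs <- summary(fit)$coefficients
estimate <- coefs["clicks", "Estimate"]
std_error <- coefs["clicks", "Std. Error"]

# t value = estimate divided by its standard error
t_by_hand <- estimate / std_error
# two-sided p-value from the t distribution with the residual degrees of freedom
p_by_hand <- 2 * pt(-abs(t_by_hand), df = df.residual(fit))

all.equal(t_by_hand, coefs["clicks", "t value"])
all.equal(p_by_hand, coefs["clicks", "Pr(>|t|)"])
```

The asterisks in the printed summary are just a shorthand for which threshold in the `Signif. codes` legend this p-value falls below.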

**Measures of Model Fit**

At the bottom of the output of `summary()` are a series of labeled metrics, like Residual Standard Error (RSE) and Multiple R-squared, which quantify the fit of our model. Our previous exercises have covered how to interpret and plot many of these measures, but it's a helpful reminder for them to be summarized along with other model output.

Wow! There is so much information conveyed within a simple call to our model results. Continue on to the second half of this lesson to practice the interpretation of model results, and learn more about how to select a best fit model.

Reading Model Results

Let's practice our model interpretation skills! We know that for continuous independent variables, like `podcast`, the regression coefficient represents the difference in the predicted value of `sales` for each one-dollar increase in `podcast`. Given the output of calling `summary(model)` below, we can correctly say that for every one-dollar increase in podcast advertisement spending, total sales of the related product increase by 1.742 dollars.

```r
summary(model)

#output 
Call:
lm(formula = sales ~ podcast, data = train)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.57927    0.91176  10.506  < 2e-16 ***
podcast      1.74240    0.03255   5.353 3.71e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

We could also extract the value of the `podcast` coefficient, the second coefficient returned by our model, by indexing into the model's named vector of coefficients as follows:

```r
podcast_coefficient <- model$coefficients[2]
```
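Indexing by position works, but it depends on the order of terms in the formula. A slightly more robust sketch, shown here on simulated data with hypothetical names (`train_sim`, `model_sim`), uses `coef()` and the coefficient's name instead.

```r
# Simulated training data; `train_sim` is a hypothetical stand-in for `train`
set.seed(5)
train_sim <- data.frame(podcast = runif(60, 0, 50))
train_sim$sales <- 9 + 0.2 * train_sim$podcast + rnorm(60, sd = 3)
model_sim <- lm(sales ~ podcast, data = train_sim)

# Indexing by name keeps working if predictors are added or reordered
by_name <- coef(model_sim)["podcast"]
by_position <- model_sim$coefficients[2]
identical(by_name, by_position)
```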


It is important to note that the interpretation of the intercept coefficient is slightly different from that of variable coefficients. The intercept coefficient represents the value we would predict for our outcome variable, `sales`, given that podcast spending is equal to zero. **It is crucial to remember that the intercept coefficient is only interpretable if we can reasonably expect a zero value for all independent variables in a model.** Assuming, just as our simple linear model does, that spending on podcasts is the only variable that explains changes in `sales`, it does not make sense for any sales to occur without podcast spending. Therefore, for this model, our intercept coefficient is not interpretable. 

However, in many cases, intercept coefficients are interpretable! As you've seen throughout this lesson, the analysis of any model results requires a thorough understanding of our data, the system that produces this data, and a critical approach to interpretation of coefficient values.  

Assessing Simple Linear Regression

Data Scientists are often interested in building models to make predictions on new data. While the `add_predictions()` function from the `modelr` package makes it easy to predict new values from a technical standpoint, it is far more difficult to develop and assess accurate predicted values.

The most common metric used to compute the accuracy of predicted values is mean squared error (MSE) on test data. Like the residual standard error (RSE) and R-squared, MSE is built from model residuals: it measures the average squared difference between predicted and observed values. When we are working with just one model, it is helpful to compare the difference between MSE on our training dataset and MSE on test data. We can calculate training MSE for our model using a combination of `add_predictions()` and `summarise()`. `add_predictions()` creates and adds predicted values from a model to a column called `pred`. `summarise()` then allows us to calculate the mean of the squared difference between our observed values (`sales`) and predicted values (`pred`).

```r
train %>% 
  add_predictions(model) %>%
  summarise(MSE = mean((sales - pred)^2))

#output

       MSE
  31.60713
```

We can use the same combination of functions to calculate MSE for our test dataset, which results in an MSE of around 32.5. Testing MSE will almost always be higher than training MSE, as the model has been built off of training data; however, it is important to confirm that there is not a substantial difference between model training and test MSE. The value of using MSE to quantify prediction accuracy is clearer when comparing multiple models, as it allows us to determine which version of a model best predicts an outcome variable. For instance, we could compute the MSE for a model of TV spending on sales.

```r
model2 <- lm(sales ~ TV, data = train)

train %>% 
  add_predictions(model2) %>%
  summarise(MSE = mean((sales - pred)^2))

#output
    MSE
    27.28415
```

Comparing the train MSE for our TV-based model, at 27.28, to our train MSE for a podcast-based model, at 31.61, the TV-based model fits the training data more closely, as its MSE is lower; computing test MSE for both models in the same way would confirm whether this advantage carries over to new data. If a data scientist were trying to predict the expected volume of sales for a future business quarter, it would likely be a better idea for them to base their estimations on a TV-based model.
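The train-versus-test comparison can also be sketched in base R with `predict()`. The example below assumes simulated data and a small hypothetical helper, `mse()`, that is not part of `modelr` or the lesson.

```r
# Simulated data; all object names here are hypothetical
set.seed(6)
ads_sim <- data.frame(podcast = runif(200, 0, 50))
ads_sim$sales <- 9 + 0.2 * ads_sim$podcast + rnorm(200, sd = 3)
in_train <- sample(c(TRUE, FALSE), nrow(ads_sim), replace = TRUE, prob = c(0.6, 0.4))
train_sim <- ads_sim[in_train, ]
test_sim <- ads_sim[!in_train, ]

model_sim <- lm(sales ~ podcast, data = train_sim)

# Hypothetical helper: mean squared difference between observed and predicted sales
mse <- function(model, data) {
  mean((data$sales - predict(model, newdata = data))^2)
}

train_mse <- mse(model_sim, train_sim)
test_mse <- mse(model_sim, test_sim)
c(train = train_mse, test = test_mse)
```

Running the same helper on a second model gives a like-for-like comparison on the held-out rows.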

Making Predictions

We've been able to really dig into the results of simple linear regression models and show how the results convey a substantial amount of information about the relationship between two variables. However, by this point you might be wondering: what if I think variables *other* than `podcast` contribute to the total sales of a product? You might remember that the primary assumption behind simple linear models is that the expected value of the outcome variable is a straight-line function of *exclusively* the predictor variable. This means that our simple linear models assume that *all* systematic variation in the outcome variable is explained by the predictor variable, with everything else left to the error term. In the case of our `advertising` dataset, we know this is almost certainly not true; oftentimes more money is spent on TV or newspaper ads than on podcasts, so this spending might have an even larger effect than podcast spend.

Thankfully, there are methods to include the effects of `TV` and `newspaper` in linear regression models. We can expand our model definition from a *simple* model of one predictor variable to a *multiple* model of, you guessed it, *multiple* predictor variables. The formal definition of multiple linear regression models is a direct extension of the formula for simple linear regression:

```tex
Y = beta_0 + beta_1*X_1 + beta_2*X_2 + error
```

As in a simple linear model, **Y** represents the dollar value of products sold and **Beta_0** is the model intercept. Now there is one **X** for each predictor: **X_1** represents the amount spent on podcast ads and **X_2** the amount spent on TV ads, while **Beta_1** and **Beta_2** are the coefficients of those predictor variables; a third predictor would add a `beta_3*X_3` term in the same way. To build this model in R, we still use the `lm()` function and formula notation, adding each predictor to the right-hand side with `+`, as in **Y ~ X_1 + X_2**:

```r
model <- lm(sales ~ podcast + TV, data = train)
```

While building a multiple regression model is a straightforward extension of the code used to build a simple model, and the output of the model results below looks quite similar, a bit more effort goes into the interpretation of the results of this model. Remember that in a simple linear regression model, the regression coefficient represents the expected change in the dependent variable for a one-unit increase in the independent variable. In other words, the coefficient for `podcast` represents the expected increase in `sales` given a one-dollar increase in podcast advertisement spend. Because multiple linear regression includes more than one predictor variable, the coefficient estimates must be interpreted differently. In multiple linear regression, the regression coefficient represents the expected change in the dependent variable for a one-unit increase in the independent variable, *holding all other variables in the model constant*. The output of the multiple linear regression model is shown below:

```r
summary(model)

#output
Call:
lm(formula = sales ~ TV + podcast, data = train)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.583386   1.024616   4.473 1.65e-05 ***
TV          3.006340   1.004924   7.380 1.62e-11 ***
podcast     1.049249   1.027665   5.395 3.10e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

For example, a call to `summary(model)` shows that the coefficient for `podcast` is equal to `1.049249`. This means that, when one more dollar is spent on podcast advertising, about `1.049` more dollars of the related product are sold, *given that there is no change in the amount of money spent on TV advertisements*. In this way, multiple linear regression models allow us to isolate the unique effect of one predictor variable on the outcome variable.
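To illustrate what "holding other variables constant" changes, the sketch below simulates data in which podcast spend tends to rise with TV spend (all names and numbers are made up, with a true podcast effect of 0.2). The simple model attributes part of TV's effect to `podcast`; the multiple model separates the two.

```r
# Simulated data with correlated predictors; the true podcast effect is 0.2
set.seed(7)
n <- 500
tv <- runif(n, 0, 300)
podcast <- 0.1 * tv + rnorm(n, sd = 5)  # podcast spend tends to rise with TV spend
sales <- 5 + 0.05 * tv + 0.2 * podcast + rnorm(n, sd = 1)
sim <- data.frame(sales, tv, podcast)

simple_fit <- lm(sales ~ podcast, data = sim)
multi_fit <- lm(sales ~ tv + podcast, data = sim)

simple_slope <- unname(coef(simple_fit)["podcast"])
multi_slope <- unname(coef(multi_fit)["podcast"])
c(simple = simple_slope, multiple = multi_slope)
```

In this setup the simple-model slope is inflated because `podcast` stands in for the omitted `tv` variable, while the multiple-model slope lands close to the simulated value of 0.2.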

As this example shows, the selection of variables in a regression model can have wide-ranging impacts on the results and interpretation of our models! Let's dive into one more exercise to practice building and interpreting multiple linear regression models.

The difference between simple and multiple linear regression.

Multiple Linear Regression

Time to pull it all together! The interpretation of coefficients in multiple linear regression is slightly different from that of coefficients in simple linear regression. The coefficient of a continuous independent variable, like `podcast`, represents the difference in the predicted value of `sales` for each one-dollar increase in `podcast`, given that *all* other variables in the model, including `TV`, are held constant. Given the output of calling `summary(model)` below, we can correctly say that for every one-dollar increase in podcast advertisement spending, while holding `TV` and `newspaper` constant, the total sales of the related product increase by 1.049 dollars.

```r
summary(model)

#output
Call:
lm(formula = sales ~ TV + podcast + newspaper, data = train)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.583386   1.024616   4.473 1.65e-05 ***
TV          3.006340   1.004924   7.380 1.62e-11 ***
podcast     1.049249   1.027665   5.395 3.10e-07 ***
newspaper   1.006340   1.002924   6.380 1.12e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

In addition, the interpretation of boolean categorical variables differs slightly from that of continuous variables. The coefficient value associated with a boolean categorical variable represents the effect of changing from one category to the other. For instance, the coefficient value of 1.006 for `newspaper` tells us that running print advertisements is associated with a 1.006 dollar increase in `sales`, holding the values of `TV` and `podcast` constant.
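For a model with a single boolean predictor, this interpretation can be verified exactly: the coefficient equals the difference between the two group means. The sketch below uses simulated data and a hypothetical logical variable, `ran_print_ad`; with additional predictors in the model, the coefficient becomes an adjusted difference rather than the raw one.

```r
# Simulated data; `ran_print_ad` is a hypothetical TRUE/FALSE predictor
set.seed(8)
ran_print_ad <- rep(c(FALSE, TRUE), each = 50)
sales <- 10 + 1.5 * ran_print_ad + rnorm(100, sd = 2)
fit <- lm(sales ~ ran_print_ad)

# R names the coefficient after the variable plus the non-reference level
coef_name <- "ran_print_adTRUE"
diff_in_means <- mean(sales[ran_print_ad]) - mean(sales[!ran_print_ad])

all.equal(unname(coef(fit)[coef_name]), diff_in_means)
```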

As we've suggested throughout this lesson, data scientists often build many variations of a model with different combinations of independent variables before ultimately committing to the model that best fits test data. Let's practice building, interpreting, and selecting the best fit multiple linear regression model for our `convert_clean` dataset!

Assessing Multiple Linear Regression

Whew, that's a wrap! You've covered *a lot* of material related to linear regression and its implementation in R. Here are the main concepts we've covered:

- Statistical model building entails four main steps: 1) **confirming data assumptions**, 2) **building a model on training data**, 3) **assessing model fit**, and 4) **analyzing model results**.

- We can use a combination of qualitative methods, such as box plots, and quantitative methods, like the **correlation coefficient**, to check that our data meets our assumptions.

- We use the `lm()` function and `Y ~ X` notation to build a linear regression model. The `Y` variable is referred to as the **outcome variable** of the model, and any `X` variable is referred to as a **predictor variable**.

- We can use a similar set of qualitative and quantitative methods to evaluate the fit of our model, including a comparison of the plotted model to a LOESS smoother and statistics like **mean squared error** (**MSE**) and **R-squared**.

- MSE and R-squared statistics are summaries of the overall value of the model residuals. A **residual** is the difference between the value of a data point predicted by a model and its actual observed value.

- The results of a linear regression model include **regression coefficients**. These coefficients represent the effect their respective predictor variable has on the model's outcome variable.

- In a simple linear regression, the regression coefficient represents the effect of a one-unit increase in the predictor variable.

- The **intercept coefficient** represents the value of the outcome variable given that the predictor variable is equal to zero; this coefficient isn't always meaningful and depends on the situation being modeled.

- The **p-value** associated with a regression coefficient helps us understand whether the effect of a variable is statistically significant.

- **Multiple linear regression** is similar to simple regression, except that it includes multiple predictor variables.

- In a multiple linear regression, the regression coefficient represents the effect of a one-unit increase in the respective predictor variable, given that all other predictor variables are held constant.

- In both simple and multiple linear regression, the coefficient of a boolean categorical variable represents the effect of switching from one category to the other.

Review

Linear Regression in R

Dip your toes into the world of Machine Learning by learning how to build and interpret linear regression models in R. Learn how a company might make strategic advertisement decisions based on the output from a linear regression model. In the off-platform project, you will explore several data sets that come built into R.

Learn about the difference between simple linear regression and multiple linear regression.

Linear Regression In R

 **What is regression analysis?**
	
We often hear of new, complex “machine learning” methods that allow us to generate human language, very accurately predict changes in the stock market, or recognize that an image contains a person or specific object. While some of these amazing applications are results of genuinely new methodological developments, more frequently, these applications rely on the savvy use of a classic statistical technique–– regression model analysis. The first regression model was specified by Adrien-Marie Legendre, a French mathematician, in 1805, and regression-based modeling has been the cornerstone of applied statistics ever since! 

Regression analysis is a group of statistical methods that estimate the relationship between a **dependent variable** (otherwise known as the outcome variable) and one or more **independent variables** (often called predictor variables). The most frequently used regression analysis is **linear regression**, which requires a scientist to find a line of **"best fit"** that most closely traces the data according to a certain mathematical criterion.

Unlike many other models in Machine Learning, regression analyses can be used for two separate purposes. First, in the social sciences, it is common to use regression analyses to infer a causal relationship between a set of variables; second, in data science, regression models are frequently used to predict and forecast new values. Therefore, we can use regression analysis to answer a wide variety of questions, including the examples below.

- Is the relationship between two variables linear?
- Is there even a dependency between two variables?
- How strong is the relationship?
- Which variable contributes the most to the outcome measurement?
- How accurately can we predict future values?
- Is our outcome variable caused by another variable?

**Estimating Coefficients**

Regression analysis answers all these questions and more by estimating **regression coefficients** for every predictor variable used in a model. Let’s assume that we are analyzing a simple model that is made up of just one predictor and one outcome variable. This model is formally represented as follows:

```tex
Y = B_0 + B_1*x + error
```

    

In this representation, the **beta values** (`B_0` and `B_1`) represent the regression coefficients. In high school, we are often taught that a line can be formally represented as follows:

```tex
Y = m*x + b
```

Our simple regression model follows the same formula, except that in our case the **m**, or **slope** value, is represented by **B_1**, and the **b**, or intercept value, is represented by **B_0**. Regression analysis finds the best fit values for B_0 and B_1 and allows us to describe the relationship between two variables using the resulting equation of the best fit line.

So how does a regression model find these beta values? In most cases, the best fit line is taken to be the line that minimizes the **residual error**. A model will never perfectly fit the data, so there will always be some error: a difference between the actual observed data value and the value predicted by the model. This difference is commonly referred to as a **residual**, and is formally represented as follows:

```tex
Error_i = y_i - \hat{y}_i
```
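As a rough illustration of the coefficients and residuals described above, the sketch below fits a line to five made-up points with `lm()`. The vectors `x` and `y` are hypothetical toy data, not values from the article's figures.

```r
# Hypothetical toy data, invented for this sketch.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# lm() estimates the best-fit values of B_0 (intercept) and B_1 (slope).
fit <- lm(y ~ x)
b_0 <- unname(coef(fit)["(Intercept)"])
b_1 <- unname(coef(fit)["x"])

# Predicted values (y-hat) from the fitted line: B_0 + B_1 * x
y_hat <- b_0 + b_1 * x

# Residuals: observed value minus predicted value, one per data point.
resid <- y - y_hat

c(b_0 = b_0, b_1 = b_1)  # intercept 0.05, slope 1.99
round(resid, 2)          # 0.06 -0.13 0.18 -0.21 0.10
```

For these points the fitted line is roughly `y = 0.05 + 1.99 * x`, and each residual is the vertical gap between a point and that line.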

A very naive model for data prediction would just predict that the value of a new data point will equal the average of all observed data. This model would look like the example below, where the model’s predictions are represented by a red line:

![ ](https://content.codecademy.com/programs/analyze-data-with-r/linear-regression/article1.png)

This model doesn’t seem like it fits our data very well. If we drew a line from our observed data points to the prediction line, as shown below, then added up the absolute length of all these lines, our **sum of residual error** would be equal to 9. Any line that has a lower sum of residual error than our naive model would be considered a better fit line. 

![ ](https://content.codecademy.com/programs/analyze-data-with-r/linear-regression/article2.png)

However, it is important to note that regression models use the **sum of *squared* error (SSE)**; in our example above, our naive model has an SSE of 13.92. A prediction can be either greater than or less than the actual value, producing a positive or negative error value. If we simply added up these raw errors, positive and negative values would cancel out, and the sum could shrink without the model fitting the data any better. We avoided this issue in our calculation above by taking the absolute difference between observed and predicted values. While the absolute difference solves this issue, it is far more common to fit a model based on squared errors. Squaring penalizes large errors disproportionately: an error of 4 contributes 16 to the sum, while an error of 2 contributes only 4. This biases the model fitting calculation towards a line that produces reasonably accurate predictions for *all* values in a dataset, rather than only a handful of the most common observations. The example below shows a line fit by selecting the beta coefficients that minimize SSE, which brings it down to 5.38. Doesn’t that line look better?

![ ](https://content.codecademy.com/programs/analyze-data-with-r/linear-regression/article3.png)
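The comparison between the naive model and the least-squares line can be sketched in R as follows. The data here are made up, so the numbers will not match the 13.92 and 5.38 shown in the figures; the point is only the mechanics of the calculation.

```r
# Hypothetical toy data, invented for this sketch.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Naive model: always predict the mean of the observed outcomes.
naive_pred <- rep(mean(y), length(y))
naive_err  <- y - naive_pred

# Raw errors cancel out (the sum is essentially 0), so this sum says
# nothing about how well the model fits.
sum_raw <- sum(naive_err)

# Two ways to stop the cancellation: absolute errors or squared errors.
sum_abs   <- sum(abs(naive_err))
sse_naive <- sum(naive_err^2)

# Least-squares line: lm() picks the coefficients that minimize SSE.
fit     <- lm(y ~ x)
sse_fit <- sum(residuals(fit)^2)

c(sse_naive = sse_naive, sse_fit = sse_fit)
```

On this toy data the naive model's SSE is about 39.71, while the fitted line's SSE is about 0.11, so the fitted line is the better model by this criterion.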

**Assess Model Fit**

While minimizing the sum of squared error gives us the best fit model for the data at hand, it doesn’t guarantee that this best model is actually a “good”, or relatively accurate, model. There are a number of statistics used to summarize model fit, but one of the most common measures is **R-squared**, sometimes called the coefficient of determination. R-squared quantifies the **proportion of the variability in the outcome variable that can be explained by the predictor variable**. More simply, it compares the sum of squared error for a line which is simply the average of all Y values, or outcome values, with the sum of squared error for our proposed model. It can be formally represented as follows:

```tex
SS_{total} = \sum_i (y_i - \bar{y})^2
```

```tex
SS_{residual} = \sum_i (y_i - \hat{y}_i)^2
```

```tex
R^2 = 1 - SS_{residual}/SS_{total}
```

In this case, SS_total represents the sum of squared differences between the actual data and the average of all Y values, and SS_residual represents the sum of squared differences between the actual data and the values predicted by our model. We saw in our example above that SS_total was 13.92 (the SSE of the naive, average-only model) and SS_residual was 5.38. The R-squared calculation is simply one minus the ratio of the residual sum of squares to the total sum of squares; plugging in the values from our example, we get an R-squared value of 0.61:

```tex
R^2 = 1 - 5.38/13.92
```
```tex
R^2 = 1 - 0.3865 \approx 0.61
```

If predictor variable X can predict the outcome variable with consistent accuracy, then SS_residual is small relative to SS_total and the R-squared value will be close to 1. If the opposite is true, the R-squared value is closer to 0. We can provide a narrative around the R-squared value by saying that: “[R-squared value, as a percentage] percent of the variation in [outcome variable] is explained (or predicted) by [predictor variable]”. In our example, 61% of the variation in our Y value is explained by our X value. That’s pretty good!
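The arithmetic above can be checked in a few lines of R. The first part uses the two sums of squares quoted in the article; the helper `r_squared_from()` and the toy vectors `x` and `y` are hypothetical names and data added for illustration only.

```r
# Values quoted in the article's example.
ss_total    <- 13.92
ss_residual <- 5.38

r_squared <- 1 - ss_residual / ss_total
round(r_squared, 2)  # 0.61

# The same formula as a small helper (hypothetical name) that works on any
# pair of observed and predicted vectors.
r_squared_from <- function(observed, predicted) {
  ss_res <- sum((observed - predicted)^2)
  ss_tot <- sum((observed - mean(observed))^2)
  1 - ss_res / ss_tot
}

# Sanity check on made-up data: the hand calculation should agree with the
# R-squared value that summary() reports for an lm() fit.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)
manual_r2 <- r_squared_from(y, fitted(fit))
all.equal(manual_r2, summary(fit)$r.squared)  # TRUE
```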

This is just the beginning of regression analysis! For the past two centuries researchers have been developing innovative approaches to regression-based analysis. By building up your theoretical and technical understanding of regression models, you too can join the ranks of statisticians utilizing the most widely and consistently applied methods to better understand our world. 







This article is a brief introduction to the formal theory (otherwise known as Math) behind regression analysis.

Introduction to Regression Analysis

`model` is a multiple linear regression model that predicts `income` based on a combination of `gender`, `occupation` and `home_zipcode`. 

`model` is a simple linear regression model that predicts `income` based on a combination of `gender`, `occupation` and `home_zipcode`. 

`model` is a multiple linear regression model that uses `income` to predict `gender`, `occupation`, and `home_zipcode`.

model <- lm(income ~ gender + occupation + home_zipcode, data = train)

`model_1` best fits the data, as its R-squared value is larger than that of `model_2`. `model_1` explains 82% of the variation in the outcome variable, while `model_2` explains 67% of the variation in the outcome variable.

`model_2` best fits the data, as its R-squared value is smaller than that of `model_1`. `model_1` explains 82% of the variation in the outcome variable, while `model_2` explains 67% of the variation in the outcome variable.

`model_1` best fits the data, as its R-squared value is larger than that of `model_2`. `model_1` explains 18% of the variation in the outcome variable, while `model_2` explains 33% of the variation in the outcome variable.

`model_2` best fits the data, as its R-squared value is smaller than that of `model_1`. `model_1` explains 18% of the variation in the outcome variable, while `model_2` explains 33% of the variation in the outcome variable.

There is a strong positive linear relationship between `sales` and `tv`.

There is a weak positive linear relationship between `sales` and `tv`.

There is a strong negative linear relationship between `sales` and `tv`.

There is a weak negative linear relationship between `sales` and `tv`.

`newspaper` is statistically significant. When a product is advertised in newspapers, there is an estimated $120 increase in `sales`.

`newspaper` is statistically significant. For every dollar spent on advertisement in newspapers, there is an estimated $120 increase in `sales`.

`newspaper` is not statistically significant. For every dollar spent on advertisement in newspapers, there is an estimated $120 increase in `sales`.

`newspaper` is not statistically significant. When a product is advertised in newspapers, there is an estimated $120 increase in `sales`.

Test your knowledge of the theory behind linear regression and the R coding skills required to implement regression models.

Linear Regression in R Quiz

In this project, you'll use linear regression and national survey data to predict the income of an individual based off of social characteristics like age, gender, and education.

First, let's check out the structure of our dataset and confirm it contains all the variables described in the introduction. Call `str()` on the dataset, which has been saved to a variable called `psid`.

Create a bar chart that plots the distribution of `age` in our `psid` dataset using `ggplot`'s `geom_bar`; do any of the observed values seem unrealistic?

Considering we are interested in predicting the **labor income** of survey respondents, it would be reasonable to filter our data so that it only includes respondents of working age, roughly between 18 and 75. Use `dplyr`'s `filter()` method to exclude observations with `age < 18` or `age > 75`; save the result to a variable called `psid_clean`.
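A filtering step like this one might look as follows. This is only a sketch: `psid_toy` is a small invented stand-in for `psid`, and it assumes the `dplyr` package is installed.

```r
library(dplyr)

# Hypothetical stand-in for `psid`, including a few implausible ages.
psid_toy <- data.frame(
  age          = c(5, 18, 34, 52, 75, 101, 999),
  labor_income = c(0, 12000, 48000, 61000, 9000, 0, 0)
)

# Keep only working-age respondents: rows where age is between 18 and 75,
# inclusive. Conditions separated by commas are combined with AND.
psid_toy_clean <- psid_toy %>%
  filter(age >= 18, age <= 75)

psid_toy_clean$age  # 18 34 52 75
```

Writing the condition as what to keep (`age >= 18, age <= 75`) is equivalent to excluding rows with `age < 18` or `age > 75`.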

Let's confirm that our call to `filter()` properly truncated our data. Create another bar chart, this time of `psid_clean`'s `age` column, using `geom_bar()`.

Now let's perform the same check for the education level of respondents. Create a boxplot of the distribution of `education_years` using `geom_boxplot()`.

Again, some of the values in our observed data do not make much sense: how could someone achieve 100 years of formal education? Let's use `filter()` to limit our dataset to observations where `education_years` is between `5` and `25`.

Finally, let's check our outcome variable, `labor_income`. Create a boxplot of the distribution of `labor_income` using `geom_boxplot()`. 

Hmm, the scales on our `labor_income` boxplot are difficult to interpret; let's take a look at a quantitative representation of the variable's distribution by calling `summary()` on `labor_income`. What does this output tell us about the distribution of income in our dataset?

Let's make sure we fully understand how this highly skewed income distribution relates to our key variables in our dataset. Create a scatterplot of average income by age using `dplyr`'s `group_by` and `summarise()` functions, along with `ggplot`'s `geom_point()`.

Which ages seem most likely to have zero `labor_income`?  

Now we can build a model! Let's specify our training and test datasets. 
- Change the dataset reference within `sample()` from `psid` to `psid_clean`.
- Using the `sample` variable already defined in the workspace, create a `train` data frame that includes all observations in `sample`, then create a `test` data frame that includes all observations **not** in `sample`.
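One common way to carry out a split like this is sketched below. It assumes that `sample` holds randomly chosen row indices; the project workspace may define `sample` differently, and `psid_toy_clean` is an invented stand-in for `psid_clean`.

```r
# Hypothetical stand-in for `psid_clean` with 100 rows.
set.seed(123)
psid_toy_clean <- data.frame(
  education_years = sample(5:25, 100, replace = TRUE),
  labor_income    = round(runif(100, 0, 90000))
)

# Assumption: `sample` is a vector of row indices chosen at random without
# replacement, here 60 of the 100 rows.
sample <- sample(nrow(psid_toy_clean), size = 60)

# Rows whose index is in `sample` go to train; all other rows go to test.
train <- psid_toy_clean[sample, ]
test  <- psid_toy_clean[-sample, ]

c(train = nrow(train), test = nrow(test))  # 60 and 40
```

Negative indexing (`-sample`) drops the listed rows, which guarantees that no observation appears in both data frames.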

Build a simple linear model, `model`, that regresses `labor_income` on `education_years` (that is, predicts `labor_income` from `education_years`). Don't forget to use our `train` dataset!

Let's compare our model against a LOESS smoother to see where our model predictions differ most substantially from the average observed value.
- Pass `aes(education_years, labor_income)` into a call to `ggplot()`; don't forget to use our `train` dataset.
- Add a call to `geom_point()` to plot our observed values.
- Add a call to `geom_smooth()`, passing in `method = "lm"` as a parameter.
- Add a call to `geom_smooth()`, passing in `se = FALSE` and `color = "red"` as parameters.

How closely does our model align with the LOESS smoother?

Using the R-squared metric, quantify the fit of `model`. Extract `r.squared` from our model output, multiply the result by `100`, and save the result to a variable called `r_sq`.

Uncomment the following formatted string to see how we would provide a narrative around `model`'s R-squared value.
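The R-squared steps above can be sketched as follows. The data frame `train_toy` is simulated for illustration, and the `sprintf()` sentence is an assumed wording rather than the exact string provided in the project workspace.

```r
# Simulated stand-in for `train`; all values are invented.
set.seed(1)
train_toy <- data.frame(education_years = sample(5:25, 300, replace = TRUE))
train_toy$labor_income <- 3000 * train_toy$education_years +
  rnorm(300, sd = 20000)

# Simple linear regression: outcome on the left of ~, predictor on the right.
model <- lm(labor_income ~ education_years, data = train_toy)

# R-squared is stored on the summary object, not on the model itself.
r_sq <- summary(model)$r.squared * 100

# Build a narrative sentence around the value (illustrative wording).
narrative <- sprintf(
  "%.1f percent of the variation in labor income is explained by years of education.",
  r_sq
)
narrative
```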

While our `model` was a good start, it seems reasonable to expect that other variables in our dataset, like `age` or `gender`, also have an impact on `labor_income`. Build a second model, `model_2`, which regresses `labor_income` on `education_years`, `age`, and `gender`. Don't forget to use our `train` dataset.

Using the R-squared metric, quantify the fit of `model_2`. Extract `r.squared` from our model output, multiply the result by `100`, and save the result to a variable called `r_sq_2`.

Uncomment the following formatted string to see how we would provide a narrative around `model_2`'s R-squared value.

Let's also provide a graphic representation of our `model_2` fit by plotting the observed and predicted values of `labor_income` using `add_predictions()` from the `modelr` package:
- Using our `test` data, add a call to `add_predictions()`, passing in `model_2`.
- Add a call to `ggplot()`, passing `age` and `labor_income` as parameters to `aes()`.
- Add a call to `geom_point()` to plot our observed values.
- Add a call to `geom_line()`, explicitly setting `pred` as the `y` value in `aes()`, and passing in `color = "blue"` as a parameter.


With `model_2` as a substantial improvement over `model` (based on R-squared score), let's dive into results! Call `summary()` on `model_2` and take a first look at the coefficient values. Try to answer the following questions:
- Do `education_years`, `age`, and `gender` all have a significant impact on `labor_income`?
- `gender` is a boolean categorical variable; how should we interpret its coefficient value?
- Which variable has the largest effect on `labor_income`?

Extract the value of the `education_years` coefficient and assign the result to a variable called `education_coefficent`.
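Reading coefficients out of a fitted multiple regression might look like the sketch below. The data frame `train_toy` and the effect sizes used to simulate it are invented; only the variable name `education_coefficent` follows the spelling used in the step above.

```r
# Simulated stand-in for `train`; all values are invented.
set.seed(7)
n <- 300
train_toy <- data.frame(
  education_years = sample(5:25, n, replace = TRUE),
  age             = sample(18:75, n, replace = TRUE),
  gender          = sample(c("female", "male"), n, replace = TRUE)
)
train_toy$labor_income <- 2500 * train_toy$education_years +
  400 * train_toy$age + rnorm(n, sd = 15000)

model_2 <- lm(labor_income ~ education_years + age + gender,
              data = train_toy)

# Look the coefficient up by name rather than by position, so the code
# still works if the order of predictors changes.
education_coefficent <- unname(coef(model_2)["education_years"])

# p-values for each term sit in the coefficient table of the summary.
p_values <- summary(model_2)$coefficients[, "Pr(>|t|)"]

education_coefficent
p_values
```

In this simulation, `education_coefficent` estimates the change in `labor_income` for one additional year of education with `age` and `gender` held constant, and the two-level `gender` variable appears as a single term, `gendermale`.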

Uncomment the following formatted string to see how we would provide a narrative around `model_2`'s `education_years` coefficient.

Great work! You've successfully cleaned and analyzed a real-world dataset, built and iterated towards a best-fit model, and provided an interpretation of your results! 

Feel free to keep trying different combinations of predictor variables to see if you can improve upon `model_2`.

Predicting Income with Social Data

### Why Learn Linear Regression in R? 

This course is an introduction to linear regression models and how to implement them using the R programming language. Linear regression models are used in machine learning, so this course serves as an introduction to that topic as well. R is used by professionals in the Data Analysis and Data Science fields as part of their daily work.



### Take-Away Skills 
In this course, you will learn how to make linear regression models using R. In addition to learning how to make the model, you will also learn how to interpret it. This is almost more critical than making the model itself — being able to communicate the findings that you get from your model is an essential skill of a data scientist.


Learn about the difference between simple linear regression and multiple linear regression in R

Learn Linear Regression with R
