While linear regression is perhaps the most widely applied method in data science, it relies on a strict set of assumptions about the relationship between predictor and outcome variables. The most obvious (but crucial!) assumption is a **linear relationship** between the predictor and outcome. From this assumption follows one key condition that any variables we want to include in our model must satisfy, and that we should check before building the model:

**The expected value of the outcome variable is a straight-line function of the predictor variable alone**. The simplest check for this relationship is quite straightforward: we can visualize the relationship between the predictor and outcome variables as a scatterplot. A linear relationship will resemble a straight line with a slope not equal to zero, like the relationship between spending on TV ads and the overall sales volume of the related product found in our `advertising` dataset.
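
As a rough illustration of the scatterplot check described above, the sketch below plots a predictor against an outcome with `ggplot2`. The data frame `ads_sketch` is simulated and purely hypothetical; it only borrows the `TV` and `Sales` column names and is not the lesson's `advertising` dataset.

```r
library(ggplot2)

# Hypothetical stand-in for the `advertising` data: simulated values, not the real dataset
set.seed(42)
ads_sketch <- data.frame(TV = runif(100, min = 0, max = 300))
ads_sketch$Sales <- 7 + 0.05 * ads_sketch$TV + rnorm(100, sd = 2)

# Scatterplot with the predictor on the x-axis and the outcome on the y-axis
tv_sales_plot <- ggplot(ads_sketch, aes(x = TV, y = Sales)) +
  geom_point() +
  labs(x = "TV ad spending", y = "Sales")

# Calling the plot object renders it
tv_sales_plot
```

If the points cluster around a sloped straight line, a linear model is a plausible starting point; a curve, a fan shape, or a flat cloud suggests the assumption may not hold.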

We can also quantitatively test for a linear relationship by computing the **correlation coefficient**. The correlation coefficient always falls between negative one and positive one. A coefficient close to `0` (roughly between `-0.20` and `0.20`) suggests a weak linear relationship between two variables. A coefficient closer to positive or negative one suggests a stronger linear relationship. In R, we can compute the correlation coefficient using the `cor.test()` function as follows:

    coefficient <- cor.test(advertising$TV, advertising$Sales)
    coefficient$estimate # Output: 0.837
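
To make the coefficient less of a black box, here is a minimal sketch, on simulated data rather than the `advertising` dataset, that computes Pearson's correlation by hand and compares it with the estimate from `cor.test()`. The vectors `x` and `y` and the names `r_manual` and `r_test` are hypothetical and exist only for this illustration.

```r
# Simulated data (hypothetical, not the `advertising` dataset)
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

# Pearson's r: the covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# cor.test() reports the same estimate, plus a p-value and a confidence interval
result <- cor.test(x, y)
r_test <- unname(result$estimate)

all.equal(r_manual, r_test)  # the two values should agree up to floating-point tolerance
result$conf.int              # 95% confidence interval for the correlation
```

Because the covariance is divided by the product of the standard deviations, the result is unitless and cannot fall outside the range from negative one to positive one.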

### Instructions

**1.**

Load `conversion.csv` into the working environment using `read.csv()`. Save the result to a variable called `conversion`, and don’t forget to set the `header` parameter to `TRUE`!

**2.**

A good statistical workflow *always* involves a thorough understanding of the data available to model and a qualitative analysis of relevant variables. Use `str()` to print the structure of the dataset and the list of variable types. Which variables seem like possible predictors of purchase, or `total_convert`?

**3.**

Use a combination of the base `ggplot()` function and `geom_bar()` to plot the distribution of the `clicks` variable, a measure of how many times a user clicked on an advertisement. Save the result to a variable called `clicks_dist`. Call `clicks_dist`.

**4.**

Take a closer look at the `clicks_dist` visualization. What is the approximate range of the `clicks` variable? What seems like the most common value (otherwise called the *mode*) of `clicks`? Set `clicks_mode` equal to the approximate value of the `clicks` mode.

**5.**

Assign the result of calling `cor.test()`, with `conversion$total_convert` and `conversion$clicks` as input parameters, to a variable called `correlation`. Print out `correlation$estimate`. Does the coefficient value suggest that the variables have a linear relationship?