*Simple* linear regression is not a misnomer: it is an uncomplicated technique for predicting a continuous outcome variable, *Y*, on the basis of just one predictor variable, *X*. As detailed in previous exercises, a number of assumptions are made so that we can model the relationship between *X* and *Y* as a linear function. Using our `advertising` dataset, we could model the relationship between the amount spent on podcast advertising in a month and the resulting product sales as follows:

$Y = \beta_0 + \beta_1 X + \epsilon$

Where:

**$Y$**: represents the dollar value of products sold

**$X$**: represents the amount spent on podcast ads for the respective product

**$\beta_0$**: is the intercept, or the expected sales when no money has been spent on podcast ads

**$\beta_1$**: is the coefficient, or the slope, of the line representing the relationship: the expected change in $Y$ for a one-unit increase in $X$

**$\epsilon$**: is the error term, which represents the random variation in the relationship between the two variables
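To make the role of each term concrete, here is a minimal sketch that generates sales figures from the equation above. The intercept, slope, noise level, and spend amounts are made up for illustration; they are not estimates from the `advertising` data, and drawing the error from a normal distribution is an assumption.

```r
# Hypothetical parameter values, chosen only for illustration
beta_0 <- 50   # intercept: expected sales with zero podcast spend
beta_1 <- 2    # slope: expected change in sales per unit of podcast spend

# Made-up monthly podcast spend values (X)
podcast_spend <- c(0, 10, 20, 30, 40)

# Deterministic part of the model: beta_0 + beta_1 * X
expected_sales <- beta_0 + beta_1 * podcast_spend

# Random error term, assumed here to be normally distributed
set.seed(42)
error <- rnorm(length(podcast_spend), mean = 0, sd = 5)

# Observed sales = linear trend + random variation
observed_sales <- expected_sales + error

print(expected_sales)  # 50 70 90 110 130
print(observed_sales)  # the same trend, scattered by the error term
```

The first vector lies exactly on the line, while the second scatters around it. Fitting a regression amounts to recovering the intercept and slope from data that looks like the second vector.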

To build this model in R, we use the built-in `lm()` function (part of the `stats` package, which R loads by default) with the formula notation **Y ~ X**:

```r
model <- lm(sales ~ podcast, data = train)
```

But wait! Before building this model, we need to split our data into test and training sets. For this simple model, we’ll use a 60/40 split of our data: 60% is used to train the model, and 40% is used to test the model’s accuracy and generalizability. We can randomly assign data points to the training or test set using base R’s `sample()` function and logical indexing:

```r
# specify 60/40 split
sample <- sample(c(TRUE, FALSE), nrow(advertising), replace = TRUE, prob = c(0.6, 0.4))

# subset data points into train and test sets
train <- advertising[sample, ]
test <- advertising[!sample, ]
```
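The snippet above assumes the `advertising` data frame is already loaded. As a self-contained sketch of the same workflow, the following uses a small simulated stand-in (the data frame `ads`, its values, the seed, and the name `in_train` are all assumptions, not the real dataset). It then fits the model on the training rows and checks it on the held-out rows using root mean squared error, which is one common accuracy measure among several.

```r
set.seed(123)  # make the random split reproducible

# Simulated stand-in for the advertising data (values are made up)
n <- 200
ads <- data.frame(podcast = runif(n, min = 0, max = 100))
ads$sales <- 20 + 0.5 * ads$podcast + rnorm(n, mean = 0, sd = 5)

# Randomly flag each row for training with probability 0.6
in_train <- sample(c(TRUE, FALSE), nrow(ads), replace = TRUE, prob = c(0.6, 0.4))
train <- ads[in_train, ]
test <- ads[!in_train, ]

# Fit the model on the training set only
model <- lm(sales ~ podcast, data = train)
print(coef(model))  # estimated intercept and slope

# Evaluate on the held-out test set
predicted <- predict(model, newdata = test)
rmse <- sqrt(mean((test$sales - predicted)^2))
print(rmse)
```

Because each row is assigned independently, the split is only approximately 60/40. If an exact split matters, one alternative is to draw a fixed number of row indices, for example with `sample(nrow(ads), size = floor(0.6 * nrow(ads)))`.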

### Instructions

**1.**

First, let’s split our `conversion_clean` dataset into 60/40 train/test subsets.

- Create a variable named `data_sample` by assigning the result of calling `sample()` with `c(TRUE, FALSE)`, `nrow(conversion_clean)`, `replace = TRUE`, and `prob = c(0.6, 0.4)` as arguments.
- Using logical indexing, assign all rows of `conversion_clean` selected by `data_sample` to a variable called `train`.
- Using logical indexing, assign all rows of `conversion_clean` *not* selected by `data_sample` to a variable called `test`.

**2.**

Let’s fit a linear model of the relationship between the number of products sold and the number of clicks on the respective product advertisement. This means that `conversion_clean`’s `total_convert` value, the total number of product purchases by a single user, will be our *Y* variable, or outcome variable. `clicks`, the total number of times a user clicks on a version of an ad, will be our *X* variable, or predictor variable.

Assign the result of calling `lm()`, using `~` formula notation to set a linear relationship between `total_convert` and `clicks`, to a variable called `model`. Don’t forget to set the `data` parameter equal to `train`!