Simple linear regression is not a misnomer: it is an uncomplicated technique for predicting a continuous outcome variable, Y, on the basis of just one predictor variable, X. As detailed in previous exercises, a number of assumptions are made so that we can model the relationship between X and Y as a linear function. Using our advertising dataset, we could model the relationship between the amount spent on podcast advertising in a month and the number of advertised products eventually sold as follows:

Y = beta_0 + beta_1*X + error


Y: represents the number of products sold

X: represents the amount spent on podcast ads for the respective product

Beta_0: is the intercept, or the number of products sold when no money has been spent on podcasts

Beta_1: is the coefficient, or the slope, of the line representing the relationship

Error: represents the random variation in the relationship between the two variables
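To make the roles of the intercept and slope concrete, here is a small worked sketch. The coefficient values are hypothetical and chosen only for illustration; they are not estimates from the advertising dataset. The error term is left out, so the result is the expected value of Y at each X.

```r
# Hypothetical coefficients, for illustration only
beta_0 <- 50     # intercept: products sold when podcast spend is zero
beta_1 <- 0.25   # slope: additional products sold per unit of podcast spend

# Three example spend levels (also hypothetical)
podcast_spend <- c(0, 500, 1000)

# Expected Y from the linear function, ignoring the random error term
expected_sales <- beta_0 + beta_1 * podcast_spend
expected_sales  # 50 175 300
```

With no spend the expected value equals the intercept, and each additional 500 spent adds 0.25 * 500 = 125 to it.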

To build this model in R, we use the lm() function from the built-in stats package, with the formula notation Y ~ X:

model <- lm(sales ~ podcast, data = train)
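As a rough sketch of what lm() returns, the example below fits the same formula to simulated data. The train data frame here is a hypothetical stand-in generated with a known intercept (5) and slope (2); it is not the lesson's advertising data. Because the true values are known, coef() can be checked against them.

```r
set.seed(42)  # fixed seed so the simulated data is reproducible

# Hypothetical training data: true beta_0 = 5, true beta_1 = 2
n <- 200
podcast <- runif(n, min = 0, max = 100)
sales <- 5 + 2 * podcast + rnorm(n, mean = 0, sd = 3)
train <- data.frame(podcast = podcast, sales = sales)

# Same call as in the lesson: outcome ~ predictor
model <- lm(sales ~ podcast, data = train)

# Named vector of estimates: "(Intercept)" is beta_0, "podcast" is beta_1
coef(model)
```

The estimates should land close to, but not exactly on, 5 and 2, since the simulated error term adds noise.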

But wait! Before building this model, we need to split our data into test and training sets. For the development of this simple model, we’ll use a 60/40 split of our data, where 60% is used to train the model and 40% is used to test the model’s accuracy and generalizability. We can randomly assign data points to the test or training set using base R’s sample() function and row indexing with a logical vector:

# specify 60/40 split
sample <- sample(c(TRUE, FALSE), nrow(advertising), replace = TRUE, prob = c(0.6, 0.4))

# subset data points into train and test sets
train <- advertising[sample, ]
test <- advertising[!sample, ]
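The sketch below checks what this kind of split produces. It assumes a simulated stand-in for the advertising data frame, a fixed seed via set.seed() for reproducibility, and a flag vector named in_train (a hypothetical name, used here to avoid reusing the name of the sample() function). None of these come from the lesson's own code.

```r
set.seed(123)  # assumption: any fixed seed makes the random split repeatable

# Hypothetical stand-in for the advertising dataset (simulated values)
advertising <- data.frame(
  podcast = runif(500, min = 0, max = 1000),
  sales   = rnorm(500, mean = 200, sd = 40)
)

# One TRUE/FALSE flag per row: TRUE with probability 0.6, FALSE with 0.4
in_train <- sample(c(TRUE, FALSE), nrow(advertising), replace = TRUE, prob = c(0.6, 0.4))

# Rows flagged TRUE go to train, all remaining rows go to test
train <- advertising[in_train, ]
test <- advertising[!in_train, ]

# Every row ends up in exactly one of the two sets
nrow(train) + nrow(test) == nrow(advertising)

# Share of rows in the training set: close to 0.6, but usually not exact
mean(in_train)
```

Because each row is assigned independently, the split is only approximately 60/40; the exact proportions vary from one random draw to the next.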



First, let’s split our conversion_clean dataset into 60/40 train/test subsets.

  • Create a variable named data_sample by assigning the result of calling sample(), with c(TRUE, FALSE), nrow(conversion_clean), replace = TRUE, and prob = c(0.6, 0.4) as arguments.
  • Using row indexing, assign all data points selected by data_sample to a variable called train.
  • Using row indexing, assign all data points not selected by data_sample to a variable called test.

Let’s fit a linear model of the relationship between the number of products sold and the number of clicks on the respective product advertisement. This means that conversion_clean’s total_convert value, the total number of product purchases by a single user, will be our Y variable, or outcome variable. clicks, the total number of times a user clicks on a version of an ad, will be our X variable, or predictor variable.

Assign the result of calling lm(), using ~ formula notation to set a linear relationship between total_convert and clicks, to a variable called model. Don’t forget to set the data parameter equal to train!
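The steps above can be sketched end to end as follows. The conversion_clean data frame here is a simulated, hypothetical stand-in with only the two columns the exercise names. The final step, scoring the held-out rows with root mean squared error (RMSE), is one common choice of accuracy measure and an assumption of this sketch rather than something the exercise specifies.

```r
set.seed(2024)  # assumption: fixed seed for a repeatable example

# Hypothetical stand-in for conversion_clean (simulated values)
n <- 1000
clicks <- rpois(n, lambda = 10)
total_convert <- pmax(0, round(0.3 * clicks + rnorm(n, mean = 0, sd = 1)))
conversion_clean <- data.frame(clicks = clicks, total_convert = total_convert)

# 60/40 split: TRUE rows go to train, FALSE rows go to test
data_sample <- sample(c(TRUE, FALSE), nrow(conversion_clean), replace = TRUE, prob = c(0.6, 0.4))
train <- conversion_clean[data_sample, ]
test <- conversion_clean[!data_sample, ]

# Fit total_convert as a linear function of clicks, using training rows only
model <- lm(total_convert ~ clicks, data = train)

# Predict the held-out rows and summarize the error with RMSE
predictions <- predict(model, newdata = test)
rmse <- sqrt(mean((test$total_convert - predictions)^2))
rmse
```

Fitting on train and scoring on test keeps the accuracy estimate from being flattered by rows the model has already seen.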
