Simple linear regression is not a misnomer: it is an uncomplicated technique for predicting a continuous outcome variable, Y, on the basis of just one predictor variable, X. As detailed in previous exercises, a number of assumptions are made so that we can model the relationship between X and Y as a linear function. Using our advertising dataset, we could model the relationship between the amount spent on podcast advertising in a month and the resulting product sales as follows:

Y = beta_0 + beta_1*X + error


Y: represents the dollar value of products sold

X: represents the amount spent on respective product podcast ads

Beta_0: is the intercept, or the expected value of Y when no money has been spent on podcasts

Beta_1: is the coefficient, or the slope, of the line representing the relationship; it is the expected change in Y for each additional unit spent on podcast ads

Error: represents the random variation in the relationship between the two variables

To build this model in R, we use the lm() function from the built-in stats package, with the formula notation Y ~ X:

model <- lm(sales ~ podcast, data = train)
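To make the roles of the intercept, slope, and error concrete, here is a minimal sketch that simulates data from the equation above and then fits it with lm(). Everything in it is an assumption for illustration: the data frame sim, the "true" values of beta_0 and beta_1, and the simulated podcast and sales columns are made up and are not taken from the advertising dataset.

```r
# Simulated data only: the "true" values below are made up for illustration.
set.seed(42)                                   # make the random draws reproducible
n <- 200
beta_0 <- 50                                   # hypothetical intercept
beta_1 <- 2.5                                  # hypothetical slope
podcast <- runif(n, min = 0, max = 100)        # hypothetical monthly ad spend
error <- rnorm(n, mean = 0, sd = 10)           # random variation around the line
sales <- beta_0 + beta_1 * podcast + error     # Y = beta_0 + beta_1*X + error

sim <- data.frame(podcast = podcast, sales = sales)

# Fit the same kind of model as in the lesson, on the simulated data
sim_model <- lm(sales ~ podcast, data = sim)

# Named vector with the estimated intercept and slope
estimates <- coef(sim_model)
print(estimates)
```

The printed estimates should land close to the values used to generate the data, though not exactly on them, because the error term adds noise.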

But wait! Before building this model, we need to split our data into test and training sets. For the development of this simple model, we’ll use a standard 60/40 split of our data, where roughly 60% is used to train the model and the remaining 40% is used to test the model’s accuracy and generalizability. We can randomly assign data points to test or training using base R’s sample() function and logical indexing:

# specify 60/40 split
sample <- sample(c(TRUE, FALSE), nrow(advertising), replace = TRUE, prob = c(0.6, 0.4))

# subset data points into train and test sets
train <- advertising[sample, ]
test <- advertising[!sample, ]
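Because sample() draws a separate TRUE/FALSE flag for each row, the resulting split is only approximately 60/40, and it changes on every run unless a seed is set. The sketch below illustrates both points. It uses a simulated stand-in for the advertising data frame and a hypothetical flag name, in_train, neither of which comes from the lesson.

```r
# Hypothetical stand-in for the advertising data (all values are simulated).
set.seed(123)  # fix the random number generator so the split is reproducible
advertising <- data.frame(
  podcast = runif(500, min = 0, max = 100),
  sales   = runif(500, min = 0, max = 1000)
)

# Draw one TRUE/FALSE flag per row; each flag is TRUE with probability 0.6
in_train <- sample(c(TRUE, FALSE), nrow(advertising),
                   replace = TRUE, prob = c(0.6, 0.4))

# Rows flagged TRUE go to train, the rest go to test
train <- advertising[in_train, ]
test  <- advertising[!in_train, ]

# Every row lands in exactly one of the two subsets
print(nrow(train) + nrow(test) == nrow(advertising))

# The realized share of training rows is close to, but rarely exactly, 0.6
print(mean(in_train))
```

Naming the flag vector something other than sample also avoids reusing the name of the sample() function itself.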



First, let’s split our conversion_clean dataset into 60/40 train/test subsets:

  • Create a variable named data_sample by assigning the result of calling sample() with c(TRUE, FALSE), nrow(conversion_clean), replace = TRUE, and prob = c(0.6, 0.4) as arguments.
  • Using logical indexing, assign all rows of conversion_clean selected by data_sample to a variable called train.
  • Using logical indexing, assign all rows of conversion_clean not selected by data_sample to a variable called test.

Let’s fit a linear model of the relationship between the number of products sold and the number of clicks on the respective product advertisement. This means that total_convert in conversion_clean, the total number of product purchases by a single user, will be our Y variable, or outcome variable, and clicks, the total number of times a user clicks on a version of an ad, will be our X variable, or predictor variable.

Assign the result of calling lm(), using ~ formula notation to set a linear relationship between total_convert and clicks, to a variable called model. Don’t forget to set the data parameter equal to train!
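As a rough sketch of how these steps fit together, the code below splits a data frame, fits the model on train, and then checks it against test. The conversion_clean data frame here is a simulated stand-in with made-up values, and scoring the model with predict() and root mean squared error (RMSE) is one common choice assumed for illustration, not something this exercise prescribes.

```r
set.seed(1)  # reproducible simulation and split

# Hypothetical stand-in for conversion_clean; the real columns may differ in scale.
clicks <- rpois(300, lambda = 20)
conversion_clean <- data.frame(
  clicks = clicks,
  total_convert = 1 + 0.2 * clicks + rnorm(300, mean = 0, sd = 1)
)

# Approximate 60/40 split using one TRUE/FALSE flag per row
data_sample <- sample(c(TRUE, FALSE), nrow(conversion_clean),
                      replace = TRUE, prob = c(0.6, 0.4))
train <- conversion_clean[data_sample, ]
test  <- conversion_clean[!data_sample, ]

# Fit the model on the training rows only
model <- lm(total_convert ~ clicks, data = train)

# Predict the held-out rows and summarize the error with RMSE
predicted <- predict(model, newdata = test)
rmse <- sqrt(mean((test$total_convert - predicted)^2))
print(rmse)
```

A lower RMSE on test suggests the fitted line generalizes better to rows it has not seen, which is the reason for holding out data in the first place.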
