Simple linear regression is not a misnomer: it is an uncomplicated technique for predicting a continuous outcome variable, Y, on the basis of just one predictor variable, X. As detailed in previous exercises, a number of assumptions are made so that we can model the relationship between X and Y as a linear function. Using our advertising dataset, we could model the relationship between the amount spent on podcast advertising in a month and the number of respective products eventually sold as follows:
Y = Beta_0 + Beta_1 * X + Error
Where…
Y: represents the dollar value of products sold
X: represents the amount spent on respective product podcast ads
Beta_0: is the intercept, or the expected value of Y when no money has been spent on podcast ads
Beta_1: is the coefficient, or the slope, of the line representing the relationship; it is the expected change in Y for each additional unit of X
Error: represents the random variation in the relationship between the two variables
To build this model in R, using the standard lm() function (part of the built-in stats package), we use the formula notation of Y ~ X:
model <- lm(sales ~ podcast, data = train)
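To see how the fitted model relates to Beta_0, Beta_1, and Error, here is a minimal sketch on simulated data. The data frame sim_data and the "true" values 10 and 0.5 are made up for illustration and are not part of the advertising dataset; the point is only that lm() recovers estimates close to the values used to generate the data.

```r
# Simulated stand-in for the advertising data (all values are made up).
set.seed(42)                                  # arbitrary seed for reproducibility
n <- 200
podcast <- runif(n, min = 0, max = 100)       # hypothetical monthly ad spend (X)
noise   <- rnorm(n, mean = 0, sd = 5)         # plays the role of Error
sales   <- 10 + 0.5 * podcast + noise         # assumed Beta_0 = 10, Beta_1 = 0.5
sim_data <- data.frame(podcast = podcast, sales = sales)

# Fit Y ~ X, then read back the estimated intercept and slope.
sim_model <- lm(sales ~ podcast, data = sim_data)
estimates <- coef(sim_model)                  # named vector: "(Intercept)", "podcast"
print(estimates)
```

The estimates will not match 10 and 0.5 exactly, because the random noise term means the model only sees a noisy version of the underlying line.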
But wait! Before building this model, we need to split our data into test and training sets. For the development of this simple model, we’ll use a standard 60/40 split of our data, where 60% is used to train the model and 40% is used to test the model’s accuracy and generalizability. We can randomly assign data points to the training or test set using base R’s sample() function and logical indexing. Because each row is assigned independently, the resulting split is approximately, not exactly, 60/40:
# specify 60/40 split
sample <- sample(c(TRUE, FALSE), nrow(advertising), replace = TRUE, prob = c(0.6, 0.4))
# subset data points into train and test sets
train <- advertising[sample, ]
test <- advertising[!sample, ]
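As a quick sanity check on this kind of split, the sketch below applies the same approach to a made-up data frame (toy, with hypothetical columns id and value) and confirms that every row ends up in exactly one subset. The seed is arbitrary, and the indicator vector is named in_train here simply to keep it distinct from the sample() function.

```r
set.seed(123)                                   # arbitrary seed so the split is reproducible
toy <- data.frame(id = 1:1000, value = rnorm(1000))   # made-up data

# TRUE means "goes to train"; each row is drawn independently.
in_train <- sample(c(TRUE, FALSE), nrow(toy), replace = TRUE, prob = c(0.6, 0.4))

toy_train <- toy[in_train, ]                    # rows flagged TRUE
toy_test  <- toy[!in_train, ]                   # rows flagged FALSE

# Every row lands in exactly one subset.
print(nrow(toy_train) + nrow(toy_test) == nrow(toy))

# Realised share of rows in the training set (close to, but rarely exactly, 0.6).
print(mean(in_train))
```

If an exact 60/40 split matters, one alternative is to draw row indices instead, for example with sample(nrow(toy), size = floor(0.6 * nrow(toy))), and index with those.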
Instructions
First, let’s split our conversion_clean dataset into 60/40 train/test subsets.
- Create a variable named data_sample by assigning the result of calling sample(), with c(TRUE, FALSE), nrow(conversion_clean), replace = TRUE, and prob = c(0.6,0.4) as parameters.
- Using logical indexing, assign all rows where data_sample is TRUE to a variable called train.
- Using logical indexing, assign all rows where data_sample is FALSE to a variable called test.
Let’s fit a linear model of the relationship between the number of products sold and the number of clicks on the respective product advertisement. This means that conversion’s total_convert value, the total number of product purchases by a single user, will be our Y variable, or outcome variable; clicks, the total number of times a user clicks on a version of an ad, will be our X variable, or predictor variable.
Assign the result of calling lm(), using ~ formula notation to set a linear relationship between total_convert and clicks, to a variable called model. Don’t forget to set the data parameter equal to train!
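Once a model is fit on the training rows, the held-out test rows can be used to gauge how well it generalizes. The sketch below is one possible way to do that, not part of the exercise itself: it uses a made-up data frame (toy_conv) with the same column names, and root mean squared error is chosen here as an illustrative accuracy measure.

```r
set.seed(7)                                        # arbitrary seed
n <- 500
clicks <- rpois(n, lambda = 20)                    # made-up click counts
total_convert <- 1 + 0.1 * clicks + rnorm(n, sd = 1)   # made-up outcome
toy_conv <- data.frame(clicks = clicks, total_convert = total_convert)

# Approximate 60/40 split, as described above.
keep <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.6, 0.4))
toy_train <- toy_conv[keep, ]
toy_test  <- toy_conv[!keep, ]

# Fit on the training rows only.
toy_model <- lm(total_convert ~ clicks, data = toy_train)

# Predict for rows the model has not seen, then summarise the error.
preds <- predict(toy_model, newdata = toy_test)
rmse  <- sqrt(mean((toy_test$total_convert - preds)^2))
print(rmse)
```

A test-set error that is much larger than the training-set error would suggest the model does not generalize well beyond the data it was fit on.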