Linear regression is perhaps the most widely applied method in data science, but it relies on a strict set of assumptions about the relationship between the predictor and outcome variables. The most obvious (but crucial!) assumption is that this relationship is linear. One key condition follows from this assumption, and it must be checked for any variable we want to include before building a model:
The expected value of the outcome variable is a straight-line function of the predictor variable alone.
The best test for this relationship is quite straightforward: we can visualize the relationship between the predictor and outcome variables as a scatterplot. A linear relationship will resemble a straight line with a nonzero slope, like the relationship between spending on TV ads and the overall sales volume of the related product found in our advertising dataset.
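As a minimal sketch of this visual check, the snippet below builds a small synthetic data frame and draws a scatterplot with ggplot2. The data frame toy_ads, its simulated values, and the seed are invented for illustration and are not the lesson's advertising dataset. The sketch also assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Hypothetical stand-in for the advertising data: values are simulated,
# not taken from the lesson's dataset.
set.seed(42)
toy_ads <- data.frame(TV = runif(100, min = 0, max = 300))
toy_ads$Sales <- 5 + 0.05 * toy_ads$TV + rnorm(100, sd = 2)

# Scatterplot of the outcome (Sales) against the predictor (TV).
# A roughly straight, sloped band of points suggests a linear relationship.
linearity_plot <- ggplot(toy_ads, aes(x = TV, y = Sales)) +
  geom_point()

linearity_plot
```

If the points instead formed a curve, or a flat cloud with no visible slope, that would be a sign that a straight-line model may not fit the pair of variables well.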
We can also test for a linear relationship quantitatively by computing the correlation coefficient, which always lies between -1 and 1. A coefficient close to 0 (roughly between -0.20 and 0.20) suggests a weak linear relationship between two variables. A coefficient closer to 1 or -1 suggests a stronger linear relationship. In R, we can compute the correlation coefficient using the cor.test() function as follows:
coefficient <- cor.test(advertising$TV, advertising$Sales)
coefficient$estimate
# Output: 0.837
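To see how the coefficient separates a strong linear relationship from a weak one, here is a small sketch on simulated data. The vectors x, y_strong, and y_weak are hypothetical and exist only for this illustration. Exact values depend on the random seed, so none are shown.

```r
# Simulated data (hypothetical; not from the advertising dataset).
set.seed(1)
x <- 1:100
y_strong <- 2 * x + rnorm(100, sd = 5)  # outcome driven mostly by x
y_weak <- rnorm(100)                    # outcome generated independently of x

strong <- cor.test(x, y_strong)
weak <- cor.test(x, y_weak)

# $estimate holds the Pearson correlation coefficient (cor.test's default method).
strong$estimate  # expected to be close to 1
weak$estimate    # expected to be much closer to 0
```

The object returned by cor.test() also carries a p-value in $p.value, which can help judge whether a small coefficient is distinguishable from zero.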
Instructions
Load conversion.csv into the working environment using read.csv(). Save the result to a variable called conversion, and don’t forget to set the header parameter to TRUE!
A good statistical workflow always involves a thorough understanding of the data available to model and a qualitative analysis of the relevant variables. Use str() to print the structure of the dataset and the list of variable types. Which variables seem like possible predictors of purchase, or total_convert?
Use a combination of the base ggplot() function and geom_bar() to plot the distribution of the clicks variable, a measure of how many times a user clicked on an advertisement. Save the result to a variable called clicks_dist. Then call clicks_dist to display the plot.
Take a closer look at the clicks_dist visualization. What is the approximate range of the clicks variable? What seems like the most common value (otherwise called the mode) of clicks? Set clicks_mode equal to the approximate value of the clicks mode.
Assign the result of calling cor.test(), with conversion$total_convert and conversion$clicks as input parameters, to a variable called correlation. Print out correlation$estimate. Does the coefficient value suggest that the variables have a linear relationship?
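The workflow in these instructions can be sketched end to end on a toy file. Everything below is illustrative: the file is a temporary CSV written by the snippet itself and its values are invented, so the results say nothing about the real conversion.csv. The column names clicks and total_convert mirror the ones named above, and ggplot2 is assumed to be installed.

```r
library(ggplot2)

# Write a tiny, made-up CSV to a temporary file so the example is self-contained.
toy_path <- tempfile(fileext = ".csv")
toy <- data.frame(
  clicks = c(1, 2, 2, 3, 3, 3, 4, 5),
  total_convert = c(0, 1, 1, 1, 2, 2, 3, 3)
)
write.csv(toy, toy_path, row.names = FALSE)

# Load the file; header = TRUE treats the first row as column names.
conversion <- read.csv(toy_path, header = TRUE)

# Inspect the structure: column names, types, and the first few values.
str(conversion)

# Bar chart of how often each clicks value occurs.
clicks_dist <- ggplot(conversion, aes(x = clicks)) +
  geom_bar()
clicks_dist

# Numeric cross-check of what the chart shows.
clicks_range <- range(conversion$clicks)
clicks_mode <- as.numeric(names(which.max(table(conversion$clicks))))

# Correlation between the outcome and the candidate predictor.
correlation <- cor.test(conversion$total_convert, conversion$clicks)
correlation$estimate
```

On this toy data, clicks_range is 1 to 5 and clicks_mode is 3, matching the tallest bar in the chart. Computing the mode with table() and which.max() is one possible cross-check, not a replacement for reading the plot as the exercise asks.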