
While linear regression is perhaps the most widely applied method in data science, it relies on a strict set of assumptions about the relationship between the predictor and outcome variables. The most obvious (but crucial!) assumption is that the relationship between the predictor and the outcome is linear. From this assumption follows one key condition, which must be checked for any variable we want to include before building a model:

The expected value of the outcome variable is a straight-line function of the predictor variable alone. The best test for this relationship is quite straightforward: we can visualize the relationship between the predictor and outcome variables as a scatterplot. A linear relationship will resemble a straight line with a nonzero slope, like the relationship between spending on TV ads and the overall sales volume of the related product in our advertising dataset.

Figure: A linear relationship between two features
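The visual check described above can be sketched in a few lines of R. This is a minimal illustration on simulated data, not the lesson's advertising dataset: the data frame advertising_sim, its values, and the fitted line are all assumptions made for the example.

```r
# Hypothetical stand-in for the advertising data (values are simulated, not real).
set.seed(42)
tv <- runif(100, min = 0, max = 300)           # simulated TV ad spending
sales <- 7 + 0.05 * tv + rnorm(100, sd = 2)    # linear trend plus random noise
advertising_sim <- data.frame(TV = tv, Sales = sales)

# Scatterplot of the predictor (x-axis) against the outcome (y-axis).
plot(advertising_sim$TV, advertising_sim$Sales,
     xlab = "TV ad spending", ylab = "Sales",
     main = "Simulated linear relationship")

# Overlay a least-squares line to make the trend easier to judge by eye.
fit <- lm(Sales ~ TV, data = advertising_sim)
abline(fit, col = "blue")
```

If the points scatter evenly around the line with no obvious curve, a linear relationship is plausible. A fan shape or a bend in the cloud of points would suggest otherwise.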

We can also test for a linear relationship quantitatively by computing the correlation coefficient. The correlation coefficient always lies between -1 and 1. A coefficient close to 0 (roughly between -0.20 and 0.20) suggests a weak linear relationship between two variables, while a coefficient closer to 1 or -1 suggests a stronger linear relationship. In R, we can compute the correlation coefficient using the cor.test() function as follows:

coefficient <- cor.test(advertising$TV, advertising$Sales)
coefficient$estimate
# Output: 0.837
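To get a feel for how the coefficient behaves, the sketch below compares a strong and a weak relationship. It uses simulated vectors (x, y_strong, and y_weak are hypothetical names, not variables from the lesson's data), so the exact values carry no meaning beyond the illustration.

```r
# Simulated data: one outcome that depends on x, one that does not.
set.seed(1)
x <- rnorm(200)
y_strong <- 2 * x + rnorm(200, sd = 0.5)   # mostly driven by x
y_weak <- rnorm(200)                       # generated independently of x

# cor.test() defaults to the Pearson correlation and returns a list-like
# object; the coefficient itself is stored in the estimate component.
strong <- cor.test(x, y_strong)
weak <- cor.test(x, y_weak)

strong$estimate    # expected to be close to 1
weak$estimate      # expected to be close to 0

# The same object also carries a confidence interval and a p-value.
strong$conf.int
strong$p.value
```

If only the number is needed, cor(x, y_strong) returns the same coefficient without the accompanying test results.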

Instructions

1.

Load conversion.csv into the working environment using read.csv(). Save the result to a variable called conversion, and don’t forget to set the header parameter to TRUE!

2.

A good statistical workflow always involves a thorough understanding of the data available to model and a qualitative analysis of the relevant variables. Use str() to print the structure of the dataset and the type of each variable. Which variables seem like possible predictors of purchases, recorded in total_convert?

3.

Use a combination of the base ggplot() function and geom_bar() to plot the distribution of the clicks variable, a measure of how many times a user clicked on an advertisement. Save the result to a variable called clicks_dist, then call clicks_dist to display the plot.

4.

Take a closer look at the clicks_dist visualization. What is the approximate range of the clicks variable? What seems to be the most common value (also called the mode) of clicks? Set clicks_mode equal to the approximate value of the clicks mode.

5.

Assign the result of calling cor.test(), with conversion$total_convert and conversion$clicks as arguments, to a variable called correlation. Print out correlation$estimate. Does the coefficient value suggest that the variables have a linear relationship?
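The workflow in the steps above can be sketched end to end as follows. This is a rough illustration under stated assumptions, not the exercise solution: the real conversion.csv is not reproduced here, so the code writes a simulated file to a temporary path, and every name ending in _sim is hypothetical. The bar chart is drawn only if the ggplot2 package happens to be installed.

```r
# Hypothetical stand-in for conversion.csv (simulated values, temporary file).
set.seed(7)
clicks <- rpois(300, lambda = 3)
total_convert <- rpois(300, lambda = 1 + 0.5 * clicks)
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(clicks = clicks, total_convert = total_convert),
          csv_path, row.names = FALSE)

# Step 1: load the file, treating the first row as column names.
conversion_sim <- read.csv(csv_path, header = TRUE)

# Step 2: inspect the structure and variable types.
str(conversion_sim)

# Step 3: bar chart of the clicks distribution (skipped without ggplot2).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  clicks_dist_sim <- ggplot2::ggplot(conversion_sim, ggplot2::aes(x = clicks)) +
    ggplot2::geom_bar()
  print(clicks_dist_sim)
}

# Step 4: the mode can also be read off a frequency table.
click_counts <- table(conversion_sim$clicks)
clicks_mode_sim <- as.integer(names(click_counts)[which.max(click_counts)])

# Step 5: correlation between the outcome and the candidate predictor.
correlation_sim <- cor.test(conversion_sim$total_convert, conversion_sim$clicks)
correlation_sim$estimate
```

Reading the mode from table() is a cross-check on the visual estimate from the bar chart. With the real data, the two should agree approximately.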
